summaryrefslogtreecommitdiff
path: root/include/uapi
AgeCommit message (Collapse)Author
2024-07-03um: add shared memory optimisation for time-travel=extJohannes Berg
With external time travel, a LOT of message can end up being exchanged on the socket, taking a significant amount of time just to do that. Add a new shared memory optimisation to that, where a number of changes are made: - the controller sends a client ID and a shared memory FD (and a logging FD we don't use) in the ACK message to the initial START - the shared memory holds the current time and the free_until value, so that there's no need to exchange messages for that - if the client that's running has shared memory support, any client (the running one included) can request the next time it wants to run inside the shared memory, rather than sending a message, by also updating the free_until value - when shared memory is enabled, RUN/WAIT messages no longer have an ACK, further cutting down on messages Together, this can reduce the number of messages very significantly, and reduce overall test/simulation run time. Co-developed-by: Mordechay Goodstein <mordechay.goodstein@intel.com> Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com> Link: https://patch.msgid.link/20240702192118.6ad0a083f574.Ie41206c8ce4507fe26b991937f47e86c24ca7a31@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-03um: time-travel: support time-travel protocol broadcast messagesMordechay Goodstein
Add a message type to the time-travel protocol to broadcast a small (64-bit) value to all participants in a simulation. The main use case is to have an identical message come to all participants in a simulation, e.g. to separate out logs for different tests running in a single simulation. Down in the guts of time_travel_handle_message() we can't use printk() and not even printk_deferred(), so just store the message and print it at the start of the userspace() function. Unfortunately this means that other prints in the kernel can actually bypass the message, but in most cases where this is used, for example to separate test logs, userspace will be involved. Also, even if we could use printk_deferred(), we'd still need to flush it out in the userspace() function since otherwise userspace messages might cross it. As a result, this is a reasonable compromise, there's no need to have any core changes and it solves the main use case we have for it. Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com> Link: https://patch.msgid.link/20240702192118.c4093bc5b15e.I2ca8d006b67feeb866ac2017af7b741c9e06445a@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-03drm/panthor: Fix sync-only jobsBoris Brezillon
A sync-only job is meant to provide a synchronization point on a queue, so we can't return a NULL fence there, we have to add a signal operation to the command stream which executes after all other previously submitted jobs are done. v2: - Fixed a UAF bug - Added R-bs Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block") Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Liviu Dudau <liviu.dudau@arm.com> Reviewed-by: Steven Price <steven.price@arm.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240703071640.231278-3-boris.brezillon@collabora.com
2024-07-02ALSA: pcm: reinvent the stream synchronization ID APIJaroslav Kysela
Until the commit e11f0f90a626 ("ALSA: pcm: remove SNDRV_PCM_IOCTL1_INFO internal command"), there was a possibility to pass information about the synchronized streams to the user space. The mentioned commit removed blindly the appropriate code with an irrelevant comment. The revert may be appropriate, but since this API was lost for several years without any complains, it's time to improve it. The hardware parameters may change the used stream clock source (e.g. USB hardware) so move this synchronization ID to hw_params as read-only field. It seems that pipewire can benefit from this API (disable adaptive resampling for perfectly synchronized PCM streams) now. Note that the contents of ID is not supposed to be used for direct comparison with a specific byte sequence. The "empty" case is when all bytes are zero (driver does not offer this information) and all other cases must be only used for equal comparison among PCM streams (including different sound cards) if they are using identical hardware clock. Cc: Takashi Sakamoto <takaswie@kernel.org> Signed-off-by: Jaroslav Kysela <perex@perex.cz> Signed-off-by: Takashi Iwai <tiwai@suse.de> Link: https://patch.msgid.link/20240625172836.589380-2-perex@perex.cz
2024-07-01drm/amdgpu: handle gfx12 in amdgpu_display_verify_sizesMarek Olšák
It verified GFX9-11 swizzle modes on GFX12, which has undefined behavior. Signed-off-by: Marek Olšák <marek.olsak@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-01drm/amdgpu: remove AMD_FMT_MOD_GFX12_DCC_MAX_COMPRESSED_BLOCK_* definitionsMarek Olšák
They were added accidentally. Signed-off-by: Marek Olšák <marek.olsak@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-01RDMA/nldev: Add support to dump device type and parent device if existsMark Zhang
If a device has a specific type or a parent device, dump them as well. Example: $ rdma dev show smi1 3: smi1: node_type ca fw 20.38.1002 node_guid 9803:9b03:009f:d5ef sys_image_guid 9803:9b03:009f:d5ee type smi parent ibp8s0f1 Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/4c022e3e34b5de1254a3b367d502a362cdd0c53a.1718553901.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2024-07-01RDMA/nldev: Add support to add/delete a sub IB device through netlinkMark Zhang
Add new netlink commands and attributes to support adding and deleting a sub IB device with admin privilege. Examples: $ rdma dev add smi1 type SMI parent ibp8s0f1 $ rdma dev del smi1 Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/77cbf1b36359642be8a8d8c5c2f4e585b544282f.1718553901.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2024-07-01RDMA/core: Support IB sub device with type "SMI"Mark Zhang
This patch adds 2 APIs, as well as driver operations to support adding and deleting an IB sub device, which provides part of functionalities of it's parent. A sub device has a type; for a sub device with type "SMI", it provides the smi capability through umad for its parent, meaning uverb is not supported. A sub device cannot live without a parent. So when a parent is released, all it's sub devices are released as well. Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/44253f7508b21eb2caefea3980c2bc072869116c.1718553901.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2024-07-01Merge tag 'nf-next-24-06-28' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next into main Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next: Patch #1 to #11 to shrink memory consumption for transaction objects: struct nft_trans_chain { /* size: 120 (-32), cachelines: 2, members: 10 */ struct nft_trans_elem { /* size: 72 (-40), cachelines: 2, members: 4 */ struct nft_trans_flowtable { /* size: 80 (-48), cachelines: 2, members: 5 */ struct nft_trans_obj { /* size: 72 (-40), cachelines: 2, members: 4 */ struct nft_trans_rule { /* size: 80 (-32), cachelines: 2, members: 6 */ struct nft_trans_set { /* size: 96 (-24), cachelines: 2, members: 8 */ struct nft_trans_table { /* size: 56 (-40), cachelines: 1, members: 2 */ struct nft_trans_elem can now be allocated from kmalloc-96 instead of kmalloc-128 slab. Series from Florian Westphal. For the record, I have mangled patch #1 to add nft_trans_container_*() and use if for every transaction object. I have also added BUILD_BUG_ON to ensure struct nft_trans always comes at the beginning of the container transaction object. And few minor cleanups, any new bugs are of my own. Patch #12 simplify check for SCTP GSO in IPVS, from Ismael Luceno. Patch #13 nf_conncount key length remains in the u32 bound, from Yunjian Wang. Patch #14 removes unnecessary check for CTA_TIMEOUT_L3PROTO when setting default conntrack timeouts via nfnetlink_cttimeout API, from Lin Ma. Patch #15 updates NFT_SECMARK_CTX_MAXLEN to 4096, SELinux could use larger secctx names than the existing 256 bytes length. Patch #16 adds a selftest to exercise nfnetlink_queue listeners leaving nfnetlink_queue, from Florian Westphal. Patch #17 increases hitcount from 255 to 65535 in xt_recent, from Phil Sutter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-01tcp_metrics: add netlink protocol spec in YAMLJakub Kicinski
Add a protocol spec for tcp_metrics, so that it's accessible via YNL. Useful at the very least for testing fixes. In this episode of "10,000 ways to complicate netlink" the metric nest has defines which are off by 1. iproute2 does: struct rtattr *m[TCP_METRIC_MAX + 1 + 1]; parse_rtattr_nested(m, TCP_METRIC_MAX + 1, a); for (i = 0; i < TCP_METRIC_MAX + 1; i++) { // ... attr = m[i + 1]; This is too weird to support in YNL, add a new set of defines with _correct_ values to the official kernel header. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-01tcp_metrics: add UAPI to the header guardJakub Kicinski
tcp_metrics' header lacks the customary _UAPI in the header guard. This makes YNL build rules work less seamlessly. We can easily fix that on YNL side, but this could also be problematic if we ever needed to create a kernel-only tcp_metrics.h. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-30iio: core: Add new DMABUF interface infrastructurePaul Cercueil
Add the necessary infrastructure to the IIO core to support a new optional DMABUF based interface. With this new interface, DMABUF objects (externally created) can be attached to a IIO buffer, and subsequently used for data transfer. A userspace application can then use this interface to share DMABUF objects between several interfaces, allowing it to transfer data in a zero-copy fashion, for instance between IIO and the USB stack. The userspace application can also memory-map the DMABUF objects, and access the sample data directly. The advantage of doing this vs. the read() interface is that it avoids an extra copy of the data between the kernel and userspace. This is particularly userful for high-speed devices which produce several megabytes or even gigabytes of data per second. As part of the interface, 3 new IOCTLs have been added: IIO_BUFFER_DMABUF_ATTACH_IOCTL(int fd): Attach the DMABUF object identified by the given file descriptor to the buffer. IIO_BUFFER_DMABUF_DETACH_IOCTL(int fd): Detach the DMABUF object identified by the given file descriptor from the buffer. Note that closing the IIO buffer's file descriptor will automatically detach all previously attached DMABUF objects. IIO_BUFFER_DMABUF_ENQUEUE_IOCTL(struct iio_dmabuf *): Request a data transfer to/from the given DMABUF object. Its file descriptor, as well as the transfer size and flags are provided in the "iio_dmabuf" structure. These three IOCTLs have to be performed on the IIO buffer's file descriptor, obtained using the IIO_BUFFER_GET_FD_IOCTL() ioctl. Signed-off-by: Paul Cercueil <paul@crapouillou.net> Co-developed-by: Nuno Sa <nuno.sa@analog.com> Signed-off-by: Nuno Sa <nuno.sa@analog.com> Link: https://patch.msgid.link/20240620122726.41232-4-paul@crapouillou.net Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-06-28fs: export mount options via statmount()Josef Bacik
statmount() can export arbitrary strings, so utilize the __spare1 slot for a mnt_opts string pointer, and then support asking for and setting the mount options during statmount(). This calls into the helper for showing mount options, which already uses a seq_file, so fits in nicely with our existing mechanism for exporting strings via statmount(). Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/3aa6bf8bd5d0a21df9ebd63813af8ab532c18276.1719257716.git.josef@toxicpanda.com Reviewed-by: Jeff Layton <jlayton@kernel.org> [brauner: only call sb->s_op->show_options()] Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-28ethtool: Add an interface for flashing transceiver modules' firmwareDanielle Ratson
CMIS compliant modules such as QSFP-DD might be running a firmware that can be updated in a vendor-neutral way by exchanging messages between the host and the module as described in section 7.3.1 of revision 5.2 of the CMIS standard. Add a pair of new ethtool messages that allow: * User space to trigger firmware update of transceiver modules * The kernel to notify user space about the progress of the process The user interface is designed to be asynchronous in order to avoid RTNL being held for too long and to allow several modules to be updated simultaneously. The interface is designed with CMIS compliant modules in mind, but kept generic enough to accommodate future use cases, if these arise. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-28pidfs: allow retrieval of namespace file descriptorsChristian Brauner
For users that hold a reference to a pidfd procfs might not even be available nor is it desirable to parse through procfs just for the sake of getting namespace file descriptors for a process. Make it possible to directly retrieve namespace file descriptors from a pidfd. Pidfds already can be used with setns() to change a set of namespaces atomically. Link: https://lore.kernel.org/r/20240627-work-pidfs-v1-4-7e9ab6cc3bb1@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-28fs: add an ioctl to get the mnt ns id from nsfsJosef Bacik
In order to utilize the listmount() and statmount() extensions that allow us to call them on different namespaces we need a way to get the mnt namespace id from user space. Add an ioctl to nsfs that will allow us to extract the mnt namespace id in order to make these new extensions usable. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/180449959d5a756af7306d6bda55f41b9d53e3cb.1719243756.git.josef@toxicpanda.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-28fs: Allow listmount() in foreign mount namespaceChristian Brauner
Expand struct mnt_id_req to add an optional mnt_ns_id field. When this field is populated, listmount() will be performed on the specified mount namespace, provided the currently application has CAP_SYS_ADMIN in its user namespace and the mount namespace is a child of the current namespace. Co-developed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/49930bdce29a8367a213eb14c1e68e7e49284f86.1719243756.git.josef@toxicpanda.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-28fs: export the mount ns id via statmountJosef Bacik
In order to allow users to iterate through children mount namespaces via listmount we need a way for them to know what the ns id for the mount. Add a new field to statmount called mnt_ns_id which will carry the ns id for the given mount entry. Co-developed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/6dabf437331fb7415d886f7c64b21cb2a50b1c66.1719243756.git.josef@toxicpanda.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-28listmount: allow listing in reverse orderChristian Brauner
util-linux is about to implement listmount() and statmount() support. Karel requested the ability to scan the mount table in backwards order because that's what libmount currently does in order to get the latest mount first. We currently don't support this in listmount(). Add a new LISTMOUNT_REVERSE flag to allow listing mounts in reverse order. For example, listing all child mounts of /sys without LISTMOUNT_REVERSE gives: /sys/kernel/security @ mnt_id: 4294968369 /sys/fs/cgroup @ mnt_id: 4294968370 /sys/firmware/efi/efivars @ mnt_id: 4294968371 /sys/fs/bpf @ mnt_id: 4294968372 /sys/kernel/tracing @ mnt_id: 4294968373 /sys/kernel/debug @ mnt_id: 4294968374 /sys/fs/fuse/connections @ mnt_id: 4294968375 /sys/kernel/config @ mnt_id: 4294968376 whereas with LISTMOUNT_REVERSE it gives: /sys/kernel/config @ mnt_id: 4294968376 /sys/fs/fuse/connections @ mnt_id: 4294968375 /sys/kernel/debug @ mnt_id: 4294968374 /sys/kernel/tracing @ mnt_id: 4294968373 /sys/fs/bpf @ mnt_id: 4294968372 /sys/firmware/efi/efivars @ mnt_id: 4294968371 /sys/fs/cgroup @ mnt_id: 4294968370 /sys/kernel/security @ mnt_id: 4294968369 Link: https://lore.kernel.org/r/20240607-vfs-listmount-reverse-v1-4-7877a2bfa5e5@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-27Merge tag 'wireless-next-2024-06-27' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== Highlights this time are: - cfg80211/nl80211: * improvements for 6 GHz regulatory flexibility - mac80211: * use generic netdev stats * multi-link improvements/fixes - brcmfmac: * MFP support (to enable WPA3) - wilc1000: * suspend/resume improvements - iwlwifi: * remove support for older FW for new devices * fast resume (keeping the device configured) - wl18xx: * support newer firmware versions * tag 'wireless-next-2024-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (100 commits) wifi: brcmfmac: of: Support interrupts-extended wifi: brcmsmac: advertise MFP_CAPABLE to enable WPA3 net: rfkill: Correct return value in invalid parameter case wifi: mac80211: fix NULL dereference at band check in starting tx ba session wifi: iwlwifi: mvm: fix rs.h kernel-doc wifi: iwlwifi: fw: api: datapath: fix kernel-doc wifi: iwlwifi: fix remaining mistagged kernel-doc comments wifi: iwlwifi: fix prototype mismatch kernel-doc warnings wifi: iwlwifi: fix kernel-doc in iwl-fh.h wifi: iwlwifi: fix kernel-doc in iwl-trans.h wifi: iwlwifi: pcie: fix kernel-doc wifi: iwlwifi: dvm: fix kernel-doc warnings wifi: iwlwifi: mvm: don't log error for failed UATS table read wifi: iwlwifi: trans: make bad state warnings wifi: iwlwifi: fw: api: fix some kernel-doc wifi: iwlwifi: mvm: remove init_dbg module parameter wifi: iwlwifi: update the BA notification API wifi: iwlwifi: mvm: always unblock EMLSR on ROC end wifi: iwlwifi: mvm: use IWL_FW_CHECK for link ID check wifi: iwlwifi: mvm: don't flush BSSes on restart with MLD API ... ==================== Link: https://patch.msgid.link/20240627114135.28507-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-28Merge tag 'drm-xe-next-2024-06-26' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/xe/kernel into drm-next UAPI Changes: - New uapi adding OA functionality to Xe (Ashutosh) Cross-subsystem Changes: - devcoredump: Add dev_coredumpm_timeout (Jose) Driver Changes: - More SRIOV preparation, including GuC communication improvements (Michal) - Kconfig update: do not select ACPI_BUTTON (Jani) - Rework GPU page fault handling (Brost) - Forcewake clean-up and fixes (Himal, Michal) - Drop EXEC_QUEUE_FLAG_BANNED (Brost) - Xe/Xe2 Workarounds fixes and additions (Tejas, Akshata, Sai, Vinay) - Xe devcoredump changes (Jose) - Tracing cleanup and add mmio tracing (RK) - Add BMG PCI IDs (Roper) - Scheduler fixes and improvements (Brost) - Some overal driver clean-up around headers and print macros (Michal) - Rename xe_exec_queue::compute to xe_exec_queue::lr (Francois) - Improve RTP rules to allow easier 'OR' conditions in WA declaration (Lucas) - Use ttm_uncached for BO with NEEDS_UC flag (Michal) - Other OA related work and fixes (Ashutosh, Michal, Jose) - Simplify locking in new_vma (Brost) - Remove xe_irq_shutdown (Ilia) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/ZnyW9RdC_aWSla_q@intel.com
2024-06-27RDMA/mlx5: Send UAR page index as ioctl attributeAkiva Goldberger
Add UAR page index as a driver ioctl attribute to increase the number of supported indices, previously limited to 16 bits by mlx5_ib_create_cq struct. Link: https://lore.kernel.org/r/0e18b34d7ec3b1ae02d694b0d545aed7413c0ef7.1719512393.git.leon@kernel.org Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-27RDMA: Pass entire uverbs attr bundle to create cq functionAkiva Goldberger
Changes the create_cq verb signature by sending the entire uverbs attr bundle as a parameter. This allows drivers to send driver specific attrs through ioctl for the create_cq verb and access them in their driver specific code. Also adds a new enum value for driver specific ioctl attributes for methods already supporting UHW. Link: https://lore.kernel.org/r/ed147343987c0d43fd391c1b2f85e2f425747387.1719512393.git.leon@kernel.org Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-27media: uapi: Add PiSP Compressed RAW Bayer formatsJacopo Mondi
Add Raspberry Pi compressed RAW Bayer formats. The compression algorithm description is provided by Nick Hollinghurst <nick.hollinghurst@raspberrypi.com> from Raspberry Pi. Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> Reviewed-by: Naushir Patuck <naush@raspberrypi.com> Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
2024-06-27media: uapi: Add meta pixel format for PiSP BE configJacopo Mondi
Add format description for the PiSP Back End configuration parameter buffer. Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> Reviewed-by: Naushir Patuck <naush@raspberrypi.com> Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
2024-06-27media: uapi: Add Raspberry Pi PiSP Back End uAPIJacopo Mondi
Add the Raspberry Pi PiSP Back End uAPI header. The header defines the data type used to configure the PiSP Back End ISP. The detailed description of the types and of the ISP configuration procedure is available at https://datasheets.raspberrypi.com/camera/raspberry-pi-image-signal-processor-specification.pdf Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
2024-06-27media: uapi: Add a pixel format for BGR48 and RGB48Jacopo Mondi
Add BGR48 and RGB48 16-bit per component image formats. Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> Reviewed-by: Kieran Bingham <kieran.bingham@ideasonboard.com> Reviewed-by: Naushir Patuck <naush@raspberrypi.com> Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
2024-06-27Merge tag 'drm-intel-gt-next-2024-06-12' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/i915/kernel into drm-next UAPI Changes: - Support replaying GPU hangs with captured context image (Tvrtko Ursulin) Driver Changes: Fixes/improvements/new stuff: - Automate CCS Mode setting during engine resets [gt] (Andi Shyti) - Revert "drm/i915: Remove extra multi-gt pm-references" (Janusz Krzysztofik) - Fix HAS_REGION() usage in intel_gt_probe_lmem() (Ville Syrjälä) - Disarm breadcrumbs if engines are already idle [gt] (Chris Wilson) - Shadow default engine context image in the context (Tvrtko Ursulin) - Support replaying GPU hangs with captured context image (Tvrtko Ursulin) - avoid FIELD_PREP warning [guc] (Arnd Bergmann) - Fix CCS id's calculation for CCS mode setting [gt] (Andi Shyti) - Increase FLR timeout from 3s to 9s (Andi Shyti) - Update workaround 14018575942 [mtl] (Angus Chen) Future platform enablement: - Enable w/a 16021333562 for DG2, MTL and ARL [guc] (John Harrison) Miscellaneous: - Pass the region ID rather than a bitmask to HAS_REGION() (Ville Syrjälä) - Remove counter productive REGION_* wrappers (Ville Syrjälä) - Fix typo [gem/i915_gem_ttm_move] (Deming Wang) - Delete the live_hearbeat_fast selftest [gt] (Krzysztof Niemiec) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Tvrtko Ursulin <tursulin@igalia.com> Link: https://patchwork.freedesktop.org/patch/msgid/Zmmazub+U9ewH9ts@linux
2024-06-26scsi: mpi3mr: Add ioctl support for HDBRanjan Kumar
Add interface for applications to manage the host diagnostic buffers and update the automatic diag buffer capture triggers. Co-developed-by: Sathya Prakash <sathya.prakash@broadcom.com> Signed-off-by: Sathya Prakash <sathya.prakash@broadcom.com> Signed-off-by: Ranjan Kumar <ranjan.kumar@broadcom.com> Link: https://lore.kernel.org/r/20240626102646.14298-4-ranjan.kumar@broadcom.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2024-06-26drm/xe/oa/uapi: Allow preemption to be disabled on the stream exec queueAshutosh Dixit
Mesa VK_KHR_performance_query use case requires preemption and timeslicing to be disabled for the stream exec queue. Implement this functionality here. v2: Minor change to debug print to print both ret values (Umesh) Acked-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240626181817.1516229-3-ashutosh.dixit@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-06-26drm/xe/oa: Fix kernel doc in xe_drm.hAshutosh Dixit
Fix kernel doc in xe_drm.h. Also eliminate private/non-abi enum definitions. v2: Remove __DRM_XE_PERF_TYPE_MAX since it is unused (Michal) v3: Also remove DRM_XE_OA_PROPERTY_MAX since it can also be eliminated (Michal) Suggested-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240623203119.3840283-1-ashutosh.dixit@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-06-26Merge drm/drm-next into drm-xe-nextRodrigo Vivi
Need to sync some header include that propagated through drm-intel-next. v2: After some changes in drm/drm-next Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-06-26xfrm: support sending NAT keepalives in ESP in UDP statesEyal Birger
Add the ability to send out RFC-3948 NAT keepalives from the xfrm stack. To use, Userspace sets an XFRM_NAT_KEEPALIVE_INTERVAL integer property when creating XFRM outbound states which denotes the number of seconds between keepalive messages. Keepalive messages are sent from a per net delayed work which iterates over the xfrm states. The logic is guarded by the xfrm state spinlock due to the xfrm state walk iterator. Possible future enhancements: - Adding counters to keep track of sent keepalives. - deduplicate NAT keepalives between states sharing the same nat keepalive parameters. - provisioning hardware offloads for devices capable of implementing this. - revise xfrm state list to use an rcu list in order to avoid running this under spinlock. Suggested-by: Paul Wouters <paul.wouters@aiven.io> Tested-by: Paul Wouters <paul.wouters@aiven.io> Tested-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-06-25ethtool: provide customized dim profile managementHeng Qi
The NetDIM library, currently leveraged by an array of NICs, delivers excellent acceleration benefits. Nevertheless, NICs vary significantly in their dim profile list prerequisites. Specifically, virtio-net backends may present diverse sw or hw device implementation, making a one-size-fits-all parameter list impractical. On Alibaba Cloud, the virtio DPU's performance under the default DIM profile falls short of expectations, partly due to a mismatch in parameter configuration. I also noticed that ice/idpf/ena and other NICs have customized profilelist or placed some restrictions on dim capabilities. Motivated by this, I tried adding new params for "ethtool -C" that provides a per-device control to modify and access a device's interrupt parameters. Usage ======== The target NIC is named ethx. Assume that ethx only declares support for rx profile setting (with DIM_PROFILE_RX flag set in profile_flags) and supports modification of usec and pkt fields. 1. Query the currently customized list of the device $ ethtool -c ethx ... rx-profile: {.usec = 1, .pkts = 256, .comps = n/a,}, {.usec = 8, .pkts = 256, .comps = n/a,}, {.usec = 64, .pkts = 256, .comps = n/a,}, {.usec = 128, .pkts = 256, .comps = n/a,}, {.usec = 256, .pkts = 256, .comps = n/a,} tx-profile: n/a 2. Tune $ ethtool -C ethx rx-profile 1,1,n_2,n,n_3,3,n_4,4,n_n,5,n "n" means do not modify this field. $ ethtool -c ethx ... rx-profile: {.usec = 1, .pkts = 1, .comps = n/a,}, {.usec = 2, .pkts = 256, .comps = n/a,}, {.usec = 3, .pkts = 3, .comps = n/a,}, {.usec = 4, .pkts = 4, .comps = n/a,}, {.usec = 256, .pkts = 5, .comps = n/a,} tx-profile: n/a 3. Hint If the device does not support some type of customized dim profiles, the corresponding "n/a" will display. If the "n/a" field is being modified, -EOPNOTSUPP will be reported. Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240621101353.107425-4-hengqi@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-26netfilter: nf_tables: rise cap on SELinux secmark contextPablo Neira Ayuso
secmark context is artificially limited 256 bytes, rise it to 4Kbytes. Fixes: fb961945457f ("netfilter: nf_tables: add SECMARK support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-06-25nsfs: add pid translation ioctlsChristian Brauner
Add ioctl()s to translate pids between pid namespaces. LXCFS is a tiny fuse filesystem used to virtualize various aspects of procfs. LXCFS is run on the host. The files and directories it creates can be bind-mounted by e.g. a container at startup and mounted over the various procfs files the container wishes to have virtualized. When e.g. a read request for uptime is received, LXCFS will receive the pid of the reader. In order to virtualize the corresponding read, LXCFS needs to know the pid of the init process of the reader's pid namespace. In order to do this, LXCFS first needs to fork() two helper processes. The first helper process setns() to the readers pid namespace. The second helper process is needed to create a process that is a proper member of the pid namespace. The second helper process then creates a ucred message with ucred.pid set to 1 and sends it back to LXCFS. The kernel will translate the ucred.pid field to the corresponding pid number in LXCFS's pid namespace. This way LXCFS can learn the init pid number of the reader's pid namespace and can go on to virtualize. Since these two forks() are costly LXCFS maintains an init pid cache that caches a given pid for a fixed amount of time. The cache is pruned during new read requests. However, even with the cache the hit of the two forks() is singificant when a very large number of containers are running. With this simple patch we add an ns ioctl that let's a caller retrieve the init pid nr of a pid namespace through its pid namespace fd. This significantly improves performance with a very simple change. Support translation of pids and tgids. Other concepts can be added but there are no obvious users for this right now. To protect against races pidfds can be used to check whether the process is still valid. If needed, this can also be extended to work on pidfds directly. Link: https://lore.kernel.org/r/20240619-work-ns_ioctl-v1-1-7c0097e6bb6b@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-25syscalls: fix compat_sys_io_pgetevents_time64 usageArnd Bergmann
Using sys_io_pgetevents() as the entry point for compat mode tasks works almost correctly, but misses the sign extension for the min_nr and nr arguments. This was addressed on parisc by switching to compat_sys_io_pgetevents_time64() in commit 6431e92fc827 ("parisc: io_pgetevents_time64() needs compat syscall in 32-bit compat mode"), as well as by using more sophisticated system call wrappers on x86 and s390. However, arm64, mips, powerpc, sparc and riscv still have the same bug. Change all of them over to use compat_sys_io_pgetevents_time64() like parisc already does. This was clearly the intention when the function was originally added, but it got hooked up incorrectly in the tables. Cc: stable@vger.kernel.org Fixes: 48166e6ea47d ("y2038: add 64-bit time_t syscalls to all 32-bit architectures") Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-06-21drm/msm: Add MSM_PARAM_RAYTRACING uapiConnor Abbott
Expose the value of the software fuse to userspace. Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Signed-off-by: Connor Abbott <cwabbott0@gmail.com> Patchwork: https://patchwork.freedesktop.org/patch/592044/ Signed-off-by: Rob Clark <robdclark@chromium.org>
2024-06-21Merge tag 'drm-misc-next-2024-06-06' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for 6.10: UAPI Changes: Cross-subsystem Changes: - dma-buf: Warn when reserving 0 fence slots, internal API enhancements for heaps Core Changes: Driver Changes: - atmel-hlcdc: Support XLCDC in sam9x7 - msm: Validate registers XML description against schema in CI - v3d: Fix build warning - bridges: - analogix_dp: Various improvements - panels: - New panel: WL-355608-A8 Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240606-vivid-amphibian-jackrabbit-40b1d1@houat
2024-06-21Merge tag 'drm-misc-next-2024-05-30' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for 6.11: UAPI Changes: - Deprecate DRM date and return a 0 date in DRM_IOCTL_VERSION Core Changes: - connector: Create a set of helpers to help with HDMI support - fbdev: Create memory manager optimized fbdev emulation - panic: Allow to select fonts, improve drm_fb_dma_get_scanout_buffer Driver Changes: - Remove driver owner assignments - Allow more drivers to compile with COMPILE_TEST - Conversions to drm_edid - ivpu: hardware scheduler support, profiling support, improvements to the platform support layer - mgag200: general reworks and improvements - nouveau: Add NVreg_RegistryDwords command line option - rockchip: Conversion to the hdmi helpers - sun4i: Conversion to the hdmi helpers - vc4: Conversion to the hdmi helpers - v3d: Perf counters improvements - zynqmp: IRQ and debugfs improvements - bridge: - Remove redundant checks on bridge->encoder - panels: - Switch panels from register table initialization to proper code - Now that the panel code tracks the panel state, remove every ad-hoc implementation in the panel drivers - New panels: Lincoln Tech Sol LCD185-101CT, Microtips Technology 13-101HIEBCAF0-C, Microtips Technology MF-103HIEB0GA0, BOE nv110wum-l60, IVO t109nw41 Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240530-hilarious-flat-magpie-5fa186@houat
2024-06-20fs: Add initial atomic write support info to statxPrasad Singamsetty
Extend statx system call to return additional info for atomic write support support for a file. Helper function generic_fill_statx_atomic_writes() can be used by FSes to fill in the relevant statx fields. For now atomic_write_segments_max will always be 1, otherwise some rules would need to be imposed on iovec length and alignment, which we don't want now. Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com> jpg: relocate bdev support to another patch Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-5-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20fs: Initial atomic write supportPrasad Singamsetty
An atomic write is a write issued with torn-write protection, meaning that for a power failure or any other hardware failure, all or none of the data from the write will be stored, but never a mix of old and new data. Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the write is to be issued with torn-write prevention, according to special alignment and length rules. For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for iocb->ki_flags field to indicate the same. A call to statx will give the relevant atomic write info for a file: - atomic_write_unit_min - atomic_write_unit_max - atomic_write_segments_max Both min and max values must be a power-of-2. Applications can avail of atomic write feature by ensuring that the total length of a write is a power-of-2 in size and also sized between atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications must ensure that the write is at a naturally-aligned offset in the file wrt the total write length. The value in atomic_write_segments_max indicates the upper limit for IOV_ITER iovcnt. Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the flag set will have RWF_ATOMIC rejected and not just ignored. Add a type argument to kiocb_set_rw_flags() to allows reads which have RWF_ATOMIC set to be rejected. Helper function generic_atomic_write_valid() can be used by FSes to verify compliant writes. There we check for iov_iter type is for ubuf, which implies iovcnt==1 for pwritev2(), which is an initial restriction for atomic_write_segments_max. Initially the only user will be bdev file operations write handler. We will rely on the block BIO submission path to ensure write sizes are compliant for the bdev, so we don't need to check atomic writes sizes yet. Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com> jpg: merge into single patch and much rewrite Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20can: isotp: remove ISO 15675-2 specification version where possibleOliver Hartkopp
With the new ISO 15765-2:2024 release the former documentation and comments have to be reworked. This patch removes the ISO specification version/date where possible. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Acked-by: Francesco Valla <valla.francesco@gmail.com> Link: https://lore.kernel.org/all/20240420194746.4885-1-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2024-06-19io_uring: Introduce IORING_OP_LISTENGabriel Krisman Bertazi
IORING_OP_LISTEN provides the semantic of listen(2) via io_uring. While this is an essentially synchronous system call, the main point is to enable a network path to execute fully with io_uring registered and descriptorless files. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240614163047.31581-4-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19io_uring: Introduce IORING_OP_BINDGabriel Krisman Bertazi
IORING_OP_BIND provides the semantic of bind(2) via io_uring. While this is an essentially synchronous system call, the main point is to enable a network path to execute fully with io_uring registered and descriptorless files. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240614163047.31581-3-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-18sched_ext: Implement BPF extensible scheduler classTejun Heo
Implement a new scheduler class sched_ext (SCX), which allows scheduling policies to be implemented as BPF programs to achieve the following: 1. Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies. 2. Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers. 3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling policies in production environments. sched_ext leverages BPF’s struct_ops feature to define a structure which exports function callbacks and flags to BPF programs that wish to implement scheduling policies. The struct_ops structure exported by sched_ext is struct sched_ext_ops, and is conceptually similar to struct sched_class. The role of sched_ext is to map the complex sched_class callbacks to the more simple and ergonomic struct sched_ext_ops callbacks. For more detailed discussion on the motivations and overview, please refer to the cover letter. Later patches will also add several example schedulers and documentation. This patch implements the minimum core framework to enable implementation of BPF schedulers. Subsequent patches will gradually add functionalities including safety guarantee mechanisms, nohz and cgroup support. include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on top, each operation should be self-explanatory. The followings are worth noting: - Both "sched_ext" and its shorthand "scx" are used. If the identifier already has "sched" in it, "ext" is used; otherwise, "scx". - In sched_ext_ops, only .name is mandatory. Every operation is optional and if omitted a simple but functional default behavior is provided. - A new policy constant SCHED_EXT is added and a task can select sched_ext by invoking sched_setscheduler(2) with the new policy constant. However, if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL and the task is scheduled by CFS. When the BPF scheduler is loaded, all tasks which have the SCHED_EXT policy are switched to sched_ext. - To bridge the workflow imbalance between the scheduler core and sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and need not be used by a scheduler that doesn't require it. SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting the next task on the CPU. The BPF scheduler can manage an arbitrary number of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq(). - sched_ext guarantees system integrity no matter what the BPF scheduler does. To enable this, each task's ownership is tracked through p->scx.ops_state and all tasks are put on scx_tasks list. The disable path can always recover and revert all tasks back to CFS. See p->scx.ops_state and scx_tasks. - A task is not tied to its rq while enqueued. This decouples CPU selection from queueing and allows sharing a scheduling queue across an arbitrary subset of CPUs. This adds some complexities as a task may need to be bounced between rq's right before it starts executing. See dispatch_to_local_dsq() and move_task_to_local_dsq(). - One complication that arises from the above weak association between task and rq is that synchronizing with dequeue() gets complicated as dequeue() may happen anytime while the task is enqueued and the dispatch path might need to release the rq lock to transfer the task. Solving this requires a bit of complexity. See the logic around p->scx.sticky_cpu and p->scx.ops_qseq. - Both enable and disable paths are a bit complicated. The enable path switches all tasks without blocking to avoid issues which can arise from partially switched states (e.g. the switching task itself being starved). The disable path can't trust the BPF scheduler at all, so it also has to guarantee forward progress without blocking. See scx_ops_enable() and scx_ops_disable_workfn(). - When sched_ext is disabled, static_branches are used to shut down the entry points from hot paths. v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab scx_ops_enable_mutex which can lead to deadlocks in the disable path. Fixed. - Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead to use-after-free. - Consolidated per-cpu variable usages and other cleanups. v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple groups can be expressed. Later CPU hotplug operations are put into their own group. - SCX_OPS_DISABLING state is replaced with the new bypass mechanism which allows temporarily putting the system into simple FIFO scheduling mode bypassing the BPF scheduler. In addition to the shut down path, this will also be used to isolate the BPF scheduler across PM events. Enabling and disabling the bypass mode requires iterating all runnable tasks. rq->scx.runnable_list addition is moved from the later watchdog patch. - ops.prep_enable() is replaced with ops.init_task() and ops.enable/disable() are now called whenever the task enters and leaves sched_ext instead of when the task becomes schedulable on sched_ext and stops being so. A new operation - ops.exit_task() - is called when the task stops being schedulable on sched_ext. - scx_bpf_dispatch() can now be called from ops.select_cpu() too. This removes the need for communicating local dispatch decision made by ops.select_cpu() to ops.enqueue() via per-task storage. SCX_KF_SELECT_CPU is added to support the change. - SCX_TASK_ENQ_LOCAL which told the BPF scheudler that scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly if it finds a suitable idle CPU. If such behavior is not desired, users can use scx_bpf_select_cpu_dfl() which returns the verdict in a bool out param. - scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up queueing many tasks on a local DSQ which makes tasks to execute in order while other CPUs stay idle which made some hackbench numbers really bad. Fixed. - The current state of sched_ext can now be monitored through files under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is to enable monitoring on kernels which don't enable debugfs. - sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may be NULL and a BPF scheduler which derefs the pointer without checking could crash the kernel. Tell BPF. This is currently a bit ugly. A better way to annotate this is expected in the future. - scx_exit_info updated to carry pointers to message buffers instead of embedding them directly. This decouples buffer sizes from API so that they can be changed without breaking compatibility. - exit_code added to scx_exit_info. This is used to indicate different exit conditions on non-error exits and will be used to handle e.g. CPU hotplugs. - The patch "sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext" is folded in and the interface is changed so that partial switching is indicated with a new ops flag %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessasry and in turn SCX_KF_INIT. ops.init() is now called with SCX_KF_SLEEPABLE. - Code reorganized so that only the parts necessary to integrate with the rest of the kernel are in the header files. - Changes to reflect the BPF and other kernel changes including the addition of bpf_sched_ext_ops.cfi_stubs. v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t instead of atomic64_t and scx_dsp_buf_ent.qseq which uses load_acquire/store_release is now unsigned long instead of u64. - Fix the bug where bpf_scx_btf_struct_access() was allowing write access to arbitrary fields. - Distinguish kfuncs which can be called from any sched_ext ops and from anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from sched_ext ops. - Rename "type" to "kind" in scx_exit_info to make it easier to use on languages in which "type" is a reserved keyword. - Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup"), PF_IDLE is not set on idle tasks which haven't been online yet which made scx_task_iter_next_filtered() include those idle tasks in iterations leading to oopses. Update scx_task_iter_next_filtered() to directly test p->sched_class against idle_sched_class instead of using is_idle_task() which tests PF_IDLE. - Other updates to match upstream changes such as adding const to set_cpumask() param and renaming check_preempt_curr() to wakeup_preempt(). v4: - SCHED_CHANGE_BLOCK replaced with the previous sched_deq_and_put_task()/sched_enq_and_set_tsak() pair. This is because upstream is adaopting a different generic cleanup mechanism. Once that lands, the code will be adapted accordingly. - task_on_scx() used to test whether a task should be switched into SCX, which is confusing. Renamed to task_should_scx(). task_on_scx() now tests whether a task is currently on SCX. - scx_has_idle_cpus is barely used anymore and replaced with direct check on the idle cpumask. - SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer fully idle cores. - ops.enable() now sees up-to-date p->scx.weight value. - ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF schedulers expecting ->select_cpu() call. - Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest of the scheduler. v3: - ops.set_weight() added to allow BPF schedulers to track weight changes without polling p->scx.weight. - move_task_to_local_dsq() was losing SCX-specific enq_flags when enqueueing the task on the target dsq because it goes through activate_task() which loses the upper 32bit of the flags. Carry the flags through rq->scx.extra_enq_flags. - scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running() and scx_bpf_task_cpu() now use the new KF_RCU instead of KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them. - The kfunc helper access control mechanism implemented through sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always used when invoking scx_ops operations. v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is called from put_prev_taks_scx() and pick_next_task_scx() as necessary. To determine whether balance_scx() should be called from put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the comment in put_prev_task_scx() for details. - sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced with SCHED_CHANGE_BLOCK(). - Unused all_dsqs list removed. This was a left-over from previous iterations. - p->scx.kf_mask is added to track and enforce which kfunc helpers are allowed. Also, init/exit sequences are updated to make some kfuncs always safe to call regardless of the current BPF scheduler state. Combined, this should make all the kfuncs safe. - BPF now supports sleepable struct_ops operations. Hacky workaround removed and operations and kfunc helpers are tagged appropriately. - BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask() and friends are added so that BPF schedulers can use the idle masks with the generic helpers. This replaces the hacky kfunc helpers added by a separate patch in V1. - CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is enabled. This restriction will be removed by a later patch which adds core-sched support. - Add MAINTAINERS entries and other misc changes. Signed-off-by: Tejun Heo <tj@kernel.org> Co-authored-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Andrea Righi <andrea.righi@canonical.com>
2024-06-18drm/xe/oa/uapi: Query OA unit propertiesAshutosh Dixit
Implement query for properties of OA units present on a device. v2: Clean up reserved/pad fields (Umesh) Follow the same scheme as other query structs v3: Skip reporting reserved engines attached to OA units v4: Expose oa_buf_size via DRM_XE_PERF_IOCTL_INFO (Umesh) v5: Don't expose capabilities as OR of properties (Umesh) v6: Add extensions to query output structs: drm_xe_oa_unit, drm_xe_query_oa_units and drm_xe_oa_stream_info v7: Change oa_units[] array to __u64 type Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240618014609.3233427-13-ashutosh.dixit@intel.com
2024-06-18drm/xe/oa/uapi: Read file_operationAshutosh Dixit
Implement the OA stream read file_operation. Both blocking and non-blocking reads are supported. As part of read system call, the read copies OA perf data from the OA buffer to the user buffer, after appending packet headers for status and data packets. v2: Drop OA report headers, implement DRM_XE_PERF_IOCTL_STATUS (Umesh) v3: Introduce 'struct drm_xe_oa_stream_status' v4: Define oa_status register bitfields (Umesh) v5: Add extensions to 'struct drm_xe_oa_stream_status' v6: Minor cleanup, eliminate report32 variable v7: Use -EIO to signal to userspace to read OASTATUS using DRM_XE_PERF_IOCTL_STATUS, change previous sites returning -EIO to return -EINVAL Make drm_xe_oa_stream_status bits contiguous (Jose, Umesh) rmw oa_status bits (Umesh) Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Acked-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240618014609.3233427-10-ashutosh.dixit@intel.com
2024-06-18drm/xe/oa/uapi: Expose OA stream fdAshutosh Dixit
The OA stream open perf op returns an fd with its own file_operations for the newly initialized OA stream. These file_operations allow userspace to enable or disable the stream, as well as apply a different metric configuration for the OA stream. Userspace can also poll for data availability. OA stream initialization is completed in this commit by enabling the OA stream. When sampling is enabled this starts a hrtimer which periodically checks for data availablility. v2: Use stream properties for stream reconfiguration with DRM_XE_PERF_IOCTL_CONFIG v3: Hold runtime_pm reference across oa buffer alloc/free v4: Fix 32 bit build Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Acked-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240618014609.3233427-9-ashutosh.dixit@intel.com