summaryrefslogtreecommitdiff
path: root/include/uapi/linux
AgeCommit message (Collapse)Author
2025-02-20bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callbackJason Xing
Support sw SCM_TSTAMP_SND case for bpf timestamping. Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SND_SW_CB. This callback will occur at the same timestamping point as the user space's software SCM_TSTAMP_SND. The BPF program can use it to get the same SCM_TSTAMP_SND timestamp without modifying the user-space application. Based on this patch, BPF program will get the software timestamp when the driver is ready to send the skb. In the sebsequent patch, the hardware timestamp will be supported. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250220072940.99994-8-kerneljasonxing@gmail.com
2025-02-20bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callbackJason Xing
Support SCM_TSTAMP_SCHED case for bpf timestamping. Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SCHED_CB. This callback will occur at the same timestamping point as the user space's SCM_TSTAMP_SCHED. The BPF program can use it to get the same SCM_TSTAMP_SCHED timestamp without modifying the user-space application. A new SKBTX_BPF flag is added to mark skb_shinfo(skb)->tx_flags, ensuring that the new BPF timestamping and the current user space's SO_TIMESTAMPING do not interfere with each other. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250220072940.99994-7-kerneljasonxing@gmail.com
2025-02-20bpf: Add networking timestamping support to bpf_get/setsockopt()Jason Xing
The new SK_BPF_CB_FLAGS and new SK_BPF_CB_TX_TIMESTAMPING are added to bpf_get/setsockopt. The later patches will implement the BPF networking timestamping. The BPF program will use bpf_setsockopt(SK_BPF_CB_FLAGS, SK_BPF_CB_TX_TIMESTAMPING) to enable the BPF networking timestamping on a socket. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250220072940.99994-2-kerneljasonxing@gmail.com
2025-02-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc4). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20io_uring/epoll: add support for IORING_OP_EPOLL_WAITJens Axboe
For existing epoll event loops that can't fully convert to io_uring, the used approach is usually to add the io_uring fd to the epoll instance and use epoll_wait() to wait on both "legacy" and io_uring events. While this work, it isn't optimal as: 1) epoll_wait() is pretty limited in what it can do. It does not support partial reaping of events, or waiting on a batch of events. 2) When an io_uring ring is added to an epoll instance, it activates the io_uring "I'm being polled" logic which slows things down. Rather than use this approach, with EPOLL_WAIT support added to io_uring, event loops can use the normal io_uring wait logic for everything, as long as an epoll wait request has been armed with io_uring. Note that IORING_OP_EPOLL_WAIT does NOT take a timeout value, as this is an async request. Waiting on io_uring events in general has various timeout parameters, and those are the ones that should be used when waiting on any kind of request. If events are immediately available for reaping, then This opcode will return those immediately. If none are available, then it will post an async completion when they become available. cqe->res will contain either an error code (< 0 value) for a malformed request, invalid epoll instance, etc. It will return a positive result indicating how many events were reaped. IORING_OP_EPOLL_WAIT requests may be canceled using the normal io_uring cancelation infrastructure. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-20Merge branch 'for-6.15/io_uring-rx-zc' into for-6.15/io_uring-epoll-waitJens Axboe
* for-6.15/io_uring-rx-zc: (77 commits) io_uring: Rename KConfig to Kconfig io_uring/zcrx: fix leaks on failed registration io_uring/zcrx: recheck ifq on shutdown io_uring/zcrx: add selftest net: add documentation for io_uring zcrx io_uring/zcrx: add copy fallback io_uring/zcrx: throttle receive requests io_uring/zcrx: set pp memory provider for an rx queue io_uring/zcrx: add io_recvzc request io_uring/zcrx: dma-map area for the device io_uring/zcrx: implement zerocopy receive pp memory provider io_uring/zcrx: grab a net device io_uring/zcrx: add io_zcrx_area io_uring/zcrx: add interface queue and refill queue net: add helpers for setting a memory provider on an rx queue net: page_pool: add memory provider helpers net: prepare for non devmem TCP memory providers net: page_pool: add a mp hook to unregister_netdevice* net: page_pool: add callback for mp info printing netdev: add io_uring memory provider info ...
2025-02-20Merge tag 'linux-can-next-for-6.15-20250219' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2025-02-19 this is a pull request of 12 patches for net-next/master. The first 4 patches are by Krzysztof Kozlowski and simplify the c_can driver's c_can_plat_probe() function. Ciprian Marian Costea contributes 3 patches to add S32G2/S32G3 support to the flexcan driver. Ruffalo Lavoisier's patch removes a duplicated word from the mcp251xfd DT bindings documentation. Oleksij Rempel extends the J1939 documentation. The next patch is by Oliver Hartkopp and adds access for the Remote Request Substitution bit in CAN-XL frames. Henrik Brix Andersen's patch for the gs_usb driver adds support for the CANnectivity firmware. The last patch is by Robin van der Gracht and removes a duplicated setup of RX FIFO in the rockchip_canfd driver. linux-can-next-for-6.15-20250219 * tag 'linux-can-next-for-6.15-20250219' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: can: rockchip_canfd: rkcanfd_chip_fifo_setup(): remove duplicated setup of RX FIFO can: gs_usb: add VID/PID for the CANnectivity firmware can: canxl: support Remote Request Substitution bit access can: j1939: Extend stack documentation with buffer size behavior dt-binding: can: mcp251xfd: remove duplicate word can: flexcan: add NXP S32G2/S32G3 SoC support can: flexcan: Add quirk to handle separate interrupt lines for mailboxes dt-bindings: can: fsl,flexcan: add S32G2/S32G3 SoC support can: c_can: Use syscon_regmap_lookup_by_phandle_args can: c_can: Use of_property_present() to test existence of DT property can: c_can: Simplify handling syscon error path can: c_can: Drop useless final probe failure message ==================== Link: https://patch.msgid.link/20250219113354.529611-1-mkl@pengutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-19net: fib_rules: Add port mask attributesIdo Schimmel
Add attributes that allow matching on source and destination ports with a mask. Matching on the source port with a mask is needed in deployments where users encode path information into certain bits of the UDP source port. Temporarily set the type of the attributes to 'NLA_REJECT' while support is being added. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250217134109.311176-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19Merge tag 'mm-hotfixes-stable-2025-02-19-17-49' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "18 hotfixes. 5 are cc:stable and the remainder address post-6.13 issues or aren't considered necessary for -stable kernels. 10 are for MM and 8 are for non-MM. All are singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-02-19-17-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: test_xarray: fix failure in check_pause when CONFIG_XARRAY_MULTI is not defined kasan: don't call find_vm_area() in a PREEMPT_RT kernel MAINTAINERS: update Nick's contact info selftests/mm: fix check for running THP tests mm: hugetlb: avoid fallback for specific node allocation of 1G pages memcg: avoid dead loop when setting memory.max mailmap: update Nick's entry mm: pgtable: fix incorrect reclaim of non-empty PTE pages taskstats: modify taskstats version getdelays: fix error format characters mm/migrate_device: don't add folio to be freed to LRU in migrate_device_finalize() tools/mm: fix build warnings with musl-libc mailmap: add entry for Feng Tang .mailmap: add entries for Jeff Johnson mm,madvise,hugetlb: check for 0-length range after end address adjustment mm/zswap: fix inconsistency when zswap_store_page() fails lib/iov_iter: fix import_iovec_ubuf iovec management procfs: fix a locking bug in a vmcore_add_device_dump() error path
2025-02-19can: canxl: support Remote Request Substitution bit accessOliver Hartkopp
The Remote Request Substitution bit is a dominant bit ("0") in the CAN XL frame. As some CAN XL controllers support to access this bit a new CANXL_RRS value has been defined for the canxl_frame.flags element. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20250124142347.7444-1-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-02-18io_uring: fix spelling error in uapi io_uring.hJens Axboe
This is obviously not that important, but when changes are synced back from the kernel to liburing, the codespell CI ends up erroring because of this misspelling. Let's just correct it and avoid this biting us again on an import. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-18s390/vfio-ap: Signal eventfd when guest AP configuration is changedRorie Reyes
In this patch, an eventfd object is created by the vfio_ap device driver and used to notify userspace when a guests's AP configuration is dynamically changed. Such changes may occur whenever: * An adapter, domain or control domain is assigned to or unassigned from a mediated device that is attached to the guest. * A queue assigned to the mediated device that is attached to a guest is bound to or unbound from the vfio_ap device driver. This can occur either by manually binding/unbinding the queue via the vfio_ap driver's sysfs bind/unbind attribute interfaces, or because an adapter, domain or control domain assigned to the mediated device is added to or removed from the host's AP configuration via an SE/HMC The purpose of this patch is to provide immediate notification of changes made to a guest's AP configuration by the vfio_ap driver. This will enable the guest to take immediate action rather than relying on polling or some other inefficient mechanism to detect changes to its AP configuration. Note that there are corresponding QEMU patches that will be shipped along with this patch (see vfio-ap: Report vfio-ap configuration changes) that will pick up the eventfd signal. Signed-off-by: Rorie Reyes <rreyes@linux.ibm.com> Reviewed-by: Anthony Krowiak <akrowiak@linux.ibm.com> Tested-by: Anthony Krowiak <akrowiak@linux.ibm.com> Link: https://lore.kernel.org/r/20250107183645.90082-1-rreyes@linux.ibm.com Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-02-18Merge tag 'sound-6.14-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "A slightly large collection of fixes, spread over various drivers. Almost all are small and device-specific fixes and quirks in ASoC SOF Intel and AMD, Renesas, Cirrus, HD-audio, in addition to a small fix for MIDI 2.0" * tag 'sound-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (41 commits) ALSA: seq: Drop UMP events when no UMP-conversion is set ALSA: hda/conexant: Add quirk for HP ProBook 450 G4 mute LED ALSA: hda/cirrus: Reduce codec resume time ALSA: hda/cirrus: Correct the full scale volume set logic virtio_snd.h: clarify that `controls` depends on VIRTIO_SND_F_CTLS ALSA: hda: Add error check for snd_ctl_rename_id() in snd_hda_create_dig_out_ctls() ALSA: hda/tas2781: Fix index issue in tas2781 hda SPI driver ASoC: imx-audmix: remove cpu_mclk which is from cpu dai device ALSA: hda/realtek: Fixup ALC225 depop procedure ALSA: hda/tas2781: Update tas2781 hda SPI driver ASoC: cs35l41: Fix acpi_device_hid() not found ASoC: SOF: amd: Add branch prediction hint in ACP IRQ handler ASoC: SOF: amd: Handle IPC replies before FW_BOOT_COMPLETE ASoC: SOF: amd: Drop unused includes from Vangogh driver ASoC: SOF: amd: Add post_fw_run_delay ACP quirk ASoC: Intel: soc-acpi-intel-ptl-match: revise typo of rt713_vb_l2_rt1320_l13 ASoC: Intel: soc-acpi-intel-ptl-match: revise typo of rt712_vb + rt1320 support ALSA: Switch to use hrtimer_setup() ALSA: hda: hda-intel: add Panther Lake-H support ASoC: SOF: Intel: pci-ptl: Add support for PTL-H ...
2025-02-17taskstats: modify taskstats versionWang Yaxin
After adding "delay max" and "delay min" to the taskstats structure, the taskstats version needs to be updated. Link: https://lkml.kernel.org/r/20250208144901218Q5ptVpqsQkb2MOEmW4Ujn@zte.com.cn Fixes: f65c64f311ee ("delayacct: add delay min to record delay peak") Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn> Reviewed-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17netdev-genl: Add an XSK attribute to queuesJoe Damato
Expose a new per-queue nest attribute, xsk, which will be present for queues that are being used for AF_XDP. If the queue is not being used for AF_XDP, the nest will not be present. In the future, this attribute can be extended to include more data about XSK as it is needed. Signed-off-by: Joe Damato <jdamato@fastly.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250214211255.14194-3-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-17io_uring/zcrx: add io_recvzc requestDavid Wei
Add io_uring opcode OP_RECV_ZC for doing zero copy reads out of a socket. Only the connection should be land on the specific rx queue set up for zero copy, and the socket must be handled by the io_uring instance that the rx queue was registered for zero copy with. That's because neither net_iovs / buffers from our queue can be read by outside applications, nor zero copy is possible if traffic for the zero copy connection goes to another queue. This coordination is outside of the scope of this patch series. Also, any traffic directed to the zero copy enabled queue is immediately visible to the application, which is why CAP_NET_ADMIN is required at the registration step. Of course, no data is actually read out of the socket, it has already been copied by the netdev into userspace memory via DMA. OP_RECV_ZC reads skbs out of the socket and checks that its frags are indeed net_iovs that belong to io_uring. A cqe is queued for each one of these frags. Recall that each cqe is a big cqe, with the top half being an io_uring_zcrx_cqe. The cqe res field contains the len or error. The lower IORING_ZCRX_AREA_SHIFT bits of the struct io_uring_zcrx_cqe::off field contain the offset relative to the start of the zero copy area. The upper part of the off field is trivially zero, and will be used to carry the area id. For now, there is no limit as to how much work each OP_RECV_ZC request does. It will attempt to drain a socket of all available data. This request always operates in multishot mode. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-7-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: add io_zcrx_areaDavid Wei
Add io_zcrx_area that represents a region of userspace memory that is used for zero copy. During ifq registration, userspace passes in the uaddr and len of userspace memory, which is then pinned by the kernel. Each net_iov is mapped to one of these pages. The freelist is a spinlock protected list that keeps track of all the net_iovs/pages that aren't used. For now, there is only one area per ifq and area registration happens implicitly as part of ifq registration. There is no API for adding/removing areas yet. The struct for area registration is there for future extensibility once we support multiple areas and TCP devmem. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-3-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: add interface queue and refill queueDavid Wei
Add a new object called an interface queue (ifq) that represents a net rx queue that has been configured for zero copy. Each ifq is registered using a new registration opcode IORING_REGISTER_ZCRX_IFQ. The refill queue is allocated by the kernel and mapped by userspace using a new offset IORING_OFF_RQ_RING, in a similar fashion to the main SQ/CQ. It is used by userspace to return buffers that it is done with, which will then be re-used by the netdev again. The main CQ ring is used to notify userspace of received data by using the upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each entry contains the offset + len to the data. For now, each io_uring instance only has a single ifq. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-2-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17Merge commit '71f0dd5a3293d75d26d405ffbaedfdda4836af32' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next into for-6.15/io_uring-rx-zc Merge networking zerocopy receive tree, to get the prep patches for the io_uring rx zc support. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (63 commits) net: add helpers for setting a memory provider on an rx queue net: page_pool: add memory provider helpers net: prepare for non devmem TCP memory providers net: page_pool: add a mp hook to unregister_netdevice* net: page_pool: add callback for mp info printing netdev: add io_uring memory provider info net: page_pool: create hooks for custom memory providers net: generalise net_iov chunk owners net: prefix devmem specific helpers net: page_pool: don't cast mp param to devmem tools: ynl: add all headers to makefile deps eth: fbnic: set IFF_UNICAST_FLT to avoid enabling promiscuous mode when adding unicast addrs eth: fbnic: add MAC address TCAM to debugfs tools: ynl-gen: support limits using definitions tools: ynl-gen: don't output external constants net/mlx5e: Avoid WARN_ON when configuring MQPRIO with HTB offload enabled net/mlx5e: Remove unused mlx5e_tc_flow_action struct net/mlx5: Remove stray semicolon in LAG port selection table creation net/mlx5e: Support FEC settings for 200G per lane link modes net/mlx5: Add support for 200Gbps per lane link modes ...
2025-02-14Merge tag 'thermal-6.14-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull thermal control fixes from Rafael Wysocki: "Fix a regression caused by an inadvertent change of the THERMAL_GENL_ATTR_CPU_CAPABILITY value in one of the recent thermal commits (Zhang Rui) and drop a stale piece of documentation (Daniel Lezcano)" * tag 'thermal-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: thermal/cpufreq_cooling: Remove structure member documentation thermal/netlink: Prevent userspace segmentation fault by adjusting UAPI header
2025-02-14virtio_snd.h: clarify that `controls` depends on VIRTIO_SND_F_CTLSStefano Garzarella
As defined in the specification, the `controls` field in the configuration space is only valid/present if VIRTIO_SND_F_CTLS is negotiated. From https://docs.oasis-open.org/virtio/virtio/v1.3/virtio-v1.3.html: 5.14.4 Device Configuration Layout ... controls (driver-read-only) indicates a total number of all available control elements if VIRTIO_SND_F_CTLS has been negotiated. Let's use the same style used in virtio_blk.h to clarify this and to avoid confusion as happened in QEMU (see link). Link: https://gitlab.com/qemu-project/qemu/-/issues/2805 Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Link: https://patch.msgid.link/20250213161825.139952-1-sgarzare@redhat.com
2025-02-14landlock: Minor typo and grammar fixes in IPC scoping documentationGünther Noack
* Fix some whitespace, punctuation and minor grammar. * Add a missing sentence about the minimum ABI version, to stay in line with the section next to it. Cc: Tahera Fahimi <fahimitahera@gmail.com> Cc: Tanya Agarwal <tanyaagarwal25699@gmail.com> Signed-off-by: Günther Noack <gnoack@google.com> Link: https://lore.kernel.org/r/20250124154445.162841-1-gnoack@google.com [mic: Add newlines, update doc date] Signed-off-by: Mickaël Salaün <mic@digikod.net>
2025-02-13fs/xattr: bpf: Introduce security.bpf. xattr name prefixSong Liu
Introduct new xattr name prefix security.bpf., and enable reading these xattrs from bpf kfuncs bpf_get_[file|dentry]_xattr(). As we are on it, correct the comments for return value of bpf_get_[file|dentry]_xattr(), i.e. return length the xattr value on success. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Matt Bobrowski <mattbobrowski@google.com> Link: https://lore.kernel.org/r/20250130213549.3353349-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-12Merge patch series "fs: allow changing idmappings"Christian Brauner
Christian Brauner <brauner@kernel.org> says: Currently, it isn't possible to change the idmapping of an idmapped mount. This is becoming an obstacle for various use-cases. /* idmapped home directories with systemd-homed */ On newer systems /home is can be an idmapped mount such that each file on disk is owned by 65536 and a subfolder exists for foreign id ranges such as containers. For example, a home directory might look like this (using an arbitrary folder as an example): user1@localhost:~/data/mount-idmapped$ ls -al /data/ total 16 drwxrwxrwx 1 65536 65536 36 Jan 27 12:15 . drwxrwxr-x 1 root root 184 Jan 27 12:06 .. -rw-r--r-- 1 65536 65536 0 Jan 27 12:07 aaa -rw-r--r-- 1 65536 65536 0 Jan 27 12:07 bbb -rw-r--r-- 1 65536 65536 0 Jan 27 12:07 cc drwxr-xr-x 1 2147352576 2147352576 0 Jan 27 19:06 containers When logging in home is mounted as an idmapped mount with the following idmappings: 65536:$(id -u):1 // uid mapping 65536:$(id -g):1 // gid mapping 2147352576:2147352576:65536 // uid mapping 2147352576:2147352576:65536 // gid mapping So for a user with uid/gid 1000 an idmapped /home would like like this: user1@localhost:~/data/mount-idmapped$ ls -aln /mnt/ total 16 drwxrwxrwx 1 1000 1000 36 Jan 27 12:15 . drwxrwxr-x 1 0 0 184 Jan 27 12:06 .. -rw-r--r-- 1 1000 1000 0 Jan 27 12:07 aaa -rw-r--r-- 1 1000 1000 0 Jan 27 12:07 bbb -rw-r--r-- 1 1000 1000 0 Jan 27 12:07 cc drwxr-xr-x 1 2147352576 2147352576 0 Jan 27 19:06 containers In other words, 65536 is mapped to the user's uid/gid and the range 2147352576 up to 2147352576 + 65536 is an identity mapping for containers. When a container is started a transient uid/gid range is allocated outside of both mappings of the idmapped mount. For example, the container might get the idmapping: $ cat /proc/1742611/uid_map 0 537985024 65536 This container will be allowed to write to disk within the allocated foreign id range 2147352576 to 2147352576 + 65536. To do this an idmapped mount must be created from an already idmapped mount such that: - The mappings for the user's uid/gid must be dropped, i.e., the following mappings are removed: 65536:$(id -u):1 // uid mapping 65536:$(id -g):1 // gid mapping - A mapping for the transient uid/gid range to the foreign uid/gid range is added: 2147352576:537985024:65536 In combination this will mean that the container will write to disk within the foreign id range 2147352576 to 2147352576 + 65536. /* nested containers */ When the outer container makes use of idmapped mounts it isn't posssible to create an idmapped mount for the inner container with a differen idmapping from the outer container's idmapped mount. There are other usecases and the two above just serve as an illustration of the problem. This patchset makes it possible to create a new idmapped mount from an already idmapped mount. It aims to adhere to current performance constraints and requirements: - Idmapped mounts aim to have near zero performance implications for path lookup. That is why no refernce counting, locking or any other mechanism can be required that would impact performance. This works be ensuring that a regular mount transitions to an idmapped mount once going from a static nop_mnt_idmap mapping to a non-static idmapping. - The idmapping of a mount change anymore for the lifetime of the mount afterwards. This not just avoids UAF issues it also avoids pitfalls such as generating non-matching uid/gid values. Changing idmappings could be solved by: - Idmappings could simply be reference counted (above the simple reference count when sharing them across multiple mounts). This would require pairing mnt_idmap_get() with mnt_idmap_put() which would end up being sprinkled everywhere into the VFS and some filesystems that access idmappings directly. It wouldn't just be quite ugly and introduce new complexity it would have a noticeable performance impact. - Idmappings could gain RCU protection. This would help the LOOKUP_RCU case and avoids taking reference counts under RCU. When not under LOOKUP_RCU reference counts need to be acquired on each idmapping. This would require pairing mnt_idmap_get() with mnt_idmap_put() which would end up being sprinkled everywhere into the VFS and some filesystems that access idmappings directly. This would have the same downsides as mentioned earlier. - The earlier solutions work by updating the mnt->mnt_idmap pointer with the new idmapping. Instead of this it would be possible to change the idmapping itself to avoid UAF issues. To do this a sequence counter would have to be added to struct mount. When retrieving the idmapping to generate uid/gid values the sequence counter would need to be sampled and the generation of the uid/gid would spin until the update of the idmap is finished. This has problems as well but the biggest issue will be that this can lead to inconsistent permission checking and inconsistent uid/gid pairs even more than this is already possible today. Specifically, during creation it could happen that: idmap = mnt_idmap(mnt); inode_permission(idmap, ...); may_create(idmap); // create file with uid/gid based on @idmap in between the permission checking and the generation of the uid/gid value the idmapping could change leading to the permission checking and uid/gid value that is actually used to create a file on disk being out of sync. Similarly if two values are generated like: idmap = mnt_idmap(mnt) vfsgid = make_vfsgid(idmap); // idmapping gets update concurrently vfsuid = make_vfsuid(idmap); @vfsgid and @vfsuid could be out of sync if the idmapping was changed in between. The generation of vfsgid/vfsuid could span a lot of codelines so to guard against this a sequence count would have to be passed around. The performance impact of this solutio are less clear but very likely not zero. - Using SRCU similar to fanotify that can sleep. I find that not just ugly but it would have memory consumption implications and is overall pretty ugly. /* solution */ So, to avoid all of these pitfalls creating an idmapped mount from an already idmapped mount will be done atomically, i.e., a new detached mount is created and a new set of mount properties applied to it without it ever having been exposed to userspace at all. This can be done in two ways. A new flag to open_tree() is added OPEN_TREE_CLEAR_IDMAP that clears the old idmapping and returns a mount that isn't idmapped. And then it is possible to set mount attributes on it again including creation of an idmapped mount. This has the consequence that a file descriptor must exist in userspace that doesn't have any idmapping applied and it will thus never work in unpriviledged scenarios. As a container would be able to remove the idmapping of the mount it has been given. That should be avoided. Instead, we add open_tree_attr() which works just like open_tree() but takes an optional struct mount_attr parameter. This is useful beyond idmappings as it fills a gap where a mount never exists in userspace without the necessary mount properties applied. This is particularly useful for mount options such as MOUNT_ATTR_{RDONLY,NOSUID,NODEV,NOEXEC}. To create a new idmapped mount the following works: // Create a first idmapped mount struct mount_attr attr = { .attr_set = MOUNT_ATTR_IDMAP .userns_fd = fd_userns }; fd_tree = open_tree(-EBADF, "/", OPEN_TREE_CLONE, &attr, sizeof(attr)); move_mount(fd_tree, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH); // Create a second idmapped mount from the first idmapped mount attr.attr_set = MOUNT_ATTR_IDMAP; attr.userns_fd = fd_userns2; fd_tree2 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr)); // Create a second non-idmapped mount from the first idmapped mount: memset(&attr, 0, sizeof(attr)); attr.attr_clr = MOUNT_ATTR_IDMAP; fd_tree2 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr)); * patches from https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-0-c25feb0d2eb3@kernel.org: fs: allow changing idmappings fs: add kflags member to struct mount_kattr fs: add open_tree_attr() fs: add copy_mount_setattr() helper fs: add vfs_open_tree() helper Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-0-c25feb0d2eb3@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-12statmount: add a new supported_mask fieldJeff Layton
Some of the fields in the statmount() reply can be optional. If the kernel has nothing to emit in that field, then it doesn't set the flag in the reply. This presents a problem: There is currently no way to know what mask flags the kernel supports since you can't always count on them being in the reply. Add a new STATMOUNT_SUPPORTED_MASK flag and field that the kernel can set in the reply. Userland can use this to determine if the fields it requires from the kernel are supported. This also gives us a way to deprecate fields in the future, if that should become necessary. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20250206-statmount-v2-1-6ae70a21c2ab@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-12statmount: allow to retrieve idmappingsChristian Brauner
This adds the STATMOUNT_MNT_UIDMAP and STATMOUNT_MNT_GIDMAP options. It allows the retrieval of idmappings via statmount(). Currently it isn't possible to figure out what idmappings are applied to an idmapped mount. This information is often crucial. Before statmount() the only realistic options for an interface like this would have been to add it to /proc/<pid>/fdinfo/<nr> or to expose it in /proc/<pid>/mountinfo. Both solution would have been pretty ugly and would've shown information that is of strong interest to some application but not all. statmount() is perfect for this. The idmappings applied to an idmapped mount are shown relative to the caller's user namespace. This is the most useful solution that doesn't risk leaking information or confuse the caller. For example, an idmapped mount might have been created with the following idmappings: mount --bind -o X-mount.idmap="0:10000:1000 2000:2000:1 3000:3000:1" /srv /opt Listing the idmappings through statmount() in the same context shows: mnt_id: 2147485088 mnt_parent_id: 2147484816 fs_type: btrfs mnt_root: /srv mnt_point: /opt mnt_opts: ssd,discard=async,space_cache=v2,subvolid=5,subvol=/ mnt_uidmap[0]: 0 10000 1000 mnt_uidmap[1]: 2000 2000 1 mnt_uidmap[2]: 3000 3000 1 mnt_gidmap[0]: 0 10000 1000 mnt_gidmap[1]: 2000 2000 1 mnt_gidmap[2]: 3000 3000 1 But the idmappings might not always be resolvable in the caller's user namespace. For example: unshare --user --map-root In this case statmount() will skip any mappings that fil to resolve in the caller's idmapping: mnt_id: 2147485087 mnt_parent_id: 2147484016 fs_type: btrfs mnt_root: /srv mnt_point: /opt mnt_opts: ssd,discard=async,space_cache=v2,subvolid=5,subvol=/ The caller can differentiate between a mount not being idmapped and a mount that is idmapped but where all mappings fail to resolve in the caller's idmapping by check for the STATMOUNT_MNT_{G,U}IDMAP flag being raised but the number of mappings in ->mnt_{g,u}idmap_num being zero. Note that statmount() requires that the whole range must be resolvable in the caller's user namespace. If a subrange fails to map it will still list the map as not resolvable. This is a practical compromise to avoid having to find which subranges are resovable and wich aren't. Idmappings are listed as a string array with each mapping separated by zero bytes. This allows to retrieve the idmappings and immediately use them for writing to e.g., /proc/<pid>/{g,u}id_map and it also allow for simple iteration like: if (stmnt->mask & STATMOUNT_MNT_UIDMAP) { const char *idmap = stmnt->str + stmnt->mnt_uidmap; for (size_t idx = 0; idx < stmnt->mnt_uidmap_nr; idx++) { printf("mnt_uidmap[%lu]: %s\n", idx, idmap); idmap += strlen(idmap) + 1; } } Link: https://lore.kernel.org/r/20250204-work-mnt_idmap-statmount-v2-2-007720f39f2e@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-12f2fs: add ioctl to get IO priority hintJaegeuk Kim
This patch adds an ioctl to give a per-file priority hint to attach REQ_PRIO. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-02-11thermal/netlink: Prevent userspace segmentation fault by adjusting UAPI headerZhang Rui
The intel-lpmd tool [1], which uses the THERMAL_GENL_ATTR_CPU_CAPABILITY attribute to receive HFI events from kernel space, encounters a segmentation fault after commit 1773572863c4 ("thermal: netlink: Add the commands and the events for the thresholds"). The issue arises because the THERMAL_GENL_ATTR_CPU_CAPABILITY raw value was changed while intel_lpmd still uses the old value. Although intel_lpmd can be updated to check the THERMAL_GENL_VERSION and use the appropriate THERMAL_GENL_ATTR_CPU_CAPABILITY value, the commit itself is questionable. The commit introduced a new element in the middle of enum thermal_genl_attr, which affects many existing attributes and introduces potential risks and unnecessary maintenance burdens for userspace thermal netlink event users. Solve the issue by moving the newly introduced THERMAL_GENL_ATTR_TZ_PREV_TEMP attribute to the end of the enum thermal_genl_attr. This ensures that all existing thermal generic netlink attributes remain unaffected. Link: https://github.com/intel/intel-lpmd [1] Fixes: 1773572863c4 ("thermal: netlink: Add the commands and the events for the thresholds") Signed-off-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://patch.msgid.link/20250208074907.5679-1-rui.zhang@intel.com [ rjw: Subject edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-02-11tcp: add the ability to control max RTOEric Dumazet
Currently, TCP stack uses a constant (120 seconds) to limit the RTO value exponential growth. Some applications want to set a lower value. Add TCP_RTO_MAX_MS socket option to set a value (in ms) between 1 and 120 seconds. It is discouraged to change the socket rto max on a live socket, as it might lead to unexpected disconnects. Following patch is adding a netns sysctl to control the default value at socket creation time. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11wifi: nl80211/cfg80211: Stop supporting cooked monitorAlexander Wetzel
Unconditionally start to refuse creating cooked monitor interfaces to phase them out. There is no feature flag for drivers to opt-in for cooked monitor and all known users are using/preferring the modern API since the hostapd release 1.0 in May 2012. Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de> Link: https://patch.msgid.link/20250204111352.7004-1-Alexander@wetzel-home.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-02-10elf: Define note name macrosAkihiko Odaki
elf.h had a comment saying: > Notes used in ET_CORE. Architectures export some of the arch register > sets using the corresponding note types via the PTRACE_GETREGSET and > PTRACE_SETREGSET requests. > The note name for these types is "LINUX", except NT_PRFPREG that is > named "CORE". However, NT_PRSTATUS is also named "CORE". It is also unclear what "these types" refers to. To fix these problems, define a name for each note type. The added definitions are macros so the kernel and userspace can directly refer to them to remove their duplicate definitions of note names. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Acked-by: Baoquan He <bhe@redhat.com> Reviewed-by: Dave Martin <Dave.Martin@arm.com> Link: https://lore.kernel.org/r/20250115-elf-v5-1-0f9e55bbb2fc@daynix.com Signed-off-by: Kees Cook <kees@kernel.org>
2025-02-10blk-crypto: add ioctls to create and prepare hardware-wrapped keysEric Biggers
Until this point, the kernel can use hardware-wrapped keys to do encryption if userspace provides one -- specifically a key in ephemerally-wrapped form. However, no generic way has been provided for userspace to get such a key in the first place. Getting such a key is a two-step process. First, the key needs to be imported from a raw key or generated by the hardware, producing a key in long-term wrapped form. This happens once in the whole lifetime of the key. Second, the long-term wrapped key needs to be converted into ephemerally-wrapped form. This happens each time the key is "unlocked". In Android, these operations are supported in a generic way through KeyMint, a userspace abstraction layer. However, that method is Android-specific and can't be used on other Linux systems, may rely on proprietary libraries, and also misleads people into supporting KeyMint features like rollback resistance that make sense for other KeyMint keys but don't make sense for hardware-wrapped inline encryption keys. Therefore, this patch provides a generic kernel interface for these operations by introducing new block device ioctls: - BLKCRYPTOIMPORTKEY: convert a raw key to long-term wrapped form. - BLKCRYPTOGENERATEKEY: have the hardware generate a new key, then return it in long-term wrapped form. - BLKCRYPTOPREPAREKEY: convert a key from long-term wrapped form to ephemerally-wrapped form. These ioctls are implemented using new operations in blk_crypto_ll_ops. Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650 Link: https://lore.kernel.org/r/20250204060041.409950-4-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-08iio: introduce the FAULT event typeGuillaume Ranquet
Add a new event type to describe an hardware failure. Reviewed-by: Nuno Sa <nuno.sa@analog.com> Signed-off-by: Guillaume Ranquet <granquet@baylibre.com> Reviewed-by: David Lechner <dlechner@baylibre.com> Link: https://patch.msgid.link/20250127-ad4111_openwire-v5-1-ef2db05c384f@baylibre.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2025-02-06net: ethtool: tsconfig: Fix netlink type of hwtstamp flagsKory Maincent
Fix the netlink type for hardware timestamp flags, which are represented as a bitset of flags. Although only one flag is supported currently, the correct netlink bitset type should be used instead of u32 to keep consistency with other fields. Address this by adding a new named string set description for the hwtstamp flag structure. The code has been introduced in the current release so the uAPI change is still okay. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Fixes: 6e9e2eed4f39 ("net: ethtool: Add support for tsconfig command to get/set hwtstamp config") Link: https://patch.msgid.link/20250205110304.375086-1-kory.maincent@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06Merge branch 'io_uring-zero-copy-rx'Jakub Kicinski
David Wei says: ==================== io_uring zero copy rx This patchset contains net/ patches needed by a new io_uring request implementing zero copy rx into userspace pages, eliminating a kernel to user copy. We configure a page pool that a driver uses to fill a hw rx queue to hand out user pages instead of kernel pages. Any data that ends up hitting this hw rx queue will thus be dma'd into userspace memory directly, without needing to be bounced through kernel memory. 'Reading' data out of a socket instead becomes a _notification_ mechanism, where the kernel tells userspace where the data is. The overall approach is similar to the devmem TCP proposal. This relies on hw header/data split, flow steering and RSS to ensure packet headers remain in kernel memory and only desired flows hit a hw rx queue configured for zero copy. Configuring this is outside of the scope of this patchset. We share netdev core infra with devmem TCP. The main difference is that io_uring is used for the uAPI and the lifetime of all objects are bound to an io_uring instance. Data is 'read' using a new io_uring request type. When done, data is returned via a new shared refill queue. A zero copy page pool refills a hw rx queue from this refill queue directly. Of course, the lifetime of these data buffers are managed by io_uring rather than the networking stack, with different refcounting rules. This patchset is the first step adding basic zero copy support. We will extend this iteratively with new features e.g. dynamically allocated zero copy areas, THP support, dmabuf support, improved copy fallback, general optimisations and more. In terms of netdev support, we're first targeting Broadcom bnxt. Patches aren't included since Taehee Yoo has already sent a more comprehensive patchset adding support in [1]. Google gve should already support this, and Mellanox mlx5 support is WIP pending driver changes. =========== Performance =========== Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet. Test setup: * AMD EPYC 9454 * Broadcom BCM957508 200G * Kernel v6.11 base [2] * liburing fork [3] * kperf fork [4] * 4K MTU * Single TCP flow With application thread + net rx softirq pinned to _different_ cores: +-------------------------------+ | epoll | io_uring | |-----------|-------------------| | 82.2 Gbps | 116.2 Gbps (+41%) | +-------------------------------+ Pinned to _same_ core: +-------------------------------+ | epoll | io_uring | |-----------|-------------------| | 62.6 Gbps | 80.9 Gbps (+29%) | +-------------------------------+ ===== Links ===== Broadcom bnxt support: [1]: https://lore.kernel.org/20241003160620.1521626-8-ap420073@gmail.com Linux kernel branch including io_uring bits: [2]: https://github.com/isilence/linux.git zcrx/v13 liburing for testing: [3]: https://github.com/isilence/liburing.git zcrx/next kperf for testing: [4]: https://git.kernel.dk/kperf.git ==================== Link: https://patch.msgid.link/20250204215622.695511-1-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06netdev: add io_uring memory provider infoDavid Wei
Add a nested attribute for io_uring memory provider info. For now it is empty and its presence indicates that a particular page pool or queue has an io_uring memory provider attached. $ ./cli.py --spec netlink/specs/netdev.yaml --dump page-pool-get [{'id': 80, 'ifindex': 2, 'inflight': 64, 'inflight-mem': 262144, 'napi-id': 525}, {'id': 79, 'ifindex': 2, 'inflight': 320, 'inflight-mem': 1310720, 'io_uring': {}, 'napi-id': 525}, ... $ ./cli.py --spec netlink/specs/netdev.yaml --dump queue-get [{'id': 0, 'ifindex': 1, 'type': 'rx'}, {'id': 0, 'ifindex': 1, 'type': 'tx'}, {'id': 0, 'ifindex': 2, 'napi-id': 513, 'type': 'rx'}, {'id': 1, 'ifindex': 2, 'napi-id': 514, 'type': 'rx'}, ... {'id': 12, 'ifindex': 2, 'io_uring': {}, 'napi-id': 525, 'type': 'rx'}, ... Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-6-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06ethtool: Add support for 200Gbps per lane link modesJianbo Liu
Define 200G, 400G and 800G link modes using 200Gbps per lane. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-05bpf: Add comment about helper freezeLevi Zim
Put a comment after the bpf helper list in uapi bpf.h to prevent people from trying to add new helpers there and direct them to kfuncs. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Levi Zim <rsworktech@outlook.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Acked-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/bpf/CAEf4BzZvQF+QQ=oip4vdz5A=9bd+OmN-CXk5YARYieaipK9s+A@mail.gmail.com/ Link: https://lore.kernel.org/bpf/20221231004213.h5fx3loccbs5hyzu@macbook-pro-6.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20250204-bpf-helper-freeze-v1-1-46efd9ff20dc@outlook.com
2025-02-05docs/bpf: Document the semantics of BTF tags with kind_flagIhor Solodrai
Explain the meaning of kind_flag in BTF type_tags and decl_tags. Update uapi btf.h kind_flag comment to reflect the changes. Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250130201239.1429648-3-ihor.solodrai@linux.dev
2025-02-05fanotify: notify on mount attach and detachMiklos Szeredi
Add notifications for attaching and detaching mounts. The following new event masks are added: FAN_MNT_ATTACH - Mount was attached FAN_MNT_DETACH - Mount was detached If a mount is moved, then the event is reported with (FAN_MNT_ATTACH | FAN_MNT_DETACH). These events add an info record of type FAN_EVENT_INFO_TYPE_MNT containing these fields identifying the affected mounts: __u64 mnt_id - the ID of the mount (see statmount(2)) FAN_REPORT_MNT must be supplied to fanotify_init() to receive these events and no other type of event can be received with this report type. Marks are added with FAN_MARK_MNTNS, which records the mount namespace from an nsfs file (e.g. /proc/self/ns/mnt). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20250129165803.72138-3-mszeredi@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-05pidfd: add PIDFD_SELF* sentinels to refer to own thread/processLorenzo Stoakes
It is useful to be able to utilise the pidfd mechanism to reference the current thread or process (from a userland point of view - thread group leader from the kernel's point of view). Therefore introduce PIDFD_SELF_THREAD to refer to the current thread, and PIDFD_SELF_THREAD_GROUP to refer to the current thread group leader. For convenience and to avoid confusion from userland's perspective we alias these: * PIDFD_SELF is an alias for PIDFD_SELF_THREAD - This is nearly always what the user will want to use, as they would find it surprising if for instance fd's were unshared()'d and they wanted to invoke pidfd_getfd() and that failed. * PIDFD_SELF_PROCESS is an alias for PIDFD_SELF_THREAD_GROUP - Most users have no concept of thread groups or what a thread group leader is, and from userland's perspective and nomenclature this is what userland considers to be a process. We adjust pidfd_get_task() and the pidfd_send_signal() system call with specific handling for this, implementing this functionality for process_madvise(), process_mrelease() (albeit, using it here wouldn't really make sense) and pidfd_send_signal(). Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/24315a16a3d01a548dd45c7515f7d51c767e954e.1738268370.git.lorenzo.stoakes@oracle.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-05counter: add direction change eventDavid Lechner
Add COUNTER_EVENT_DIRECTION_CHANGE to be used by drivers to emit events when a counter detects a change in direction. Signed-off-by: David Lechner <dlechner@baylibre.com> Link: https://lore.kernel.org/r/20250110-counter-ti-eqep-add-direction-support-v2-2-c6b6f96d2db9@baylibre.com Signed-off-by: William Breathitt Gray <wbg@kernel.org>
2025-01-29Merge tag 'fuse-update-6.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: "Add support for io-uring communication between kernel and userspace using IORING_OP_URING_CMD (Bernd Schubert). Following features enable gains in performance compared to the regular interface: - Allow processing multiple requests with less syscall overhead - Combine commit of old and fetch of new fuse request - CPU/NUMA affinity of queues Patches were reviewed by several people, including Pavel Begunkov, io-uring co-maintainer" * tag 'fuse-update-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: prevent disabling io-uring on active connections fuse: enable fuse-over-io-uring fuse: block request allocation until io-uring init is complete fuse: {io-uring} Prevent mount point hang on fuse-server termination fuse: Allow to queue bg requests through io-uring fuse: Allow to queue fg requests through io-uring fuse: {io-uring} Make fuse_dev_queue_{interrupt,forget} non-static fuse: {io-uring} Handle teardown of ring entries fuse: Add io-uring sqe commit and fetch support fuse: {io-uring} Make hash-list req unique finding functions non-static fuse: Add fuse-io-uring handling into fuse_copy fuse: Make fuse_copy non static fuse: {io-uring} Handle SQEs - register commands fuse: make args->in_args[0] to be always the header fuse: Add fuse-io-uring design documentation fuse: Move request bits fuse: Move fuse_get_dev to header file fuse: rename to fuse_dev_end_requests and make non-static
2025-01-27Merge tag 'nfsd-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds
Pull nfsd updates from Chuck Lever: "Jeff Layton contributed an implementation of NFSv4.2+ attribute delegation, as described here: https://www.ietf.org/archive/id/draft-ietf-nfsv4-delstid-08.html This interoperates with similar functionality introduced into the Linux NFS client in v6.11. An attribute delegation permits an NFS client to manage a file's mtime, rather than flushing dirty data to the NFS server so that the file's mtime reflects the last write, which is considerably slower. Neil Brown contributed dynamic NFSv4.1 session slot table resizing. This facility enables NFSD to increase or decrease the number of slots per NFS session depending on server memory availability. More session slots means greater parallelism. Chuck Lever fixed a long-standing latent bug where NFSv4 COMPOUND encoding screws up when crossing a page boundary in the encoding buffer. This is a zero-day bug, but hitting it is rare and depends on the NFS client implementation. The Linux NFS client does not happen to trigger this issue. A variety of bug fixes and other incremental improvements fill out the list of commits in this release. Great thanks to all contributors, reviewers, testers, and bug reporters who participated during this development cycle" * tag 'nfsd-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (42 commits) sunrpc: Remove gss_{de,en}crypt_xdr_buf deadcode sunrpc: Remove gss_generic_token deadcode sunrpc: Remove unused xprt_iter_get_xprt Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" nfsd: implement OPEN_ARGS_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION nfsd: handle delegated timestamps in SETATTR nfsd: add support for delegated timestamps nfsd: rework NFS4_SHARE_WANT_* flag handling nfsd: add support for FATTR4_OPEN_ARGUMENTS nfsd: prepare delegation code for handing out *_ATTRS_DELEG delegations nfsd: rename NFS4_SHARE_WANT_* constants to OPEN4_SHARE_ACCESS_WANT_* nfsd: switch to autogenerated definitions for open_delegation_type4 nfs_common: make include/linux/nfs4.h include generated nfs4_1.h nfsd: fix handling of delegated change attr in CB_GETATTR SUNRPC: Document validity guarantees of the pointer returned by reserve_space NFSD: Insulate nfsd4_encode_fattr4() from page boundaries in the encode buffer NFSD: Insulate nfsd4_encode_secinfo() from page boundaries in the encode buffer NFSD: Refactor nfsd4_do_encode_secinfo() again NFSD: Insulate nfsd4_encode_readlink() from page boundaries in the encode buffer NFSD: Insulate nfsd4_encode_read_plus_data() from page boundaries in the encode buffer ...
2025-01-27Merge tag 'for-6.14/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - fix a spelling error in dm-raid - change kzalloc to kcalloc - remove useless test in alloc_multiple_bios - disable REQ_NOWAIT for flushes - dm-transaction-manager: use red-black trees instead of linear lists - atomic writes support for dm-linear, dm-stripe and dm-mirror - dm-crypt: code cleanups and two bugfixes * tag 'for-6.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-crypt: track tag_offset in convert_context dm-crypt: don't initialize cc_sector again dm-crypt: don't update io->sector after kcryptd_crypt_write_io_submit() dm-crypt: use bi_sector in bio when initialize integrity seed dm-crypt: fully initialize clone->bi_iter in crypt_alloc_buffer() dm-crypt: set atomic as false when calling crypt_convert() in kworker dm-mirror: Support atomic writes dm-io: Warn on creating multiple atomic write bios for a region dm-stripe: Enable atomic writes dm-linear: Enable atomic writes dm: Ensure cloned bio is same length for atomic write dm-table: atomic writes support dm-transaction-manager: use red-black trees instead of linear lists dm: disable REQ_NOWAIT for flushes dm: remove useless test in alloc_multiple_bios dm: change kzalloc to kcalloc dm raid: fix spelling errors in raid_ctr()
2025-01-27Merge tag 'char-misc-6.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull Char/Misc/IIO driver updates from Greg KH: "Here is the "big" set of char/misc/iio and other smaller driver subsystem updates for 6.14-rc1. Loads of different things in here this development cycle, highlights are: - ntsync "driver" to handle Windows locking types enabling Wine to work much better on many workloads (i.e. games). The driver framework was in 6.13, but now it's enabled and fully working properly. Should make many SteamOS users happy. Even comes with tests! - Large IIO driver updates and bugfixes - FPGA driver updates - Coresight driver updates - MHI driver updates - PPS driver updatesa - const bin_attribute reworking for many drivers - binder driver updates - smaller driver updates and fixes All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (311 commits) ntsync: Fix reference leaks in the remaining create ioctls. spmi: hisi-spmi-controller: Drop duplicated OF node assignment in spmi_controller_probe() spmi: Set fwnode for spmi devices ntsync: fix a file reference leak in drivers/misc/ntsync.c scripts/tags.sh: Don't tag usages of DECLARE_BITMAP dt-bindings: interconnect: qcom,msm8998-bwmon: Add SM8750 CPU BWMONs dt-bindings: interconnect: OSM L3: Document sm8650 OSM L3 compatible dt-bindings: interconnect: qcom-bwmon: Document QCS615 bwmon compatibles interconnect: sm8750: Add missing const to static qcom_icc_desc memstick: core: fix kernel-doc notation intel_th: core: fix kernel-doc warnings binder: log transaction code on failure iio: dac: ad3552r-hs: clear reset status flag iio: dac: ad3552r-common: fix ad3541/2r ranges iio: chemical: bme680: Fix uninitialized variable in __bme680_read_raw() misc: fastrpc: Fix copy buffer page size misc: fastrpc: Fix registered buffer page address misc: fastrpc: Deregister device nodes properly in error scenarios nvmem: core: improve range check for nvmem_cell_write() nvmem: qcom-spmi-sdam: Set size in struct nvmem_config ...
2025-01-27Merge tag 'usb-6.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB / Thunderbolt driver updates from Greg KH: "Here is the USB and Thunderbolt driver updates for 6.14-rc1. Nothing huge in here, just lots of new hardware support and updates for existing drivers. Changes here are: - big gadget f_tcm driver update - other gadget driver updates and fixes - thunderbolt driver updates for new hardware and capabilities and lots more debugging functionality to handle it when things aren't working well. - xhci driver updates - new USB-serial device updates - typec driver updates, including a chrome platform driver (acked by the subsystem maintainers) - other small driver updates All of these have been in linux-next for a while with no reported issues" * tag 'usb-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (123 commits) usb: hcd: Bump local buffer size in rh_string() Revert "usb: gadget: u_serial: Disable ep before setting port to null to fix the crash caused by port being null" usb: typec: tcpci: Prevent Sink disconnection before vPpsShutdown in SPR PPS usb: xhci: tegra: Fix OF boolean read warning usb: host: xhci-plat: add support compatible ID PNP0D15 usb: typec: ucsi: Add a macro definition for UCSI v1.0 usb: dwc3: core: Defer the probe until USB power supply ready usbip: Correct format specifier for seqnum from %d to %u usbip: Fix seqnum sign extension issue in vhci_tx_urb dt-bindings: usb: snps,dwc3: Split core description usb: quirks: Add NO_LPM quirk for TOSHIBA TransMemory-Mx device usb: dwc3: gadget: Reinitiate stream for all host NoStream behavior USB: Use str_enable_disable-like helpers USB: gadget: Use str_enable_disable-like helpers USB: phy: Use str_enable_disable-like helpers USB: typec: Use str_enable_disable-like helpers USB: host: Use str_enable_disable-like helpers USB: Replace own str_plural with common one USB: serial: quatech2: fix null-ptr-deref in qt2_process_read_urb() usb: phy: Remove API devm_usb_put_phy() ...
2025-01-27Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: "A small number of improvements all over the place: - vdpa/octeon support for multiple interrupts - virtio-pci support for error recovery - vp_vdpa support for notification with data - vhost/net fix to set num_buffers for spec compliance - virtio-mem now works with kdump on s390 And small cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (23 commits) virtio_blk: Add support for transport error recovery virtio_pci: Add support for PCIe Function Level Reset vhost/net: Set num_buffers for virtio 1.0 vdpa/octeon_ep: read vendor-specific PCI capability virtio-pci: define type and header for PCI vendor data vdpa/octeon_ep: handle device config change events vdpa/octeon_ep: enable support for multiple interrupts per device vdpa: solidrun: Replace deprecated PCI functions s390/kdump: virtio-mem kdump support (CONFIG_PROC_VMCORE_DEVICE_RAM) virtio-mem: support CONFIG_PROC_VMCORE_DEVICE_RAM virtio-mem: remember usable region size virtio-mem: mark device ready before registering callbacks in kdump mode fs/proc/vmcore: introduce PROC_VMCORE_DEVICE_RAM to detect device RAM ranges in 2nd kernel fs/proc/vmcore: factor out freeing a list of vmcore ranges fs/proc/vmcore: factor out allocating a vmcore range and adding it to a list fs/proc/vmcore: move vmcore definitions out of kcore.h fs/proc/vmcore: prefix all pr_* with "vmcore:" fs/proc/vmcore: disallow vmcore modifications while the vmcore is open fs/proc/vmcore: replace vmcoredd_mutex by vmcore_mutex fs/proc/vmcore: convert vmcore_cb_lock into vmcore_mutex ...
2025-01-27virtio-pci: define type and header for PCI vendor dataShijith Thotton
Added macro definition for VIRTIO_PCI_CAP_VENDOR_CFG to identify the PCI vendor data type in the virtio_pci_cap structure. Defined a new struct virtio_pci_vndr_data for the vendor data capability header as per the specification. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Shijith Thotton <sthotton@marvell.com> Message-Id: <20250103153226.1933479-3-sthotton@marvell.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>