Age  Commit message  Author
2025-03-05  pidfs: allow to retrieve exit information  Christian Brauner
Some tools like systemd's journal need to retrieve the exit and cgroup information after a process has already been reaped. This can happen, e.g., when retrieving a pidfd via SCM_PIDFD or SCM_PEERPIDFD. Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-6-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
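As a concrete illustration of the interface this series builds towards, here is a hypothetical userspace sketch that reads the stashed exit information through the PIDFD_GET_INFO ioctl; the PIDFD_INFO_EXIT/PIDFD_INFO_CGROUPID mask bits and the struct pidfd_info field names are assumptions based on this series and <linux/pidfd.h>, not something stated in the text above.

```c
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/pidfd.h>

/* Hypothetical sketch: query exit/cgroup info from a pidfd even after the
 * task has been reaped. Constant and field names are assumptions. */
static int print_exit_info(int pidfd)
{
	struct pidfd_info info = {
		.mask = PIDFD_INFO_EXIT | PIDFD_INFO_CGROUPID,
	};

	if (ioctl(pidfd, PIDFD_GET_INFO, &info) < 0)
		return -1;
	if (info.mask & PIDFD_INFO_EXIT)
		printf("exit code: %d\n", (int)info.exit_code);
	if (info.mask & PIDFD_INFO_CGROUPID)
		printf("cgroup id: %llu\n", (unsigned long long)info.cgroupid);
	return 0;
}
```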
2025-03-05  pidfs: record exit code and cgroupid at exit  Christian Brauner
Record the exit code and cgroupid in release_task() and stash in struct pidfs_exit_info so it can be retrieved even after the task has been reaped. Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-5-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05  pidfs: use private inode slab cache  Christian Brauner
Introduce a private inode slab cache for pidfs. In follow-up patches pidfs will gain the ability to provide exit information to userspace after the task has been reaped. This means storing exit information even after the task has already been released and struct pid's task linkage is gone. Store that information alongside the inode. Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-4-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05  pidfs: move setting flags into pidfs_alloc_file()  Christian Brauner
Instead of adding it in __pidfd_prepare(), place it where the actual file allocation happens and update the outdated comment. Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-3-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05  pidfd: rely on automatic cleanup in __pidfd_prepare()  Christian Brauner
Rely on scope-based cleanup for the allocated file descriptor. Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-2-c8c3d8361705@kernel.org Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05  pidfs: switch to copy_struct_to_user()  Christian Brauner
We have a helper that deals with all the required logic. Link: https://lore.kernel.org/r/20250305-work-pidfs-kill_on_last_close-v3-1-c8c3d8361705@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05  gpio: rcar: Use raw_spinlock to protect register access  Niklas Söderlund
Use raw_spinlock in order to fix spurious messages about invalid context when spinlock debugging is enabled. The lock is only used to serialize register access. [ 4.239592] ============================= [ 4.239595] [ BUG: Invalid wait context ] [ 4.239599] 6.13.0-rc7-arm64-renesas-05496-gd088502a519f #35 Not tainted [ 4.239603] ----------------------------- [ 4.239606] kworker/u8:5/76 is trying to lock: [ 4.239609] ffff0000091898a0 (&p->lock){....}-{3:3}, at: gpio_rcar_config_interrupt_input_mode+0x34/0x164 [ 4.239641] other info that might help us debug this: [ 4.239643] context-{5:5} [ 4.239646] 5 locks held by kworker/u8:5/76: [ 4.239651] #0: ffff0000080fb148 ((wq_completion)async){+.+.}-{0:0}, at: process_one_work+0x190/0x62c [ 4.250180] OF: /soc/sound@ec500000/ports/port@0/endpoint: Read of boolean property 'frame-master' with a value. [ 4.254094] #1: ffff80008299bd80 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0x1b8/0x62c [ 4.254109] #2: ffff00000920c8f8 [ 4.258345] OF: /soc/sound@ec500000/ports/port@1/endpoint: Read of boolean property 'bitclock-master' with a value. [ 4.264803] (&dev->mutex){....}-{4:4}, at: __device_attach_async_helper+0x3c/0xdc [ 4.264820] #3: ffff00000a50ca40 (request_class#2){+.+.}-{4:4}, at: __setup_irq+0xa0/0x690 [ 4.264840] #4: [ 4.268872] OF: /soc/sound@ec500000/ports/port@1/endpoint: Read of boolean property 'frame-master' with a value. [ 4.273275] ffff00000a50c8c8 (lock_class){....}-{2:2}, at: __setup_irq+0xc4/0x690 [ 4.296130] renesas_sdhi_internal_dmac ee100000.mmc: mmc1 base at 0x00000000ee100000, max clock rate 200 MHz [ 4.304082] stack backtrace: [ 4.304086] CPU: 1 UID: 0 PID: 76 Comm: kworker/u8:5 Not tainted 6.13.0-rc7-arm64-renesas-05496-gd088502a519f #35 [ 4.304092] Hardware name: Renesas Salvator-X 2nd version board based on r8a77965 (DT) [ 4.304097] Workqueue: async async_run_entry_fn [ 4.304106] Call trace: [ 4.304110] show_stack+0x14/0x20 (C) [ 4.304122] dump_stack_lvl+0x6c/0x90 [ 4.304131] dump_stack+0x14/0x1c [ 4.304138] __lock_acquire+0xdfc/0x1584 [ 4.426274] lock_acquire+0x1c4/0x33c [ 4.429942] _raw_spin_lock_irqsave+0x5c/0x80 [ 4.434307] gpio_rcar_config_interrupt_input_mode+0x34/0x164 [ 4.440061] gpio_rcar_irq_set_type+0xd4/0xd8 [ 4.444422] __irq_set_trigger+0x5c/0x178 [ 4.448435] __setup_irq+0x2e4/0x690 [ 4.452012] request_threaded_irq+0xc4/0x190 [ 4.456285] devm_request_threaded_irq+0x7c/0xf4 [ 4.459398] ata1: link resume succeeded after 1 retries [ 4.460902] mmc_gpiod_request_cd_irq+0x68/0xe0 [ 4.470660] mmc_start_host+0x50/0xac [ 4.474327] mmc_add_host+0x80/0xe4 [ 4.477817] tmio_mmc_host_probe+0x2b0/0x440 [ 4.482094] renesas_sdhi_probe+0x488/0x6f4 [ 4.486281] renesas_sdhi_internal_dmac_probe+0x60/0x78 [ 4.491509] platform_probe+0x64/0xd8 [ 4.495178] really_probe+0xb8/0x2a8 [ 4.498756] __driver_probe_device+0x74/0x118 [ 4.503116] driver_probe_device+0x3c/0x154 [ 4.507303] __device_attach_driver+0xd4/0x160 [ 4.511750] bus_for_each_drv+0x84/0xe0 [ 4.515588] __device_attach_async_helper+0xb0/0xdc [ 4.520470] async_run_entry_fn+0x30/0xd8 [ 4.524481] process_one_work+0x210/0x62c [ 4.528494] worker_thread+0x1ac/0x340 [ 4.532245] kthread+0x10c/0x110 [ 4.535476] ret_from_fork+0x10/0x20 Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250121135833.3769310-1-niklas.soderlund+renesas@ragnatech.se Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
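A minimal sketch of the conversion pattern described above (illustrative type and function names, not the actual gpio-rcar diff): a lock that only serializes register access and is taken from irqchip callbacks can be a raw_spinlock_t, which remains a true spinning lock under lock debugging and PREEMPT_RT and therefore no longer trips the wait-context check.

```c
#include <linux/io.h>
#include <linux/spinlock.h>

struct example_gpio_priv {
	raw_spinlock_t lock;		/* was: spinlock_t lock; */
	void __iomem *base;
};

static void example_config_interrupt(struct example_gpio_priv *p, u32 mask)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&p->lock, flags);		/* was: spin_lock_irqsave() */
	/* read-modify-write of the interrupt configuration register */
	writel(readl(p->base) | mask, p->base);
	raw_spin_unlock_irqrestore(&p->lock, flags);	/* was: spin_unlock_irqrestore() */
}
```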
2025-03-05  gpio: aggregator: protect driver attr handlers against module unload  Koichiro Den
Both new_device_store and delete_device_store touch module global resources (e.g. gpio_aggregator_lock). To prevent race conditions with module unload, a reference needs to be held. Add try_module_get() in these handlers. For new_device_store, this eliminates what appears to be the most dangerous scenario: if an id is allocated from gpio_aggregator_idr but platform_device_register has not yet been called or completed, a concurrent module unload could fail to unregister/delete the device, leaving behind a dangling platform device/GPIO forwarder. This can result in various issues. The following simple reproducer demonstrates these problems: #!/bin/bash while :; do # note: whether 'gpiochip0 0' exists or not does not matter. echo 'gpiochip0 0' > /sys/bus/platform/drivers/gpio-aggregator/new_device done & while :; do modprobe gpio-aggregator modprobe -r gpio-aggregator done & wait Starting with the following warning, several kinds of warnings will appear and the system may become unstable: ------------[ cut here ]------------ list_del corruption, ffff888103e2e980->next is LIST_POISON1 (dead000000000100) WARNING: CPU: 1 PID: 1327 at lib/list_debug.c:56 __list_del_entry_valid_or_report+0xa3/0x120 [...] RIP: 0010:__list_del_entry_valid_or_report+0xa3/0x120 [...] Call Trace: <TASK> ? __list_del_entry_valid_or_report+0xa3/0x120 ? __warn.cold+0x93/0xf2 ? __list_del_entry_valid_or_report+0xa3/0x120 ? report_bug+0xe6/0x170 ? __irq_work_queue_local+0x39/0xe0 ? handle_bug+0x58/0x90 ? exc_invalid_op+0x13/0x60 ? asm_exc_invalid_op+0x16/0x20 ? __list_del_entry_valid_or_report+0xa3/0x120 gpiod_remove_lookup_table+0x22/0x60 new_device_store+0x315/0x350 [gpio_aggregator] kernfs_fop_write_iter+0x137/0x1f0 vfs_write+0x262/0x430 ksys_write+0x60/0xd0 do_syscall_64+0x6c/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e [...] </TASK> ---[ end trace 0000000000000000 ]--- Fixes: 828546e24280 ("gpio: Add GPIO Aggregator") Cc: stable@vger.kernel.org Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Link: https://lore.kernel.org/r/20250224143134.3024598-2-koichiro.den@canonical.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
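A minimal sketch of the guard pattern described above (handler signature simplified, helper name hypothetical): pin the module for the duration of the sysfs store so a concurrent rmmod cannot tear down the state the handler is about to touch.

```c
#include <linux/device/driver.h>
#include <linux/module.h>

/* Hypothetical stand-in for the real parse/idr/platform_device_register() logic. */
static int example_do_new_device(const char *buf, size_t count);

static ssize_t example_new_device_store(struct device_driver *drv,
					const char *buf, size_t count)
{
	int ret;

	if (!try_module_get(THIS_MODULE))	/* module is on its way out */
		return -ENOENT;

	ret = example_do_new_device(buf, count);

	module_put(THIS_MODULE);
	return ret ? ret : count;
}
```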
2025-03-05  fscrypt: Change fscrypt_encrypt_pagecache_blocks() to take a folio  Matthew Wilcox (Oracle)
ext4 and ceph already have a folio to pass; f2fs needs to be properly converted but this will do for now. This removes a reference to page->index and page->mapping as well as removing a call to compound_head(). Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250304170224.523141-1-willy@infradead.org Acked-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05  platform/x86/amd/pmf: Update PMF Driver for Compatibility with new PMF-TA  Shyam Sundar S K
The PMF driver allocates a shared memory buffer using tee_shm_alloc_kernel_buf() for communication with the PMF-TA. The latest PMF-TA version introduces new structures with OEM debug information and additional policy input conditions for evaluating the policy binary. Consequently, the shared memory size must be increased to ensure compatibility between the PMF driver and the updated PMF-TA. To do so, introduce the new PMF-TA UUID and update the PMF shared memory configuration to ensure compatibility with the latest PMF-TA version. Additionally, export the TA UUID. These updates will result in modifications to the prototypes of amd_pmf_tee_init() and amd_pmf_ta_open_session(). Link: https://lore.kernel.org/all/55ac865f-b1c7-fa81-51c4-d211c7963e7e@linux.intel.com/ Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Co-developed-by: Patil Rajesh Reddy <Patil.Reddy@amd.com> Signed-off-by: Patil Rajesh Reddy <Patil.Reddy@amd.com> Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Link: https://lore.kernel.org/r/20250305045842.4117767-2-Shyam-sundar.S-k@amd.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2025-03-05  platform/x86/amd/pmf: Propagate PMF-TA return codes  Shyam Sundar S K
In the amd_pmf_invoke_cmd_init() function within the PMF driver ensure that the actual result from the PMF-TA is returned rather than a generic EIO. This change allows for proper handling of errors originating from the PMF-TA. Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Co-developed-by: Patil Rajesh Reddy <Patil.Reddy@amd.com> Signed-off-by: Patil Rajesh Reddy <Patil.Reddy@amd.com> Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Link: https://lore.kernel.org/r/20250305045842.4117767-1-Shyam-sundar.S-k@amd.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2025-03-05  perf/core: Clean up perf_try_init_event()  Peter Zijlstra
Make sure that perf_try_init_event() doesn't leave event->pmu nor event->destroy set on failure. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250205102449.110145835@infradead.org
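An illustrative sketch of the invariant the clean-up enforces (simplified, not the exact patch; the real function also deals with module references and capability checks): if the pmu's event_init() fails, partially initialized state is torn down so neither event->pmu nor event->destroy is left dangling.

```c
static int example_try_init_event(struct pmu *pmu, struct perf_event *event)
{
	int ret;

	event->pmu = pmu;
	ret = pmu->event_init(event);
	if (ret) {
		if (event->destroy) {
			event->destroy(event);
			event->destroy = NULL;
		}
		event->pmu = NULL;
	}
	return ret;
}
```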
2025-03-05  doc: correcting two prefix errors in idmappings.rst  Aiden Ma
Add the 'k' prefix to id 21000. And id `u1000` in the third idmapping should be mapped to `k31000`, not `u31000`. Signed-off-by: Aiden Ma <jiaheng.ma@foxmail.com> Link: https://lore.kernel.org/r/tencent_4E7B1F143E8051530C21FCADF4E014DCBB06@qq.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05  doc: fix inline emphasis warning  Christian Brauner
Fix a warning spotted by linux-next build (htmldocs): Documentation/filesystems/porting.rst:1186: WARNING: Inline emphasis start-string without end-string. [docutils] Introduced by commit 88d5baf69082 ("Change inode_operations.mkdir to return struct dentry *") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 88d5baf69082 ("Change inode_operations.mkdir to return struct dentry *") Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05  Merge patch series "Change inode_operations.mkdir to return struct dentry *"  Christian Brauner
NeilBrown <neilb@suse.de> says: This revised series contains a few clean-ups as requested by various people but no substantial changes. I reviewed the mkdir functions in many (all?) filesystems and found a few that use d_instantiate() on an unlocked inode (after unlock_new_inode()) and also support export_operations. These could potentially call d_instantiate() on a directory inode which is already attached to a dentry, though making that happen would usually require guessing the filehandle correctly. I haven't tried to address those here, (this patch set doesn't make that situation any worse) but I may in the future. * patches from https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de: VFS: Change vfs_mkdir() to return the dentry. nfs: change mkdir inode_operation to return alternate dentry if needed. fuse: return correct dentry for ->mkdir ceph: return the correct dentry on mkdir hostfs: store inode in dentry after mkdir if possible. Change inode_operations.mkdir to return struct dentry * Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05  VFS: Change vfs_mkdir() to return the dentry.  NeilBrown
vfs_mkdir() does not guarantee to leave the child dentry hashed or make it positive on success, and in many such cases the filesystem had to use a different dentry which it can now return. This patch changes vfs_mkdir() to return the dentry provided by the filesystems which is hashed and positive when provided. This reduces the number of cases where the resulting dentry is not positive to a handful which don't deserve extra efforts. The only callers of vfs_mkdir() which are interested in the resulting inode are in-kernel filesystem clients: cachefiles, nfsd, smb/server. The only filesystems that don't reliably provide the inode are: - kernfs, tracefs which these clients are unlikely to be interested in - cifs in some configurations would need to do a lookup to find the created inode, but doesn't. cifs cannot be exported via NFS, is unlikely to be used by cachefiles, and smb/server only has a soft requirement for the inode, so this is unlikely to be a problem in practice. - hostfs, nfs, cifs may need to do a lookup (rarely for NFS) and it is possible for a race to make that lookup fail. Actual failure is unlikely and, provided callers handle negative dentries gracefully, they will fail safe. So this patch removes the lookup code in nfsd and smb/server and adjusts them to fail safe if a negative dentry is provided: - cache-files already fails safe by restarting the task from the top - it still does with this change, though it no longer calls cachefiles_put_directory() as that will crash if the dentry is negative. - nfsd reports "Server-fault" which is what it used to do if the lookup failed. This will never happen on any file-systems that it can actually export, so this is of no consequence. I removed the fh_update() call as that is not needed and out-of-place. A subsequent nfsd_create_setattr() call will call fh_update() when needed. - smb/server only wants the inode to call ksmbd_smb_inherit_owner() which updates ->i_uid (without calling notify_change() or similar) which can be safely skipped on cifs (I hope). If a different dentry is returned, the first one is put. If necessary the fact that it is new can be determined by comparing pointers. A new dentry will certainly have a new pointer (as the old is put after the new is obtained). Similarly if an error is returned (via ERR_PTR()) the original dentry is put. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-7-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
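A sketch of the new calling convention from an in-kernel client's point of view (hypothetical caller, error code illustrative): the dentry returned by vfs_mkdir() is the one to use and may differ from the one passed in; on failure an ERR_PTR() comes back, and in both the error case and the different-dentry case the original dentry has already been put.

```c
static struct dentry *example_make_subdir(struct mnt_idmap *idmap,
					  struct inode *dir,
					  struct dentry *dentry)
{
	struct dentry *de;

	de = vfs_mkdir(idmap, dir, dentry, 0700);
	if (IS_ERR(de))
		return de;		/* original dentry already put */

	if (d_is_negative(de)) {	/* rare: kernfs, tracefs, some cifs setups */
		dput(de);
		return ERR_PTR(-ESTALE);	/* fail safe, as cachefiles/nfsd now do */
	}

	return de;			/* may differ from the dentry passed in */
}
```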
2025-03-05  nfs: change mkdir inode_operation to return alternate dentry if needed.  NeilBrown
mkdir now allows a different dentry to be returned which is sometimes relevant for nfs. This patch changes the nfs_rpc_ops mkdir op to return a dentry, and passes that back to the caller. The mkdir nfs_rpc_op will return NULL if the original dentry should be used. This matches the mkdir inode_operation. nfs4_do_create() is duplicated to nfs4_do_mkdir() which is changed to handle the specifics of directories. Consequently the current special handling for directories is removed from nfs4_do_create() Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-6-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05  fuse: return correct dentry for ->mkdir  NeilBrown
fuse already uses d_splice_alias() to ensure an appropriate dentry is found for a newly created dentry. Now that ->mkdir can return that dentry, we do so. This requires changing create_new_entry() to return a dentry and handling that change in all callers. Note that when create_new_entry() is asked to create anything other than a directory we can be sure it will NOT return an alternate dentry as d_splice_alias() only returns an alternate dentry for directories. So we don't need to check for that case when passing on the result. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/174112490070.33508.15852253149143067890@noble.neil.brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05  ceph: Fix error handling in fill_readdir_cache()  Matthew Wilcox (Oracle)
__filemap_get_folio() returns an ERR_PTR, not NULL. There are extensive assumptions that ctl->folio is NULL, not an error pointer, so it seems better to fix this one place rather than change all the places which check ctl->folio. Fixes: baff9740bc8f ("ceph: Convert ceph_readdir_cache_control to store a folio") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250304154818.250757-1-willy@infradead.org Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
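A minimal sketch of the one-place fix (fgp flags and surrounding code illustrative, not the actual ceph diff): map the ERR_PTR() returned by __filemap_get_folio() back to the NULL that every user of ctl->folio already expects.

```c
static void example_fill_readdir_cache(struct ceph_readdir_cache_control *ctl,
				       struct address_space *mapping,
				       pgoff_t index)
{
	struct folio *folio = __filemap_get_folio(mapping, index,
						  FGP_LOCK | FGP_CREAT,
						  mapping_gfp_mask(mapping));

	/* __filemap_get_folio() returns an ERR_PTR() on failure, but the rest
	 * of the readdir-cache code checks ctl->folio against NULL. */
	ctl->folio = IS_ERR(folio) ? NULL : folio;
}
```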
2025-03-05  ALSA: hda: intel: Add Dell ALC3271 to power_save denylist  Hoku Ishibe
Dell XPS 13 7390 with the Realtek ALC3271 codec experiences persistent humming noise when the power_save mode is enabled. This issue occurs when the codec enters power saving mode, leading to unwanted noise from the speakers. This patch adds the affected model (PCI ID 0x1028:0x0962) to the power_save denylist to ensure power_save is disabled by default, preventing power-off related noise issues. Steps to Reproduce 1. Boot the system with `snd_hda_intel` loaded. 2. Verify that `power_save` mode is enabled: ```sh cat /sys/module/snd_hda_intel/parameters/power_save ``` output: 10 (default power save timeout) 3. Wait for the power save timeout 4. Observe a persistent humming noise from the speakers 5. Disable `power_save` manually: ```sh echo 0 | sudo tee /sys/module/snd_hda_intel/parameters/power_save ``` 6. Confirm that the noise disappears immediately. This issue has been observed on my system, and this patch successfully eliminates the unwanted noise. If other users experience similar issues, additional reports would be helpful. Signed-off-by: Hoku Ishibe <me@hokuishi.be> Cc: <stable@vger.kernel.org> Link: https://patch.msgid.link/20250224020517.51035-1-me@hokuishi.be Signed-off-by: Takashi Iwai <tiwai@suse.de>
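For reference, a sketch of what such a denylist addition typically looks like in sound/pci/hda/hda_intel.c; the table name and SND_PCI_QUIRK() macro are as used by snd-hda-intel, but treat the exact entry format as an assumption. The PCI SSID 0x1028:0x0962 comes from the text above.

```c
static const struct snd_pci_quirk power_save_denylist[] = {
	/* ... existing entries ... */
	/* Dell XPS 13 7390: humming noise when the codec powers down */
	SND_PCI_QUIRK(0x1028, 0x0962, "Dell XPS 13 7390", 0),
	{}
};
```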
2025-03-05  ALSA: hda/realtek: update ALC222 depop optimize  Kailang Yang
Add ALC222 its own depop functions for alc_init and alc_shutup. [note: this fixes pop noise issues on the models with two headphone jacks -- tiwai ] Signed-off-by: Kailang Yang <kailang@realtek.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2025-03-05  mm/slab: call kmalloc_noprof() unconditionally in kmalloc_array_noprof()  Ye Bin
If 'n' or 'size' isn't a builtin constant, we used to call __kmalloc() before commit 7bd230a26648 ("mm/slab: enable slab allocation tagging for kmalloc and friends"), which inadvertently changed both paths to kmalloc_noprof(). As Harry Yoo points out, we can just call kmalloc_noprof() unconditionally. If the compiler knows n and size are constants it doesn't guarantee that bytes will also be seen as constant, and that is the important test in kmalloc_noprof() anyway, so we can just defer to it always. [ vbabka@suse.cz: change as Harry suggested and adjust commit log ] Fixes: 7bd230a26648 ("mm/slab: enable slab allocation tagging for kmalloc and friends") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
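A minimal sketch of the resulting helper (name suffixed to mark it as an illustration): the overflow check stays, and the constant-size decision is left entirely to kmalloc_noprof().

```c
static inline void *example_kmalloc_array_noprof(size_t n, size_t size, gfp_t flags)
{
	size_t bytes;

	if (unlikely(check_mul_overflow(n, size, &bytes)))
		return NULL;
	/* No separate path for non-constant arguments: kmalloc_noprof()
	 * itself tests whether 'bytes' is a compile-time constant and picks
	 * the appropriate slab path. */
	return kmalloc_noprof(bytes, flags);
}
```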
2025-03-05  x86/sgx: Fix size overflows in sgx_encl_create()  Jarkko Sakkinen
The total size calculated for EPC can overflow u64 given the added up page for SECS. Further, the total size calculated for shmem can overflow even when the EPC size stays within limits of u64, given that it adds the extra space for 128 byte PCMD structures (one for each page). Address this by pre-evaluating the micro-architectural requirement of SGX: the address space size must be power of two. This is eventually checked up by ECREATE but the pre-check has the additional benefit of making sure that there is some space for additional data. Fixes: 888d24911787 ("x86/sgx: Add SGX_IOC_ENCLAVE_CREATE") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20250305050006.43896-1-jarkko@kernel.org Closes: https://lore.kernel.org/linux-sgx/c87e01a0-e7dd-4749-a348-0980d3444f04@stanley.mountain/
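A sketch of the kind of pre-check described above (placement and error code illustrative): rejecting non-power-of-two enclave sizes up front bounds the later EPC and shmem size arithmetic.

```c
#include <linux/log2.h>

static int example_check_encl_size(u64 size)
{
	/* ECREATE requires a power-of-two address-space size anyway; checking
	 * it early guarantees that the added SECS page and the per-page PCMD
	 * space cannot overflow the u64 totals computed later. */
	if (!size || !is_power_of_2(size))
		return -EINVAL;
	return 0;
}
```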
2025-03-05  x86/delay: Fix inconsistent whitespace  Charles Han
Smatch warns about this whitespace damage: arch/x86/lib/delay.c:134 delay_halt_mwaitx() warn: inconsistent indenting Signed-off-by: Charles Han <hanchunchao@inspur.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250305063515.3951-1-hanchunchao@inspur.com
2025-03-04  Merge tag 'x86_microcode_for_v6.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds
Pull AMD microcode loading fixes from Borislav Petkov: - Load only sha256-signed microcode patch blobs - Other good cleanups * tag 'x86_microcode_for_v6.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/microcode/AMD: Load only SHA256-checksummed patches x86/microcode/AMD: Add get_patch_level() x86/microcode/AMD: Get rid of the _load_microcode_amd() forward declaration x86/microcode/AMD: Merge early_apply_microcode() into its single callsite x86/microcode/AMD: Remove unused save_microcode_in_initrd_amd() declarations x86/microcode/AMD: Remove ugly linebreak in __verify_patch_section() signature
2025-03-04  kunit/overflow: Fix DEFINE_FLEX tests for counted_by  Kees Cook
Unfortunately, __builtin_dynamic_object_size() does not take into account flexible array sizes, even when they are sized by __counted_by. As a result, the size tests for the flexible arrays need to be separated to get an accurate check of the compiler's behavior. While at it, fully test sizeof, __struct_size (bdos(..., 0)), and __member_size (bdos(..., 1)). I still think this is a compiler design issue, but there's not much to be done about it currently beyond adjusting these tests. GCC and Clang agree on this behavior at least. :) Reported-by: "Thomas Weißschuh" <linux@weissschuh.net> Closes: https://lore.kernel.org/lkml/e1a1531d-6968-4ae8-a3b5-5ea0547ec4b3@t-8ch.de/ Fixes: 9dd5134c6158 ("kunit/overflow: Adjust for __counted_by with DEFINE_RAW_FLEX()") Signed-off-by: Kees Cook <kees@kernel.org>
2025-03-04  Merge branches 'docs.2025.02.04a', 'lazypreempt.2025.03.04a', 'misc.2025.03.04a', 'srcu.2025.02.05a' and 'torture.2025.02.05a'  Boqun Feng
2025-03-04  rcutorture: Make scenario TREE07 build CONFIG_PREEMPT_LAZY=y  Paul E. McKenney
This commit tests lazy preemption by causing the TREE07 rcutorture scenario to build its kernel with CONFIG_PREEMPT_LAZY=y. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04  rcutorture: Make scenario TREE10 build CONFIG_PREEMPT_LAZY=y  Paul E. McKenney
This commit tests lazy preemption by causing the TREE10 rcutorture scenario to build its kernel with CONFIG_PREEMPT_LAZY=y. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04  rcu: limit PREEMPT_RCU configurations  Ankur Arora
PREEMPT_LAZY can be enabled stand-alone or alongside PREEMPT_DYNAMIC which allows for dynamic switching of preemption models. The choice of PREEMPT_RCU or not, however, is fixed at compile time. Given that PREEMPT_RCU makes some trade-offs to optimize for latency as opposed to throughput, configurations with limited preemption might prefer the stronger forward-progress guarantees of PREEMPT_RCU=n. Accordingly, explicitly limit PREEMPT_RCU=y to the latency oriented preemption models: PREEMPT, PREEMPT_RT, and the runtime configurable model PREEMPT_DYNAMIC. This means the throughput oriented models, PREEMPT_NONE, PREEMPT_VOLUNTARY, and PREEMPT_LAZY will run with PREEMPT_RCU=n. Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04  rcutorture: Update ->extendables check for lazy preemption  Boqun Feng
The rcutorture_one_extend_check() function's second last check assumes that "preempt_count() & PREEMPT_MASK" is non-zero only if RCUTORTURE_RDR_PREEMPT or RCUTORTURE_RDR_SCHED bit is set. This works for preemptible RCU and for non-preemptible RCU running in a non-preemptible kernel. But it fails for non-preemptible RCU running in a preemptible kernel because then rcu_read_lock() is just preempt_disable(), which increases preempt count. This commit therefore adjusts this check to take into account the case of non-preemptible RCU running in a preemptible kernel. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04  rcutorture: Update rcutorture_one_extend_check() for lazy preemption  Paul E. McKenney
The rcutorture_one_extend_check() function's last check assumes that if cur_ops->readlock_nesting() returns greater than zero, either the RCUTORTURE_RDR_RCU_1 or the RCUTORTURE_RDR_RCU_2 bit must be set, that is, there must be at least one rcu_read_lock() in effect. This works for preemptible RCU and for non-preemptible RCU running in a non-preemptible kernel. But it fails for non-preemptible RCU running in a preemptible kernel because then RCU's cur_ops->readlock_nesting() function, which is rcu_torture_readlock_nesting(), will return the PREEMPT_MASK mask bits from preempt_count(). The result will be greater than zero if preemption is disabled, including by the RCUTORTURE_RDR_PREEMPT and RCUTORTURE_RDR_SCHED bits. This commit therefore adjusts this check to take into account the case of non-preemptible RCU running in a preemptible kernel. [boqun: Fix the if condition and add comment] Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202502171415.8ec87c87-lkp@intel.com Co-developed-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04  osnoise: provide quiescent states  Ankur Arora
To reduce RCU noise for nohz_full configurations, osnoise depends on cond_resched() providing quiescent states for PREEMPT_RCU=n configurations. For PREEMPT_RCU=y configurations -- where cond_resched() is a stub -- we do this by directly calling rcu_momentary_eqs(). With (PREEMPT_LAZY=y, PREEMPT_DYNAMIC=n), however, we have a configuration with (PREEMPTION=y, PREEMPT_RCU=n) where neither of the above can help. Handle that by providing an explicit quiescent state here for all configurations. As mentioned above this is not needed for non-stubbed cond_resched(), but, providing a quiescent state here just pulls in one that a future cond_resched() would provide, so doesn't cause any extra work for this configuration. Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Suggested-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04  rcu: Use _full() API to debug synchronize_rcu()  Uladzislau Rezki (Sony)
Switch to using the get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() pair to debug a normal synchronize_rcu() call. Just using the non-full APIs to identify whether a grace period has passed might lead to a false-positive kernel splat. It can happen because get_state_synchronize_rcu() compresses both normal and expedited states into one single unsigned long value, so a poll_state_synchronize_rcu() can miss GP-completion when synchronize_rcu()/synchronize_rcu_expedited() run concurrently. To address this, switch to the poll_state_synchronize_rcu_full() and get_state_synchronize_rcu_full() APIs, which use separate variables for expedited and normal states. Reported-by: cheung wall <zzqq0103.hey@gmail.com> Closes: https://lore.kernel.org/lkml/Z5ikQeVmVdsWQrdD@pc636/T/ Fixes: 988f569ae041 ("rcu: Reduce synchronize_rcu() latency") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20250227131613.52683-3-urezki@gmail.com Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
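The pairing being switched to, shown as a small sketch (the same sequence is quoted in the GP-start detection commit further down): the _full() variants carry separate normal and expedited state, so a concurrent synchronize_rcu_expedited() cannot make the poll report completion early.

```c
static void example_poll_full_gp(void)
{
	struct rcu_gp_oldstate rgos;

	get_state_synchronize_rcu_full(&rgos);
	synchronize_rcu();
	/* Must never fire: the full-state cookie cannot be satisfied by an
	 * unrelated expedited grace period. */
	WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));
}
```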
2025-03-04  rcu: Update TREE05.boot to test normal synchronize_rcu()  Uladzislau Rezki (Sony)
Add extra parameters for the rcutorture module. One is "nfakewriters", which is set to -1: this creates a number of test kthreads corresponding to the number of CPUs in the test system, and those threads randomly invoke synchronize_rcu(). Apart from that, "rcu_normal" is set to 1 because this scenario specifically tests normal synchronize_rcu(), and the newly added "rcu_normal_wake_from_gp" parameter is set to 1 as well, which prevents interaction with other callbacks in the system. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lore.kernel.org/r/20250227131613.52683-2-urezki@gmail.com Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04  rcutorture: Allow a negative value for nfakewriters  Uladzislau Rezki (Sony)
Currently "nfakewriters" parameter can be set to any value but there is no possibility to adjust it automatically based on how many CPUs a system has where a test is run on. To address this, if the "nfakewriters" is set to negative it will be adjusted to num_online_cpus() during torture initialization. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lore.kernel.org/r/20250227131613.52683-1-urezki@gmail.com Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04  Flush console log from kernel_power_off()  Paul E. McKenney
Kernels built with CONFIG_PREEMPT_RT=y can lose significant console output at shutdown time, which hides shutdown-time RCU issues from rcutorture. Therefore, make pr_flush() public and invoke it after the last print in kernel_power_off(). [ paulmck: Apply John Ogness feedback. ] [ paulmck: Apply Sebastian Andrzej Siewior feedback. ] [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Link: https://lore.kernel.org/r/5f743488-dc2a-4f19-bdda-cf50b9314832@paulmck-laptop Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04  context_tracking: Make RCU watch ct_kernel_exit_state() warning  Paul E. McKenney
The WARN_ON_ONCE() in ct_kernel_exit_state() follows the call to ct_state_inc(), which means that RCU is not watching this WARN_ON_ONCE(). This can (and does) result in extraneous lockdep warnings when this WARN_ON_ONCE() triggers. These extraneous warnings are the opposite of helpful. Therefore, invert the WARN_ON_ONCE() condition and move it before the call to ct_state_inc(). This does mean that the ct_state_inc() return value can no longer be used in the WARN_ON_ONCE() condition, so discard this return value and instead use a call to rcu_is_watching_curr_cpu(). This call is executed only in CONFIG_RCU_EQS_DEBUG=y kernels, so there is no added overhead in production use. [Boqun: Add the subsystem tag in the title] Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/bd911cd9-1fe9-447c-85e0-ea811a1dc896@paulmck-laptop Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04  rcu/nocb: Print segment lengths in show_rcu_nocb_gp_state()  Paul E. McKenney
Analysis of an rcutorture callback-based forward-progress test failure was hampered by the lack of ->cblist segment lengths. This commit therefore adds this information, so that what would have been ".W85620.N." (there are some callbacks waiting for grace period sequence number 85620 and some number more that have not yet been assigned to a grace period) now prints as ".W2(85620).N6." (there are 2 callbacks waiting for grace period 85620 and 6 not yet assigned to a grace period). Note that "D" (done), "N" (next and not yet assigned to a grace period, and "B" (bypass, also not yet assigned to a grace period) have just the number of callbacks without the parenthesized grace-period sequence number. In contrast, "W" (waiting for the current grace period) and "R" (ready to wait for the next grace period to start) both have parenthesized grace-period sequence numbers. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04  rcu-tasks: Move RCU Tasks self-tests to core_initcall()  Paul E. McKenney
The timer and hrtimer softirq processing has moved to dedicated threads for kernels built with CONFIG_IRQ_FORCED_THREADING=y. This results in timers not expiring until later in early boot, which in turn causes the RCU Tasks self-tests to hang in kernels built with CONFIG_PROVE_RCU=y, which further causes the entire kernel to hang. One fix would be to make timers work during this time, but there are no known users of RCU Tasks grace periods during that time, so no justification for the added complexity. Not yet, anyway. This commit therefore moves the call to rcu_init_tasks_generic() from kernel_init_freeable() to a core_initcall(). This works because the timer and hrtimer kthreads are created at early_initcall() time. Fixes: 49a17639508c3 ("softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: <linux-trace-kernel@vger.kernel.org> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
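A sketch of the mechanics (function name marked as an example, body elided): a core_initcall() runs after the early_initcall() that creates the timer and hrtimer kthreads, so the self-tests' timers can actually expire.

```c
#include <linux/init.h>

static int __init example_rcu_init_tasks_generic(void)
{
	/* Kick off the RCU Tasks self-tests; this used to be called from
	 * kernel_init_freeable() before this change. */
	return 0;
}
core_initcall(example_rcu_init_tasks_generic);
```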
2025-03-04  rcu: Fix get_state_synchronize_rcu_full() GP-start detection  Paul E. McKenney
The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions use the root rcu_node structure's ->gp_seq field to detect the beginnings and ends of grace periods, respectively. This choice is necessary for the poll_state_synchronize_rcu_full() function because (give or take counter wrap), the following sequence is guaranteed not to trigger: get_state_synchronize_rcu_full(&rgos); synchronize_rcu(); WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos)); The RCU callbacks that awaken synchronize_rcu() instances are guaranteed not to be invoked before the root rcu_node structure's ->gp_seq field is updated to indicate the end of the grace period. However, these callbacks might start being invoked immediately thereafter, in particular, before rcu_state.gp_seq has been updated. Therefore, poll_state_synchronize_rcu_full() must refer to the root rcu_node structure's ->gp_seq field. Because this field is updated under this structure's ->lock, any code following a call to poll_state_synchronize_rcu_full() will be fully ordered after the full grace-period computation, as is required by RCU's memory-ordering semantics. By symmetry, the get_state_synchronize_rcu_full() function should also use this same root rcu_node structure's ->gp_seq field. But it turns out that symmetry is profoundly (though extremely infrequently) destructive in this case. To see this, consider the following sequence of events: 1. CPU 0 starts a new grace period, and updates rcu_state.gp_seq accordingly. 2. As its first step of grace-period initialization, CPU 0 examines the current CPU hotplug state and decides that it need not wait for CPU 1, which is currently offline. 3. CPU 1 comes online, and updates its state. But this does not affect the current grace period, but rather the one after that. After all, CPU 1 was offline when the current grace period started, so all pre-existing RCU readers on CPU 1 must have completed or been preempted before it last went offline. The current grace period therefore has nothing it needs to wait for on CPU 1. 4. CPU 1 switches to an rcutorture kthread which is running rcutorture's rcu_torture_reader() function, which starts a new RCU reader. 5. CPU 2 is running rcutorture's rcu_torture_writer() function and collects a new polled grace-period "cookie" using get_state_synchronize_rcu_full(). Because the newly started grace period has not completed initialization, the root rcu_node structure's ->gp_seq field has not yet been updated to indicate that this new grace period has already started. This cookie is therefore set up for the end of the current grace period (rather than the end of the following grace period). 6. CPU 0 finishes grace-period initialization. 7. If CPU 1’s rcutorture reader is preempted, it will be added to the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer will not be updated.  Thus, this grace period will not wait on it.  Which is only fair, given that the CPU did not come online until after the grace period officially started. 8. CPUs 0 and 2 then detect the new grace period and then report a quiescent state to the RCU core. 9. Because CPU 1 was offline at the start of the current grace period, CPUs 0 and 2 are the only CPUs that this grace period needs to wait on. So the grace period ends and post-grace-period cleanup starts. In particular, the root rcu_node structure's ->gp_seq field is updated to indicate that this grace period has now ended. 10. 
CPU 2 continues running rcu_torture_writer() and sees that, from the viewpoint of the root rcu_node structure consulted by the poll_state_synchronize_rcu_full() function, the grace period has ended.  It therefore updates state accordingly. 11. CPU 1 is still running the same RCU reader, which notices this update and thus complains about the too-short grace period. The fix is for the get_state_synchronize_rcu_full() function to use rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq field. With this change in place, if step 5's cookie indicates that the grace period has not yet started, then any prior code executed by CPU 2 must have happened before CPU 1 came online. This will in turn prevent CPU 1's code in steps 3 and 11 from spanning CPU 2's grace-period wait, thus preventing CPU 1 from being subjected to a too-short grace period. This commit therefore makes this change. Note that there is no change to the poll_state_synchronize_rcu_full() function, which as noted above, must continue to use the root rcu_node structure's ->gp_seq field. This is of course an asymmetry between these two functions, but is an asymmetry that is absolutely required for correct operation. It is a common human tendency to greatly value symmetry, and sometimes symmetry is a wonderful thing. Other times, symmetry results in poor performance. But in this case, symmetry is just plain wrong. Nevertheless, the asymmetry does require an additional adjustment. It is possible for get_state_synchronize_rcu_full() to see a given grace period as having started, but for an immediately following poll_state_synchronize_rcu_full() to see it as having not yet started. Given the current rcu_seq_done_exact() implementation, this will result in a false-positive indication that the grace period is done from poll_state_synchronize_rcu_full(). This is dealt with by making rcu_seq_done_exact() reach back three grace periods rather than just two of them. However, simply changing get_state_synchronize_rcu_full() function to use rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq field results in a theoretical bug in kernels booted with rcutree.rcu_normal_wake_from_gp=1 due to the following sequence of events: o The rcu_gp_init() function invokes rcu_seq_start() to officially start a new grace period. o A new RCU reader begins, referencing X from some RCU-protected list. The new grace period is not obligated to wait for this reader. o An updater removes X, then calls synchronize_rcu(), which queues a wait element. o The grace period ends, awakening the updater, which frees X while the reader is still referencing it. The reason that this is theoretical is that although the grace period has officially started, none of the CPUs are officially aware of this, and thus will have to assume that the RCU reader pre-dated the start of the grace period. Detailed explanation can be found at [2] and [3]. Except for kernels built with CONFIG_PROVE_RCU=y, which use the polled grace-period APIs, which can and do complain bitterly when this sequence of events occurs. Not only that, there might be some future RCU grace-period mechanism that pulls this sequence of events from theory into practice. This commit therefore also pulls the call to rcu_sr_normal_gp_init() to precede that to rcu_seq_start(). Although this fixes commit 91a967fd6934 ("rcu: Add full-sized polling for get_completed*() and poll_state*()"), it is not clear that it is worth backporting this commit. 
First, it took me many weeks to convince rcutorture to reproduce this more frequently than once per year. Second, this cannot be reproduced at all without frequent CPU-hotplug operations, as in waiting all of 50 milliseconds from the end of the previous operation until starting the next one. Third, the TREE03.boot settings cause multi-millisecond delays during RCU grace-period initialization, which greatly increase the probability of the above sequence of events. (Don't do this in production workloads!) Fourth, the TREE03 rcutorture scenario was modified to use four-CPU guest OSes, to have a single-rcu_node combining tree, no testing of RCU priority boosting, and no random preemption, and these modifications were necessary to reproduce this issue in a reasonable timeframe. Fifth, extremely heavy use of get_state_synchronize_rcu_full() and/or poll_state_synchronize_rcu_full() is required to reproduce this, and as of v6.12, only kfree_rcu() uses it, and even then not particularly heavily. [boqun: Apply the fix [1], and add the comment before the moved rcu_sr_normal_gp_init(). Additional links are added for explanation.] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lore.kernel.org/rcu/d90bd6d9-d15c-4b9b-8a69-95336e74e8f4@paulmck-laptop/ [1] Link: https://lore.kernel.org/rcu/20250303001507.GA3994772@joelnvbox/ [2] Link: https://lore.kernel.org/rcu/Z8bcUsZ9IpRi1QoP@pc636/ [3] Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04  vlan: enforce underlying device type  Oscar Maes
Currently, VLAN devices can be created on top of non-ethernet devices. Besides the fact that it doesn't make much sense, this also causes a bug which leaks the address of a kernel function to usermode. When creating a VLAN device, we initialize GARP (garp_init_applicant) and MRP (mrp_init_applicant) for the underlying device. As part of the initialization process, we add the multicast address of each applicant to the underlying device, by calling dev_mc_add. __dev_mc_add uses dev->addr_len to determine the length of the new multicast address. This causes an out-of-bounds read if dev->addr_len is greater than 6, since the multicast addresses provided by GARP and MRP are only 6 bytes long. This behaviour can be reproduced using the following commands: ip tunnel add gretest mode ip6gre local ::1 remote ::2 dev lo ip l set up dev gretest ip link add link gretest name vlantest type vlan id 100 Then, the following command will display the address of garp_pdu_rcv: ip maddr show | grep 01:80:c2:00:00:21 Fix the bug by enforcing the type of the underlying device during VLAN device initialization. Fixes: 22bedad3ce11 ("net: convert multicast list to list_head") Reported-by: syzbot+91161fe81857b396c8a0@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/000000000000ca9a81061a01ec20@google.com/ Signed-off-by: Oscar Maes <oscmaes92@gmail.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250303155619.8918-1-oscmaes92@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
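A minimal sketch of the enforcement described above (hypothetical placement in the VLAN lower-device validation path): only Ethernet-type devices may carry a VLAN, which keeps dev->addr_len consistent with the 6-byte GARP/MRP multicast addresses.

```c
#include <linux/if_arp.h>
#include <linux/netdevice.h>

static int example_vlan_check_real_dev(struct net_device *real_dev,
				       struct netlink_ext_ack *extack)
{
	if (real_dev->type != ARPHRD_ETHER) {
		NL_SET_ERR_MSG_MOD(extack, "VLANs can only be created over Ethernet devices");
		return -EINVAL;
	}
	return 0;
}
```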
2025-03-04  mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr  Krister Johansen
If multiple connection requests attempt to create an implicit mptcp endpoint in parallel, more than one caller may end up in mptcp_pm_nl_append_new_local_addr because none found the address in local_addr_list during their call to mptcp_pm_nl_get_local_id. In this case, the concurrent new_local_addr calls may delete the address entry created by the previous caller. These deletes use synchronize_rcu, but this is not permitted in some of the contexts where this function may be called. During packet recv, the caller may be in a rcu read critical section and have preemption disabled. An example stack: BUG: scheduling while atomic: swapper/2/0/0x00000302 Call Trace: <IRQ> dump_stack_lvl (lib/dump_stack.c:117 (discriminator 1)) dump_stack (lib/dump_stack.c:124) __schedule_bug (kernel/sched/core.c:5943) schedule_debug.constprop.0 (arch/x86/include/asm/preempt.h:33 kernel/sched/core.c:5970) __schedule (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 kernel/sched/features.h:29 kernel/sched/core.c:6621) schedule (arch/x86/include/asm/preempt.h:84 kernel/sched/core.c:6804 kernel/sched/core.c:6818) schedule_timeout (kernel/time/timer.c:2160) wait_for_completion (kernel/sched/completion.c:96 kernel/sched/completion.c:116 kernel/sched/completion.c:127 kernel/sched/completion.c:148) __wait_rcu_gp (include/linux/rcupdate.h:311 kernel/rcu/update.c:444) synchronize_rcu (kernel/rcu/tree.c:3609) mptcp_pm_nl_append_new_local_addr (net/mptcp/pm_netlink.c:966 net/mptcp/pm_netlink.c:1061) mptcp_pm_nl_get_local_id (net/mptcp/pm_netlink.c:1164) mptcp_pm_get_local_id (net/mptcp/pm.c:420) subflow_check_req (net/mptcp/subflow.c:98 net/mptcp/subflow.c:213) subflow_v4_route_req (net/mptcp/subflow.c:305) tcp_conn_request (net/ipv4/tcp_input.c:7216) subflow_v4_conn_request (net/mptcp/subflow.c:651) tcp_rcv_state_process (net/ipv4/tcp_input.c:6709) tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1934) tcp_v4_rcv (net/ipv4/tcp_ipv4.c:2334) ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1)) ip_local_deliver_finish (include/linux/rcupdate.h:813 net/ipv4/ip_input.c:234) ip_local_deliver (include/linux/netfilter.h:314 include/linux/netfilter.h:308 net/ipv4/ip_input.c:254) ip_sublist_rcv_finish (include/net/dst.h:461 net/ipv4/ip_input.c:580) ip_sublist_rcv (net/ipv4/ip_input.c:640) ip_list_rcv (net/ipv4/ip_input.c:675) __netif_receive_skb_list_core (net/core/dev.c:5583 net/core/dev.c:5631) netif_receive_skb_list_internal (net/core/dev.c:5685 net/core/dev.c:5774) napi_complete_done (include/linux/list.h:37 include/net/gro.h:449 include/net/gro.h:444 net/core/dev.c:6114) igb_poll (drivers/net/ethernet/intel/igb/igb_main.c:8244) igb __napi_poll (net/core/dev.c:6582) net_rx_action (net/core/dev.c:6653 net/core/dev.c:6787) handle_softirqs (kernel/softirq.c:553) __irq_exit_rcu (kernel/softirq.c:588 kernel/softirq.c:427 kernel/softirq.c:636) irq_exit_rcu (kernel/softirq.c:651) common_interrupt (arch/x86/kernel/irq.c:247 (discriminator 14)) </IRQ> This problem seems particularly prevalent if the user advertises an endpoint that has a different external vs internal address. In the case where the external address is advertised and multiple connections already exist, multiple subflow SYNs arrive in parallel which tends to trigger the race during creation of the first local_addr_list entries which have the internal address instead. Fix by skipping the replacement of an existing implicit local address if called via mptcp_pm_nl_get_local_id. 
Fixes: d045b9eb95a9 ("mptcp: introduce implicit endpoints") Cc: stable@vger.kernel.org Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250303-net-mptcp-fix-sched-while-atomic-v1-1-f6a216c5a74c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04  net: ethtool: netlink: Allow NULL nlattrs when getting a phy_device  Maxime Chevallier
ethnl_req_get_phydev() is used to lookup a phy_device, in the case an ethtool netlink command targets a specific phydev within a netdev's topology. It takes as a parameter a const struct nlattr *header that's used for error handling : if (!phydev) { NL_SET_ERR_MSG_ATTR(extack, header, "no phy matching phyindex"); return ERR_PTR(-ENODEV); } In the notify path after a ->set operation however, there's no request attributes available. The typical callsite for the above function looks like: phydev = ethnl_req_get_phydev(req_base, tb[ETHTOOL_A_XXX_HEADER], info->extack); So, when tb is NULL (such as in the ethnl notify path), we have a nice crash. It turns out that there's only the PLCA command that is in that case, as the other phydev-specific commands don't have a notification. This commit fixes the crash by passing the cmd index and the nlattr array separately, allowing NULL-checking it directly inside the helper. Fixes: c15e065b46dc ("net: ethtool: Allow passing a phy index for some commands") Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Reported-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com> Link: https://patch.msgid.link/20250301141114.97204-1-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
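A sketch of the reworked lookup. The signature is paraphrased from the commit text (attribute array plus header index instead of a single nlattr), and example_lookup_phy_by_index() is a hypothetical stand-in for the existing phy-index resolution; none of this is taken verbatim from the patch.

```c
/* Hypothetical helper representing the unchanged phyindex lookup. */
static struct phy_device *example_lookup_phy_by_index(struct net_device *dev,
						      u32 phyindex,
						      struct netlink_ext_ack *extack);

static struct phy_device *
example_req_get_phydev(const struct ethnl_req_info *req_info,
		       struct nlattr **tb, unsigned int header,
		       struct netlink_ext_ack *extack)
{
	struct net_device *dev = req_info->dev;

	/* Notify path after a ->set: there are no request attributes, so fall
	 * back to the netdev's attached PHY instead of dereferencing tb. */
	if (!tb || !tb[header])
		return dev ? dev->phydev : NULL;

	return example_lookup_phy_by_index(dev, nla_get_u32(tb[header]), extack);
}
```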
2025-03-04  ppp: Fix KMSAN uninit-value warning with bpf  Jiayuan Chen
Syzbot caught an "KMSAN: uninit-value" warning [1], which is caused by the ppp driver not initializing a 2-byte header when using socket filter. The following code can generate a PPP filter BPF program: ''' struct bpf_program fp; pcap_t *handle; handle = pcap_open_dead(DLT_PPP_PPPD, 65535); pcap_compile(handle, &fp, "ip and outbound", 0, 0); bpf_dump(&fp, 1); ''' Its output is: ''' (000) ldh [2] (001) jeq #0x21 jt 2 jf 5 (002) ldb [0] (003) jeq #0x1 jt 4 jf 5 (004) ret #65535 (005) ret #0 ''' Wen can find similar code at the following link: https://github.com/ppp-project/ppp/blob/master/pppd/options.c#L1680 The maintainer of this code repository is also the original maintainer of the ppp driver. As you can see the BPF program skips 2 bytes of data and then reads the 'Protocol' field to determine if it's an IP packet. Then it read the first byte of the first 2 bytes to determine the direction. The issue is that only the first byte indicating direction is initialized in current ppp driver code while the second byte is not initialized. For normal BPF programs generated by libpcap, uninitialized data won't be used, so it's not a problem. However, for carefully crafted BPF programs, such as those generated by syzkaller [2], which start reading from offset 0, the uninitialized data will be used and caught by KMSAN. [1] https://syzkaller.appspot.com/bug?extid=853242d9c9917165d791 [2] https://syzkaller.appspot.com/text?tag=ReproC&x=11994913980000 Cc: Paul Mackerras <paulus@samba.org> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+853242d9c9917165d791@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/000000000000dea025060d6bc3bc@google.com/ Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250228141408.393864-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04  Merge branch 'fixes-for-ipa-v4-7'  Jakub Kicinski
Luca Weiss says: ==================== Fixes for IPA v4.7 During bringup of IPA v4.7 unfortunately some bits were missed, and it couldn't be tested much back then due to missing features in tqftpserv which caused the modem to not enable correctly. Especially the last commit is important since it makes mobile data actually functional on SoCs with IPA v4.7 like SM6350 - used on the Fairphone 4. Before that, you'd get an IP address on the interface but then e.g. ping never got any response back. ==================== Link: https://patch.msgid.link/20250227-ipa-v4-7-fixes-v1-0-a88dd8249d8a@fairphone.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04  net: ipa: Enable checksum for IPA_ENDPOINT_AP_MODEM_{RX,TX} for v4.7  Luca Weiss
Enable the checksum option for these two endpoints in order to allow mobile data to actually work. Without this, no packets seem to make it through the IPA. Fixes: b310de784bac ("net: ipa: add IPA v4.7 support") Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Reviewed-by: Alex Elder <elder@riscstar.com> Link: https://patch.msgid.link/20250227-ipa-v4-7-fixes-v1-3-a88dd8249d8a@fairphone.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04  net: ipa: Fix QSB data for v4.7  Luca Weiss
As per downstream reference, max_writes should be 12 and max_reads should be 13. Fixes: b310de784bac ("net: ipa: add IPA v4.7 support") Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Reviewed-by: Alex Elder <elder@riscstar.com> Link: https://patch.msgid.link/20250227-ipa-v4-7-fixes-v1-2-a88dd8249d8a@fairphone.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
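A sketch of what the corrected per-SoC data could look like; the structure and field names follow the IPA driver's per-version data files, the values are from the text above, and which QSB master they apply to is an assumption.

```c
static const struct ipa_qsb_data ipa_qsb_data[] = {
	[IPA_QSB_MASTER_DDR] = {
		.max_writes	= 12,
		.max_reads	= 13,
	},
};
```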
2025-03-04  net: ipa: Fix v4.7 resource group names  Luca Weiss
In the downstream IPA driver there's only one group defined for source and destination, and the destination group doesn't have a _DPL suffix. Fixes: b310de784bac ("net: ipa: add IPA v4.7 support") Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Reviewed-by: Alex Elder <elder@riscstar.com> Link: https://patch.msgid.link/20250227-ipa-v4-7-fixes-v1-1-a88dd8249d8a@fairphone.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04  HID: Intel-thc-hid: Intel-quickspi: Correct device state after S4  Even Xu
During the S4 restore flow, the quickspi device is reset by the driver and its state is changed to RESETTED. It needs to be changed to the ENABLED state after S4 re-initialization finishes; otherwise, the device will run in the wrong state and HID input data will be dropped. Signed-off-by: Even Xu <even.xu@intel.com> Fixes: 6912aaf3fd24 ("HID: intel-thc-hid: intel-quickspi: Add PM implementation") Signed-off-by: Jiri Kosina <jkosina@suse.com>