2025-03-10  block: protect nr_requests update using q->elevator_lock  (Nilay Shroff)
The sysfs attribute nr_requests could be simultaneously updated from the elevator switch/update or nr_hw_queue update code paths. The update to nr_requests for each of those code paths runs holding q->elevator_lock. So we should protect access to the sysfs attribute nr_requests using q->elevator_lock instead of q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-6-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10  block: introduce a dedicated lock for protecting queue elevator updates  (Nilay Shroff)
A queue's elevator can be updated either when modifying nr_hw_queues or through the sysfs scheduler attribute. Currently, elevator switching/updating is protected using q->sysfs_lock, but this has led to lockdep splats[1] due to inconsistent lock ordering between q->sysfs_lock and the freeze-lock in multiple block layer call sites. As the scope of q->sysfs_lock is not well-defined, its (mis)use has resulted in numerous lockdep warnings. To address this, introduce a new q->elevator_lock dedicated specifically to protecting elevator switches/updates, and use it instead of q->sysfs_lock for that purpose. While at it, make elv_iosched_load_module() a static function, as it is only called from elv_iosched_store(). Also, remove redundant parameters from the elv_iosched_load_module() function signature. [1] https://lore.kernel.org/all/67637e70.050a0220.3157ee.000c.GAE@google.com/ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-5-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
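[Editor's note] To illustrate the locking rule the patch establishes, a minimal sketch; the helper name below is hypothetical, not the actual block-layer code:

  /* Sketch: elevator switch/update now serialised by q->elevator_lock. */
  static int switch_scheduler_sketch(struct request_queue *q, const char *name)
  {
          int ret;

          mutex_lock(&q->elevator_lock);          /* was: q->sysfs_lock */
          ret = do_elevator_switch(q, name);      /* hypothetical helper */
          mutex_unlock(&q->elevator_lock);
          return ret;
  }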
2025-03-10  block: remove q->sysfs_lock for attributes which don't need it  (Nilay Shroff)
There are a few sysfs attributes in the block layer which don't really need to acquire q->sysfs_lock while being accessed. The reason being, reading/writing a value from/to such attributes is either atomic or can be easily protected using READ_ONCE()/WRITE_ONCE(). Moreover, sysfs attributes are inherently protected with sysfs/kernfs internal locking. So this change helps segregate all existing sysfs attributes for which we can avoid acquiring q->sysfs_lock. For all read-only attributes we removed q->sysfs_lock from the show method of such attributes. If an attribute is read/write, we removed q->sysfs_lock from both the show and store methods of these attributes. We audited all block sysfs attributes and found the following list of attributes which shouldn't require q->sysfs_lock protection:
1. io_poll: Write to this attribute is ignored. So, we don't need q->sysfs_lock.
2. io_poll_delay: Write to this attribute is a NOP, so we don't need q->sysfs_lock.
3. io_timeout: Write to this attribute updates q->rq_timeout and read of this attribute returns the value stored in q->rq_timeout. Moreover, q->rq_timeout is set only once when we init the queue (under blk_mq_init_allocated_queue()) even before the disk is added. So that means we don't need to protect it with q->sysfs_lock. As this attribute is not directly correlated with anything else, simply using READ_ONCE/WRITE_ONCE should be enough.
4. nomerges: Write to this attribute file updates two q->flags: QUEUE_FLAG_NOMERGES and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which doesn't run with q->sysfs_lock held anyway. Moreover, the q->flags are updated/accessed with bitops which are atomic. So, protecting it with q->sysfs_lock is not necessary.
5. rq_affinity: Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi() using the test_bit macro. As read/write to q->flags uses bitops which are atomic, protecting it with q->sysfs_lock is not necessary.
6. nr_zones: Write to this attribute happens in the driver probe method (except nvme) before the disk is added and outside of q->sysfs_lock or any other lock. Moreover nr_zones is defined as "unsigned int" and so reading this attribute, even when it's simultaneously being updated on another CPU, should not return a torn value on any architecture supported by Linux. So we can avoid using q->sysfs_lock or any other lock/protection while reading this attribute.
7. discard_zeroes_data: Reading of this attribute always returns 0, so we don't require holding q->sysfs_lock.
8. write_same_max_bytes: Reading of this attribute always returns 0, so we don't require holding q->sysfs_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-4-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
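[Editor's note] As an example of point 3, a sketch of lockless io_timeout accessors using READ_ONCE()/WRITE_ONCE(); the show/store signatures are simplified assumptions, not the exact block/blk-sysfs.c prototypes:

  static ssize_t io_timeout_show_sketch(struct request_queue *q, char *page)
  {
          return sysfs_emit(page, "%u\n",
                            jiffies_to_msecs(READ_ONCE(q->rq_timeout)));
  }

  static ssize_t io_timeout_store_sketch(struct request_queue *q,
                                         const char *page, size_t count)
  {
          unsigned int val;

          if (kstrtou32(page, 10, &val) || val == 0)
                  return -EINVAL;

          WRITE_ONCE(q->rq_timeout, msecs_to_jiffies(val));
          return count;
  }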
2025-03-10  block: move q->sysfs_lock and queue-freeze under show/store method  (Nilay Shroff)
In preparation to further simplify and group sysfs attributes which don't require locking or require some form of locking other than q->limits_lock, move acquire/release of q->sysfs_lock and queue freeze/unfreeze under each attribute's respective show/store method. While we are at it, also remove ->load_module() as it's used to load the module before the queue is frozen. Now that queue-freeze has moved under ->store(), we can load the module directly from the attribute's store method before we actually start freezing the queue. Currently, ->load_module() is only used by the "scheduler" attribute, so we now load the relevant elevator module before we start freezing the queue in elv_iosched_store(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
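[Editor's note] A rough sketch of the resulting ->store() shape (ordering only; the attribute, helpers and exact freeze API usage here are illustrative assumptions):

  static ssize_t example_store_sketch(struct request_queue *q,
                                      const char *page, size_t count)
  {
          ssize_t ret;

          /* 1. load any module the new value needs, before freezing */
          load_needed_module(page);               /* hypothetical helper */

          /* 2. freeze and lock only around the actual update */
          blk_mq_freeze_queue(q);
          mutex_lock(&q->sysfs_lock);
          ret = apply_update(q, page, count);     /* hypothetical helper */
          mutex_unlock(&q->sysfs_lock);
          blk_mq_unfreeze_queue(q);

          return ret;
  }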
2025-03-10  block: acquire q->limits_lock while reading sysfs attributes  (Nilay Shroff)
There are a few sysfs attributes (RW) whose store method is protected with q->limits_lock, however the corresponding show method of these attributes runs holding q->sysfs_lock and that doesn't make sense, as ideally the show method of these attributes should also run holding q->limits_lock instead of q->sysfs_lock. Hence update the show method of these sysfs attributes so that reading them acquires q->limits_lock instead of q->sysfs_lock. Similarly, there are a few sysfs attributes (RO) whose show method is currently protected with q->sysfs_lock, however updates to these attributes could occur using the atomic limit update APIs such as queue_limits_start_update() and queue_limits_commit_update() which run holding q->limits_lock. So that means reading these attributes holding q->sysfs_lock doesn't make sense. Hence update the show method of these sysfs attributes (RO) such that they run holding q->limits_lock instead of q->sysfs_lock. We have defined a new macro QUEUE_LIM_RO_ENTRY() which uses the new ->show_limit() method and runs holding q->limits_lock. All existing sysfs attributes (RO) which need protection using q->limits_lock while reading have now been updated to use this new macro for initialization. Also, the existing QUEUE_LIM_RW_ENTRY() is updated to use the new ->show_limit() method for reading attributes instead of the existing ->show() method. As ->show_limit() runs holding q->limits_lock, the existing sysfs attributes (RW) requiring protection are now inherently protected using q->limits_lock instead of q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-2-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10  sched/clock: Don't define sched_clock_irqtime as static key  (Yafang Shao)
sched_clock_irqtime was defined as a static key in: 8722903cbb8f ("sched: Define sched_clock_irqtime as static key"). However, this change introduces a 'sleeping in atomic context' warning:
  arch/x86/kernel/tsc.c:1214 mark_tsc_unstable() warn: sleeping in atomic context
As analyzed by Dan, the affected code path is as follows:
  vcpu_load() <- disables preempt
  -> kvm_arch_vcpu_load()
     -> mark_tsc_unstable() <- sleeps

virt/kvm/kvm_main.c
  166 void vcpu_load(struct kvm_vcpu *vcpu)
  167 {
  168         int cpu = get_cpu();        <- this get_cpu() disables preemption
  169
  170         __this_cpu_write(kvm_running_vcpu, vcpu);
  171         preempt_notifier_register(&vcpu->preempt_notifier);
  172         kvm_arch_vcpu_load(vcpu, cpu);
  173         put_cpu();
  174 }

arch/x86/kvm/x86.c
  4979         if (unlikely(vcpu->cpu != cpu) || kvm_check_tsc_unstable()) {
  4980                 s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
  4981                         rdtsc() - vcpu->arch.last_host_tsc;
  4982                 if (tsc_delta < 0)
  4983                         mark_tsc_unstable("KVM discovered backwards TSC");

arch/x86/kernel/tsc.c
  1206 void mark_tsc_unstable(char *reason)
  1207 {
  1208         if (tsc_unstable)
  1209                 return;
  1210
  1211         tsc_unstable = 1;
  1212         if (using_native_sched_clock())
  1213                 clear_sched_clock_stable();
  --> 1214     disable_sched_clock_irqtime();

kernel/jump_label.c
  245 void static_key_disable(struct static_key *key)
  246 {
  247         cpus_read_lock();           <- this lock has a might_sleep() in it which triggers the static checker warning
  248         static_key_disable_cpuslocked(key);
  249         cpus_read_unlock();
  250 }

Let's revert this change for now as {disable,enable}_sched_clock_irqtime are used in many places, as pointed out by Sean, including the following:
The code path in clocksource_watchdog():
  clocksource_watchdog()
  -> spin_lock(&watchdog_lock);
  -> __clocksource_unstable()
     -> clocksource.mark_unstable() == tsc_cs_mark_unstable()
        -> disable_sched_clock_irqtime()

And the code path in sched_clock_register():
  /* Cannot register a sched_clock with interrupts on */
  local_irq_save(flags);
  ...
  /* Enable IRQ time accounting if we have a fast enough sched_clock() */
  if (irqtime > 0 || (irqtime == -1 && rate >= 1000000))
          enable_sched_clock_irqtime();
  local_irq_restore(flags);

[ lkp@intel.com: reported a build error in the prev version ] [ mingo: cherry-picked it over into sched/urgent ] Closes: https://lore.kernel.org/kvm/37a79ba3-9ce0-479c-a5b0-2bd75d573ed3@stanley.mountain/ Fixes: 8722903cbb8f ("sched: Define sched_clock_irqtime as static key") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Debugged-by: Dan Carpenter <dan.carpenter@linaro.org> Debugged-by: Sean Christopherson <seanjc@google.com> Debugged-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20250205032438.14668-1-laoar.shao@gmail.com
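[Editor's note] The revert restores the earlier plain-integer flag, which can be flipped from atomic context without taking cpus_read_lock(); roughly, as a sketch of the pre-static-key form:

  static int sched_clock_irqtime;

  void enable_sched_clock_irqtime(void)
  {
          sched_clock_irqtime = 1;        /* plain store: no locks, never sleeps */
  }

  void disable_sched_clock_irqtime(void)
  {
          sched_clock_irqtime = 0;        /* safe from mark_tsc_unstable() in atomic context */
  }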
2025-03-10  ASoC: rt722-sdca: add missing readable registers  (Bard Liao)
SDW_SDCA_CTL(FUNC_NUM_MIC_ARRAY, RT722_SDCA_ENT_FU15, RT722_SDCA_CTL_FU_CH_GAIN, CH_01) ... SDW_SDCA_CTL(FUNC_NUM_MIC_ARRAY, RT722_SDCA_ENT_FU15, RT722_SDCA_CTL_FU_CH_GAIN, CH_04) are used by the "FU15 Boost Volume" control, but not marked as readable. And the mbq size is 2 for those registers. Fixes: 7f5d6036ca005 ("ASoC: rt722-sdca: Add RT722 SDCA driver") Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Reviewed-by: Shuming Fan <shumingf@realtek.com> Link: https://patch.msgid.link/20250310080440.58797-1-yung-chuan.liao@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-03-10  m68k: setup: Remove size argument when calling strscpy()  (Thorsten Blum)
The size parameter of strscpy() is optional and specifying the size of the destination buffer is unnecessary. Remove it to simplify the code. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Tested-by: Jean-Michel Hautbois <jeanmichel.hautbois@yoseli.org> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/20250302230532.245884-2-thorsten.blum@linux.dev Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
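[Editor's note] For reference, the two forms; the 2-argument variant infers the size from the destination array (dst/src below are illustrative):

  char dst[32];

  strscpy(dst, src, sizeof(dst));   /* before: explicit size argument */
  strscpy(dst, src);                /* after: size derived from sizeof(dst) */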
2025-03-10  Merge tag 'thunderbolt-for-v6.14-rc7' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt into usb-linus  (Greg Kroah-Hartman)
Mika writes:
thunderbolt: Fix for v6.14-rc7
This includes a single USB4/Thunderbolt fix for v6.14-rc7:
 - Fix use-after-free in resume from hibernate.
This has been in linux-next with no reported issues.
* tag 'thunderbolt-for-v6.14-rc7' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt:
  thunderbolt: Prevent use-after-free in resume from hibernate
2025-03-10  x86/sgx: Warn explicitly if X86_FEATURE_SGX_LC is not enabled  (Vladis Dronov)
The kernel requires X86_FEATURE_SGX_LC to be able to create SGX enclaves, not just X86_FEATURE_SGX. There is quite a lot of hardware which has X86_FEATURE_SGX but not X86_FEATURE_SGX_LC. A kernel running on such hardware does not create the /dev/sgx_enclave file, and it does so silently. Explicitly warn if X86_FEATURE_SGX_LC is not enabled to properly notify users that the kernel disabled the SGX driver. X86_FEATURE_SGX_LC, a.k.a. SGX Launch Control, is a CPU feature that enables the LE (Launch Enclave) hash MSRs to be writable (with additional opt-in required in the 'feature control' MSR) when running enclaves, i.e. using a custom root key rather than the Intel proprietary key for enclave signing. I've hit this issue myself and have spent some time researching where my /dev/sgx_enclave file went on SGX-enabled hardware. Related links: https://github.com/intel/linux-sgx/issues/837 https://patchwork.kernel.org/project/platform-driver-x86/patch/20180827185507.17087-3-jarkko.sakkinen@linux.intel.com/ [ mingo: Made the error message a bit more verbose, and added other cases where the kernel fails to create the /dev/sgx_enclave device node. ] Signed-off-by: Vladis Dronov <vdronov@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Kai Huang <kai.huang@intel.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250309172215.21777-2-vdronov@redhat.com
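[Editor's note] The shape of the added check is roughly the following; the message text and exact location are illustrative, not quoted from the patch:

  if (!cpu_feature_enabled(X86_FEATURE_SGX_LC)) {
          pr_info("sgx: SGX Launch Control (SGX_LC) not available, /dev/sgx_enclave disabled\n");
          return -ENODEV;
  }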
2025-03-10  gpio: adnp: use new line value setter callbacks  (Bartosz Golaszewski)
struct gpio_chip now has callbacks for setting line values that return an integer, allowing them to indicate failures. Convert the driver to use them. Link: https://lore.kernel.org/r/20250306-gpiochip-set-conversion-v2-2-a76e72e21425@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
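[Editor's note] A sketch of this kind of conversion for a hypothetical driver; the struct gpio_chip member name used below (set_rv) and the register/field names are assumptions, not taken from this patch:

  struct mychip { struct regmap *map; };   /* hypothetical driver data */

  /* New-style setter: returns 0 or a negative errno instead of void. */
  static int mychip_gpio_set(struct gpio_chip *gc, unsigned int offset, int value)
  {
          struct mychip *priv = gpiochip_get_data(gc);

          return regmap_update_bits(priv->map, MYCHIP_OUT_REG,
                                    BIT(offset), value ? BIT(offset) : 0);
  }

  /* in probe: gc->set_rv = mychip_gpio_set;  (member name assumed) */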
2025-03-10  gpio: adnp: use lock guards for the I2C lock  (Bartosz Golaszewski)
Reduce the code complexity by using automatic lock guards with the I2C mutex. Link: https://lore.kernel.org/r/20250306-gpiochip-set-conversion-v2-1-a76e72e21425@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
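[Editor's note] The guard pattern from <linux/cleanup.h> referred to here, in sketch form (type and helper names are hypothetical):

  #include <linux/cleanup.h>
  #include <linux/mutex.h>

  struct my_i2c_chip { struct mutex i2c_lock; /* ... */ };

  static int read_regs_sketch(struct my_i2c_chip *priv)
  {
          guard(mutex)(&priv->i2c_lock);   /* mutex_unlock() runs automatically on every return path */

          return my_i2c_read(priv);        /* hypothetical helper */
  }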
2025-03-10  gpio: aspeed-sgpio: use new line value setter callbacks  (Bartosz Golaszewski)
struct gpio_chip now has callbacks for setting line values that return an integer, allowing them to indicate failures. Convert the driver to use them. Link: https://lore.kernel.org/r/20250303-gpiochip-set-conversion-v1-15-1d5cceeebf8b@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-03-10  gpio: aspeed-sgpio: use lock guards  (Bartosz Golaszewski)
Reduce the code complexity by using automatic lock guards with the raw spinlock. Link: https://lore.kernel.org/r/20250303-gpiochip-set-conversion-v1-14-1d5cceeebf8b@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
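[Editor's note] The same idea for a raw spinlock with interrupt state saved, as a sketch (struct and field names hypothetical):

  struct my_gpio { raw_spinlock_t lock; };

  static void set_reg_sketch(struct my_gpio *gpio, void __iomem *addr, u32 val)
  {
          guard(raw_spinlock_irqsave)(&gpio->lock);   /* unlock + irqrestore on return */

          iowrite32(val, addr);
  }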
2025-03-10  gpio: aspeed: use new line value setter callbacks  (Bartosz Golaszewski)
struct gpio_chip now has callbacks for setting line values that return an integer, allowing them to indicate failures. Convert the driver to use them. Link: https://lore.kernel.org/r/20250303-gpiochip-set-conversion-v1-13-1d5cceeebf8b@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-03-10  gpio: aspeed: use lock guards  (Bartosz Golaszewski)
Reduce the code complexity by using automatic lock guards with the raw spinlock. Link: https://lore.kernel.org/r/20250303-gpiochip-set-conversion-v1-12-1d5cceeebf8b@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-03-10  gpio: arizona: use new line value setter callbacks  (Bartosz Golaszewski)
struct gpio_chip now has callbacks for setting line values that return an integer, allowing them to indicate failures. Convert the driver to use them. Reviewed-by: Richard Fitzgerald <rf@opensource.cirrus.com> Link: https://lore.kernel.org/r/20250303-gpiochip-set-conversion-v1-11-1d5cceeebf8b@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-03-10  gpio: amd-fch: use new line value setter callbacks  (Bartosz Golaszewski)
struct gpio_chip now has callbacks for setting line values that return an integer, allowing them to indicate failures. Convert the driver to use them. Link: https://lore.kernel.org/r/20250303-gpiochip-set-conversion-v1-10-1d5cceeebf8b@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-03-10  gpio: amd8111: use new line value setter callbacks  (Bartosz Golaszewski)
struct gpio_chip now has callbacks for setting line values that return an integer, allowing them to indicate failures. Convert the driver to use them. Link: https://lore.kernel.org/r/20250303-gpiochip-set-conversion-v1-9-1d5cceeebf8b@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-03-10  gpio: altera: use new line value setter callbacks  (Bartosz Golaszewski)
struct gpio_chip now has callbacks for setting line values that return an integer, allowing them to indicate failures. Convert the driver to use them. Link: https://lore.kernel.org/r/20250303-gpiochip-set-conversion-v1-8-1d5cceeebf8b@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-03-10  gpio: altera-a10sr: use new line value setter callbacks  (Bartosz Golaszewski)
struct gpio_chip now has callbacks for setting line values that return an integer, allowing them to indicate failures. Convert the driver to use them. Link: https://lore.kernel.org/r/20250303-gpiochip-set-conversion-v1-7-1d5cceeebf8b@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-03-10  gpio: adp5585: use new line value setter callbacks  (Bartosz Golaszewski)
struct gpio_chip now has callbacks for setting line values that return an integer, allowing them to indicate failures. Convert the driver to use them. Acked-by: Michael Hennerich <michael.hennerich@analog.com> Link: https://lore.kernel.org/r/20250303-gpiochip-set-conversion-v1-6-1d5cceeebf8b@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-03-10  gpio: adp5520: use new line value setter callbacks  (Bartosz Golaszewski)
struct gpio_chip now has callbacks for setting line values that return an integer, allowing them to indicate failures. Convert the driver to use them. Acked-by: Michael Hennerich <michael.hennerich@analog.com> Link: https://lore.kernel.org/r/20250303-gpiochip-set-conversion-v1-5-1d5cceeebf8b@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-03-10  gpio: adnp: use devm_mutex_init()  (Bartosz Golaszewski)
The mutex initialized in probe() is never cleaned up. Use devm_mutex_init() to do it automatically. Link: https://lore.kernel.org/r/20250303-gpiochip-set-conversion-v1-3-1d5cceeebf8b@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
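[Editor's note] A sketch of the devm-managed form (driver-specific names are placeholders):

  static int my_probe_sketch(struct i2c_client *client)
  {
          struct my_i2c_chip *priv;        /* hypothetical driver data with a struct mutex i2c_lock */
          int err;

          priv = devm_kzalloc(&client->dev, sizeof(*priv), GFP_KERNEL);
          if (!priv)
                  return -ENOMEM;

          /* mutex_destroy() is scheduled automatically on driver detach */
          err = devm_mutex_init(&client->dev, &priv->i2c_lock);
          if (err)
                  return err;

          return 0;
  }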
2025-03-10  gpio: 74x164: use new line value setter callbacks  (Bartosz Golaszewski)
struct gpio_chip now has callbacks for setting line values that return an integer, allowing them to indicate failures. Convert the driver to use them. Link: https://lore.kernel.org/r/20250303-gpiochip-set-conversion-v1-1-1d5cceeebf8b@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-03-10  dt-bindings: gpio: vf610: Add i.MX94 support  (Frank Li)
Add compatible string "fsl,imx94-gpio" for the i.MX94 chip, which is backward compatible with i.MX8ULP. Set it to fall back to "fsl,imx8ulp-gpio". Signed-off-by: Frank Li <Frank.Li@nxp.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20250306170921.241690-1-Frank.Li@nxp.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-03-10  Merge afs RCU pathwalk fix  (Christian Brauner)
Bring in the fix for afs_atcell_get_link() to handle RCU pathwalk from the afs branch for this cycle. This fix has to go upstream now. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-10  Merge tag 'afs-next-20250310' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs  (Christian Brauner)
This fixes an occasional hang that's only really encountered when rmmod'ing the kafs module, which is one of the reasons why I'm proposing it for the next merge window rather than immediate upstreaming. The changes include:
(1) Remove the "-o autocell" mount option. This is obsolete with the dynamic root and removing it makes the next patch slightly easier.
(2) Change how the dynamic root mount is constructed. Currently, the root directory is (de)populated when it is (un)mounted if there are cells already configured and, further, pairs of automount points have to be created/removed each time a cell is added/deleted. This is changed so that readdir on the root dir lists all the known cell automount pairs plus the @cell symlinks, and the inodes and dentries are constructed by lookup on demand. This simplifies the cell management code.
(3) A few improvements to the afs_volume tracepoint.
(4) A few improvements to the afs_server tracepoint.
(5) Pass trace info into the afs_lookup_cell() function to allow the trace log to indicate the purpose of the lookup.
(6) Remove the 'net' parameter from afs_unuse_cell() as it's superfluous.
(7) In rxrpc, allow a kernel app (such as kafs) to store a word of information on rxrpc_peer records.
(8) Use the information stored on the rxrpc_peer record to point to the afs_server record. This allows the server address lookup to be done away with.
(9) Simplify the afs_server ref/activity accounting to make each one self-contained and not garbage collected from the cell management work item.
(10) Simplify the afs_cell ref/activity accounting to make each one of these also self-contained and not driven by a central management work item. The current code was intended to make it such that a single timer for the namespace and one work item per cell could do all the work required to maintain these records. This, however, made for some sequencing problems when cleaning up these records. Further, the attempt to pass refs along with timers and work items made getting it right rather tricky when the timer or work item already had a ref attached and a ref then had to be got rid of.
* tag 'afs-next-20250310' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Simplify cell record handling
  afs: Fix afs_server ref accounting
  afs: Use the per-peer app data provided by rxrpc
  rxrpc: Allow the app to store private data on peer structs
  afs: Drop the net parameter from afs_unuse_cell()
  afs: Make afs_lookup_cell() take a trace note
  afs: Improve server refcount/active count tracing
  afs: Improve afs_volume tracing to display a debug ID
  afs: Change dynroot to create contents on demand
  afs: Remove the "autocell" mount option
https://lore.kernel.org/r/802823.1741600633@warthog.procyon.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-10  afs: Simplify cell record handling  (David Howells)
Simplify afs_cell record handling to avoid very occasional races that cause module removal to hang (it waits for all cell records to be removed). There are two things that particularly contribute to the difficulty: firstly, the code tries to pass a ref on the cell to the cell's maintenance work item (which gets awkward if the work item is already queued); and, secondly, there's an overall cell manager that tries to use just one timer for the entire cell collection (to avoid having loads of timers). However, both of these are probably unnecessarily restrictive. To simplify this, the following changes are made:
(1) The cell record collection manager is removed. Each cell record manages itself individually.
(2) Each afs_cell is given a second work item (cell->destroyer) that is queued when its refcount reaches zero. This is not done in the context of the putting thread as it might be in an inconvenient place to sleep.
(3) Each afs_cell is given its own timer. The timer is used to expire the cell record after a period of unuse if not otherwise pinned and can also be used for other maintenance tasks if necessary (of which there are currently none as DNS refresh is triggered by filesystem operations).
(4) The afs_cell manager work item (cell->manager) is no longer given a ref on the cell when queued; rather, the manager must be deleted. This does away with the need to deal with the consequences of losing a race to queue cell->manager. Clean up of extra queuing is deferred to the destroyer.
(5) The cell destroyer work item makes sure the cell timer is removed and that the normal cell work is cancelled before farming the actual destruction off to RCU.
(6) When a network namespace is destroyed or the kafs module is unloaded, it's now a simple matter of marking the namespace as dead and then just waking up all the cell work items. They will then remove and destroy themselves once all remaining activity counts and/or ref counts are dropped. This makes sure that all server records are dropped first.
(7) The cell record state set is reduced to just four states: SETTING_UP, ACTIVE, REMOVING and DEAD. The record persists in the active state even when it's not being used until the time comes to remove it, rather than being downgraded to an inactive state from whence it can be restored. This means that the cell still appears in /proc and /afs when not in use until it switches to the REMOVING state - at which point it is removed. Note that the REMOVING state is included so that someone wanting to resurrect the cell record is forced to wait whilst the cell is torn down in that state. Once it's in the DEAD state, it has been removed from the net->cells tree, is no longer findable and can be replaced.
Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-16-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-12-dhowells@redhat.com/ # v4
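[Editor's note] Points (2) and (3) describe a per-object timer plus destroyer-work pattern; in sketch form (names and simplifications are the editor's, not the fs/afs code):

  struct cell_like {
          refcount_t         ref;
          struct timer_list  timer;       /* per-cell expiry timer */
          struct work_struct destroyer;   /* queued when the last ref is dropped */
  };

  static void cell_timer_fired(struct timer_list *t)
  {
          struct cell_like *cell = from_timer(cell, t, timer);

          queue_work(system_wq, &cell->destroyer);   /* expire after a period of unuse */
  }

  static void cell_put(struct cell_like *cell)
  {
          /* never tear down in the putting thread: it may not be allowed to sleep */
          if (refcount_dec_and_test(&cell->ref))
                  queue_work(system_wq, &cell->destroyer);
  }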
2025-03-10  afs: Fix afs_server ref accounting  (David Howells)
The current way that afs_server refs are accounted and cleaned up sometimes causes rmmod to hang when it is waiting for cell records to be removed. The problem is that the cell cleanup might occasionally happen before the server cleanup and then there's nothing that causes the cell to garbage-collect the remaining servers as they become inactive. Partially fix this by:
(1) Give each afs_server record its own management timer rather than relying on the cell manager's central timer to drive each individual cell's maintenance work item to garbage collect servers. This timer is set when afs_unuse_server() reduces a server's activity count to zero and will schedule the server's destroyer work item upon firing.
(2) Give each afs_server record its own destroyer work item that removes the record from the cell's database, shuts down the timer, cancels any pending work for itself, and sends an RPC to the server to cancel outstanding callbacks. This change, in combination with the timer, obviates the need to try and coordinate so closely between the cell record and a bunch of other server records to try and tear everything down in a coordinated fashion. With this, the cell record is pinned until the server RCU is complete and namespace/module removal will wait until all the cell records are removed.
(3) Now that incoming calls are mapped to servers (and thus cells) using data attached to an rxrpc_peer, the UUID-to-server mapping tree is moved from the namespace to the cell (cell->fs_servers). This means there can no longer be duplicates therein - and that allows the mapping tree to be simpler as there doesn't need to be a chain of same-UUID servers that are in different cells.
(4) The lock protecting the UUID mapping tree is switched to an rw_semaphore on the cell rather than a seqlock on the namespace as it's now only used during mounting, in contexts in which we're allowed to sleep.
(5) When it comes time for a cell that is being removed to purge its set of servers, it just needs to iterate over them and wake them up. Once a server becomes inactive, its destroyer work item will observe the state of the cell and immediately remove that record.
(6) When a server record is removed, it is marked AFS_SERVER_FL_EXPIRED to prevent reattempts at removal. The record will be dispatched to RCU for destruction once its refcount reaches 0.
(7) The AFS_SERVER_FL_UNCREATED/CREATING flags are used to synchronise simultaneous creation attempts. If one attempt fails, it will abandon the attempt and allow another to try again. Note that the record can't just be abandoned when dead as it's bound into a server list attached to a volume and is only subject to replacement if the server list obtained for the volume from the VLDB changes.
Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-15-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-11-dhowells@redhat.com/ # v4
2025-03-10  afs: Use the per-peer app data provided by rxrpc  (David Howells)
Make use of the per-peer application data that rxrpc now allows the application to store on the rxrpc_peer struct to hold a back pointer to the afs_server record that peer represents an endpoint for. Then, when a call comes in to the AFS cache manager, this can be used to map it to the correct server record rather than having to use a UUID-to-server mapping table and having to do an additional lookup. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-14-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-10-dhowells@redhat.com/ # v4
2025-03-10  rxrpc: Allow the app to store private data on peer structs  (David Howells)
Provide a way for the application (e.g. the afs filesystem) to store private data on the rxrpc_peer structs for later retrieval via the call object. This will allow afs to store a pointer to the afs_server object on the rxrpc_peer struct, thereby obviating the need for afs to keep lookup tables by which it can associate an incoming call with server that transmitted it. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Jakub Kicinski <kuba@kernel.org> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Paolo Abeni <pabeni@redhat.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org cc: netdev@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-13-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-9-dhowells@redhat.com/ # v4
2025-03-10  afs: Drop the net parameter from afs_unuse_cell()  (David Howells)
Remove the redundant net parameter to afs_unuse_cell() as cell->net can be used instead. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-12-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-8-dhowells@redhat.com/ # v4
2025-03-10  afs: Make afs_lookup_cell() take a trace note  (David Howells)
Pass a note to be added to the afs_cell tracepoint to afs_lookup_cell() so that different callers can be distinguished. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-11-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-7-dhowells@redhat.com/ # v4
2025-03-10  afs: Improve server refcount/active count tracing  (David Howells)
Improve server refcount/active count tracing to distinguish between simply getting/putting a ref and using/unusing the server record (which changes the activity count as well as the refcount). This makes it a bit easier to work out what's going on. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-10-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-6-dhowells@redhat.com/ # v4
2025-03-10  afs: Improve afs_volume tracing to display a debug ID  (David Howells)
Improve the tracing of afs_volume objects to include displaying a debug ID so that different instances of volumes with the same "vid" can be distinguished. Also be consistent about displaying the volume's refcount (and not the cell's). Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-9-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-5-dhowells@redhat.com/ # v4
2025-03-10  afs: Change dynroot to create contents on demand  (David Howells)
Change the AFS dynamic root to do things differently:
(1) Rather than having the creation of cell records create inodes and dentries for cell mountpoints, create them on demand during lookup. This simplifies cell management and locking as we no longer have to create these objects in advance *and* on speculative lookup by the user for a cell that isn't precreated.
(2) Rather than using the libfs dentry-based readdir (the dentries now no longer exist until accessed from (1)), have readdir generate the contents by reading the list of cells. The @cell symlinks get pushed in positions 2 and 3 if rootcell has been configured.
(3) Make the @cell symlink dentries persist for the life of the superblock or until reclaimed, but make cell mountpoints disappear immediately if unused. It's not perfect as someone doing an "ls -l /afs" may create a whole bunch of dentries which will be garbage collected immediately. But any dentry that gets automounted will be pinned by the mount, so it shouldn't be too bad.
(4) Allocate the inode numbers for the cell mountpoints from an IDR to prevent duplicates appearing in the event it cycles round. The number allocated from the IDR is doubled to provide two inode numbers - one for the normal cell name (RO) and one for the dotted cell name (RW).
Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-8-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-4-dhowells@redhat.com/ # v4
2025-03-10  afs: Remove the "autocell" mount option  (David Howells)
Remove the "autocell" mount option. It was an attempt to do automounting of arbitrary cells based on what the user looked up but within the root directory of a mounted volume. This isn't really the right thing to do, and using the "dyn" mount option to get the dynamic root is the right way to do it. The kafs-client package uses "-o dyn" when mounting /afs, so it should be safe to drop "-o autocell". Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-7-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-3-dhowells@redhat.com/ # v4
2025-03-10  afs: Fix afs_atcell_get_link() to handle RCU pathwalk  (David Howells)
The ->get_link() method may be entered under RCU pathwalk conditions (in which case, the dentry pointer is NULL). This is not taken account of by afs_atcell_get_link() and lockdep will complain when it tries to lock an rwsem. Fix this by marking net->ws_cell as __rcu and using RCU access macros on it and by making afs_atcell_get_link() just return a pointer to the name in RCU pathwalk without taking net->cells_lock or a ref on the cell as RCU will protect the name storage (the cell is already freed via call_rcu()). Fixes: 30bca65bbbae ("afs: Make /afs/@cell and /afs/.@cell symlinks") Reported-by: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250310094206.801057-2-dhowells@redhat.com/ # v4
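[Editor's note] In RCU pathwalk ->get_link() is called with a NULL dentry and must neither sleep nor take refs; the shape of the fix, as a sketch (accessors, types and the ref-walk helper below are placeholders, not the afs code):

  static const char *atcell_get_link_sketch(struct dentry *dentry, struct inode *inode,
                                            struct delayed_call *done)
  {
          struct net_like *net = net_of(inode);          /* placeholder accessor */
          const struct cell_like *cell;

          if (!dentry) {
                  /* RCU walk: no net->cells_lock, no cell ref; RCU protects the name storage */
                  cell = rcu_dereference(net->ws_cell);
                  return cell ? cell->name : ERR_PTR(-ECHILD);
          }

          /* ref-walk path: free to take net->cells_lock and a ref on the cell */
          return refwalk_get_link(net, done);            /* placeholder helper */
  }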
2025-03-10  genirq: Make a few functions static  (Thomas Gleixner)
None of these functions are used outside of their source files. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/878qpe2gnx.ffs@tglx
2025-03-10  irqdomain: Remove extern from function declarations  (Jiri Slaby (SUSE))
'extern' is not needed for function declarations. So remove it from irqdomain.h. Note that the declarations are now unified as some had 'extern' and some did not. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250115085409.1629787-2-jirislaby@kernel.org
2025-03-10  net/mlx5: Add IFC bits for PPCNT recovery counters group  (Yael Chemla)
Add recovery counters group layout of PPCNT (Ports Performance Counters Register). This group counts recovery events per link. Also add the corresponding bit in PCAM to indicate this group is supported. Signed-off-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1741545697-23041-1-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-10  iommu/vt-d: Cleanup intel_context_flush_present()  (Lu Baolu)
The intel_context_flush_present() is called in places where either the scalable mode is disabled, or scalable mode is enabled but all PASID entries are known to be non-present. In these cases, the flush_domains path within intel_context_flush_present() will never execute. This dead code is therefore removed. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/r/20250228092631.3425464-7-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-03-10  iommu/vt-d: Move PRI enablement in probe path  (Lu Baolu)
Update PRI enablement to use the new method, similar to the amd iommu driver. Enable PRI in the device probe path and disable it when the device is released. PRI is enabled throughout the device's iommu lifecycle. The infrastructure for the iommu subsystem to handle iopf requests is created during iopf enablement and released during iopf disablement. All invalid page requests from the device are automatically handled by the iommu subsystem if iopf is not enabled. Add iopf_refcount to track the iopf enablement. Convert the return type of intel_iommu_disable_iopf() to void, as there is no way to handle a failure when disabling this feature. Make intel_iommu_enable/disable_iopf() helpers global, as they will be used beyond the current file in the subsequent patch. The iopf_refcount is not protected by any lock. This is acceptable, as there is no concurrent access to it in the current code. The following patch will address this by moving it to the domain attach/detach paths, which are protected by the iommu group mutex. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/r/20250228092631.3425464-6-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
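[Editor's note] The refcounted iopf enable/disable described above, sketched; struct and field names are placeholders, the real code lives in the intel-iommu driver:

  int enable_iopf_sketch(struct dev_info_like *info)
  {
          int ret;

          if (info->iopf_refcount) {
                  info->iopf_refcount++;
                  return 0;
          }

          ret = iopf_queue_add_device(info->iopf_queue, info->dev);
          if (ret)
                  return ret;

          info->iopf_refcount = 1;
          return 0;
  }

  void disable_iopf_sketch(struct dev_info_like *info)
  {
          if (WARN_ON(!info->iopf_refcount))
                  return;
          if (--info->iopf_refcount)
                  return;

          iopf_queue_remove_device(info->iopf_queue, info->dev);  /* void: no failure to report */
  }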
2025-03-10  iommu/vt-d: Move scalable mode ATS enablement to probe path  (Lu Baolu)
Device ATS is currently enabled when a domain is attached to the device and disabled when the domain is detached. This creates a limitation: when the IOMMU is operating in scalable mode and IOPF is enabled, the device's domain cannot be changed. The previous code enables ATS when a domain is set to a device's RID and disables it during RID domain switch. So, if a PASID is set with a domain requiring PRI, ATS should remain enabled until the domain is removed. During the PASID domain's lifecycle, if the RID's domain changes, PRI will be disrupted because it depends on ATS, which is disabled when the blocking domain is set for the device's RID. Remove this limitation by moving ATS enablement to the device probe path. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/r/20250228092631.3425464-5-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-03-10  iommu/vt-d: Check if SVA is supported when attaching the SVA domain  (Jason Gunthorpe)
Attach of a SVA domain should fail if SVA is not supported, move the check for SVA support out of IOMMU_DEV_FEAT_SVA and into attach. Also check when allocating a SVA domain to match other drivers. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/r/20250228092631.3425464-3-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-03-10  iommu/vt-d: Use virt_to_phys()  (Jason Gunthorpe)
If all the inlines are unwound, virt_to_dma_pfn() is simply:
  return page_to_pfn(virt_to_page(p)) << (PAGE_SHIFT - VTD_PAGE_SHIFT);
Which can be re-arranged to:
  (page_to_pfn(virt_to_page(p)) << PAGE_SHIFT) >> VTD_PAGE_SHIFT
The only caller is:
  ((uint64_t)virt_to_dma_pfn(tmp_page) << VTD_PAGE_SHIFT)
re-arranged to:
  ((page_to_pfn(virt_to_page(tmp_page)) << PAGE_SHIFT) >> VTD_PAGE_SHIFT) << VTD_PAGE_SHIFT
Which simplifies to:
  page_to_pfn(virt_to_page(tmp_page)) << PAGE_SHIFT
That is the same as virt_to_phys(tmp_page), so just remove all of this.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/8-v3-e797f4dc6918+93057-iommu_pages_jgg@nvidia.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-03-10  iommu/vt-d: Fix system hang on reboot -f  (Yunhui Cui)
We found that executing the command ./a.out &;reboot -f (where a.out is a program that only executes a while(1) infinite loop) can probabilistically cause the system to hang in the intel_iommu_shutdown() function, rendering it unresponsive. Through analysis, we identified that the factors contributing to this issue are as follows:
1. The reboot -f command does not prompt the kernel to notify the application layer to perform cleanup actions, allowing the application to continue running.
2. When the kernel reaches the intel_iommu_shutdown() function, only the BSP (Bootstrap Processor) CPU is operational in the system.
3. During the execution of intel_iommu_shutdown(), the function down_write(&dmar_global_lock) causes the process to sleep and be scheduled out.
4. At this point, the processor's interrupt flag is not cleared, allowing interrupts to be accepted. However, only legacy device and NMI (Non-Maskable Interrupt) interrupts can come in, as other interrupt routing has already been disabled. If no legacy or NMI interrupts occur at this stage, the scheduler will not be able to run.
5. If the application that got scheduled at this time is executing a while(1)-type loop, it will be unable to be preempted, leading to an infinite loop and causing the system to become unresponsive.
To resolve this issue, the intel_iommu_shutdown() function should not execute down_write(), which can potentially cause the process to be scheduled out. Furthermore, since only the BSP is running during the later stages of the reboot, there is no need for protection against parallel access to the DMAR (DMA Remapping) unit. Therefore, the following lines could be removed:
  down_write(&dmar_global_lock);
  up_write(&dmar_global_lock);
After testing, the issue has been resolved.
Fixes: 6c3a44ed3c55 ("iommu/vt-d: Turn off translations at shutdown") Co-developed-by: Ethan Zhao <haifeng.zhao@linux.intel.com> Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com> Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com> Link: https://lore.kernel.org/r/20250303062421.17929-1-cuiyunhui@bytedance.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-03-10  iommu: apple-dart: Allow mismatched bypass support  (Hector Martin)
This is needed by ISP, which has DART0 with bypass and DART1/2 without. Signed-off-by: Hector Martin <marcan@marcan.st> Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io> Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com> Reviewed-by: Sven Peter <sven@svenpeter.dev> Link: https://lore.kernel.org/r/20250307-isp-dart-v3-2-684fe4489591@gmail.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-03-10  iommu: apple-dart: Increase MAX_DARTS_PER_DEVICE to 3  (Hector Martin)
ISP needs this. Signed-off-by: Hector Martin <marcan@marcan.st> Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io> Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com> Reviewed-by: Sven Peter <sven@svenpeter.dev> Link: https://lore.kernel.org/r/20250307-isp-dart-v3-1-684fe4489591@gmail.com Signed-off-by: Joerg Roedel <jroedel@suse.de>