summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-09-03ksmbd: fix translation in sid_to_id()Christian Brauner
The sid_to_id() functions is relevant when changing ownership of filesystem objects based on acl information. In this case we need to first translate the relevant s*ids into k*ids in ksmbd's user namespace and account for any idmapped mounts. Requesting a change in ownership requires the inverse translation to be applied when we would report ownership to userspace. So k*id_from_mnt() must be used here. Cc: Steve French <stfrench@microsoft.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: linux-cifs@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-03ksmbd: fix subauth 0 handling in sid_to_id()Christian Brauner
It's not obvious why subauth 0 would be excluded from translation. This would lead to wrong results whenever a non-identity idmapping is used. Cc: Steve French <stfrench@microsoft.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: linux-cifs@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-03ksmbd: fix translation in acl entriesChristian Brauner
The ksmbd server performs translation of posix acls to smb acls. Currently the translation is wrong since the idmapping of the mount is used to map the ids into raw userspace ids but what is relevant is the user namespace of ksmbd itself. The user namespace of ksmbd itself which is the initial user namespace. The operation is similar to asking "What *ids would a userspace process see given that k*id in the relevant user namespace?". Before the final translation we need to apply the idmapping of the mount in case any is used. Add two simple helpers for ksmbd. Cc: Steve French <stfrench@microsoft.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: linux-cifs@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-03ksmbd: fix translation in ksmbd_acls_fattr()Christian Brauner
When creating new filesystem objects ksmbd translates between k*ids and s*ids. For this it often uses struct smb_fattr and stashes the k*ids in cf_uid and cf_gid. Let cf_uid and cf_gid always contain the final information taking any potential idmapped mounts into account. When finally translation cf_*id into s*ids translate them into the user namespace of ksmbd since that is the relevant user namespace here. Cc: Steve French <stfrench@microsoft.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: linux-cifs@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-03ksmbd: fix translation in create_posix_rsp_buf()Christian Brauner
When transferring ownership information to the client the k*ids are translated into raw *ids before they are sent over the wire. The function currently erroneously translates the k*ids according to the mount's idmapping. Instead, reporting the owning *ids to userspace the underlying k*ids need to be mapped up in the caller's user namespace. This is how stat() works. The caller in this instance is ksmbd itself and ksmbd always runs in the initial user namespace. Translate according to that taking any potential idmapped mounts into account. Switch to from_k*id_munged() which ensures that the overflow*id is returned instead of the (*id_t)-1 when the k*id can't be translated. Cc: Steve French <stfrench@microsoft.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: linux-cifs@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-03ksmbd: fix translation in smb2_populate_readdir_entry()Christian Brauner
When transferring ownership information to the client the k*ids are translated into raw *ids before they are sent over the wire. The function currently erroneously translates the k*ids according to the mount's idmapping. Instead, reporting the owning *ids to userspace the underlying k*ids need to be mapped up in the caller's user namespace. This is how stat() works. The caller in this instance is ksmbd itself and ksmbd always runs in the initial user namespace. Translate according to that. The idmapping of the mount is already taken into account by the lower filesystem and so kstat->*id will contain the mapped k*ids. Switch to from_k*id_munged() which ensures that the overflow*id is returned instead of the (*id_t)-1 when the k*id can't be translated. Cc: Steve French <stfrench@microsoft.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: linux-cifs@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-03ksmbd: fix lookup on idmapped mountsChristian Brauner
It's great that the new in-kernel ksmbd server will support idmapped mounts out of the box! However, lookup is currently broken. Lookup helpers such as lookup_one_len() call inode_permission() internally to ensure that the caller is privileged over the inode of the base dentry they are trying to lookup under. So the permission checking here is currently wrong. Linux v5.15 will gain a new lookup helper lookup_one() that does take idmappings into account. I've added it as part of my patch series to make btrfs support idmapped mounts. The new helper is in linux-next as part of David's (Sterba) btrfs for-next branch as commit c972214c133b ("namei: add mapping aware lookup helper"). I've said it before during one of my first reviews: I would very much recommend adding fstests to [1]. It already seems to have very rudimentary cifs support. There is a completely generic idmapped mount testsuite that supports idmapped mounts. [1]: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/ Cc: Colin Ian King <colin.king@canonical.com> Cc: Steve French <stfrench@microsoft.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: David Sterba <dsterba@suse.com> Cc: linux-cifs@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-03loop: reduce the loop_ctl_mutex scopeTetsuo Handa
syzbot is reporting circular locking problem at __loop_clr_fd() [1], for commit a160c6159d4a0cf8 ("block: add an optional probe callback to major_names") is calling the module's probe function with major_names_lock held. Fortunately, since commit 990e78116d38059c ("block: loop: fix deadlock between open and remove") stopped holding loop_ctl_mutex in lo_open(), current role of loop_ctl_mutex is to serialize access to loop_index_idr and loop_add()/loop_remove(); in other words, management of id for IDR. To avoid holding loop_ctl_mutex during whole add/remove operation, use a bool flag to indicate whether the loop device is ready for use. loop_unregister_transfer() which is called from cleanup_cryptoloop() currently has possibility of use-after-free problem due to lack of serialization between kfree() from loop_remove() from loop_control_remove() and mutex_lock() from unregister_transfer_cb(). But since lo->lo_encryption should be already NULL when this function is called due to module unload, and commit 222013f9ac30b9ce ("cryptoloop: add a deprecation warning") indicates that we will remove this function shortly, this patch updates this function to emit warning instead of checking lo->lo_encryption. Holding loop_ctl_mutex in loop_exit() is pointless, for all users must close /dev/loop-control and /dev/loop$num (in order to drop module's refcount to 0) before loop_exit() starts, and nobody can open /dev/loop-control or /dev/loop$num afterwards. Link: https://syzkaller.appspot.com/bug?id=7bb10e8b62f83e4d445cdf4c13d69e407e629558 [1] Reported-by: syzbot <syzbot+f61766d5763f9e7a118f@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/adb1e792-fc0e-ee81-7ea0-0906fc36419d@i-love.sakura.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03tracing: Add migrate-disabled counter to tracing output.Thomas Gleixner
migrate_disable() forbids task migration to another CPU. It is available since v5.11 and has already users such as highmem or BPF. It is useful to observe this task state in tracing which already has other states like the preemption counter. Instead of adding the migrate disable counter as a new entry to struct trace_entry, which would extend the whole struct by four bytes, it is squashed into the preempt-disable counter. The lower four bits represent the preemption counter, the upper four bits represent the migrate disable counter. Both counter shouldn't exceed 15 but if they do, there is a safety net which caps the value at 15. Add the migrate-disable counter to the trace entry so it shows up in the trace. Due to the users mentioned above, it is already possible to observe it: | bash-1108 [000] ...21 73.950578: rss_stat: mm_id=2213312838 curr=0 type=MM_ANONPAGES size=8192B | bash-1108 [000] d..31 73.951222: irq_disable: caller=flush_tlb_mm_range+0x115/0x130 parent=ptep_clear_flush+0x42/0x50 | bash-1108 [000] d..31 73.951222: tlb_flush: pages:1 reason:local mm shootdown (3) The last value is the migrate-disable counter. Things that popped up: - trace_print_lat_context() does not print the migrate counter. Not sure if it should. It is used in "verbose" mode and uses 8 digits and I'm not sure ther is something processing the value. - trace_define_common_fields() now defines a different variable. This probably breaks things. No ide what to do in order to preserve the old behaviour. Since this is used as a filter it should be split somehow to be able to match both nibbles here. Link: https://lkml.kernel.org/r/20210810132625.ylssabmsrkygokuv@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: patch description.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> [ SDR: Removed change to common_preempt_count field name ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-03io_uring: reexpand under-reexpanded itersPavel Begunkov
[ 74.211232] BUG: KASAN: stack-out-of-bounds in iov_iter_revert+0x809/0x900 [ 74.212778] Read of size 8 at addr ffff888025dc78b8 by task syz-executor.0/828 [ 74.214756] CPU: 0 PID: 828 Comm: syz-executor.0 Not tainted 5.14.0-rc3-next-20210730 #1 [ 74.216525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 74.219033] Call Trace: [ 74.219683] dump_stack_lvl+0x8b/0xb3 [ 74.220706] print_address_description.constprop.0+0x1f/0x140 [ 74.224226] kasan_report.cold+0x7f/0x11b [ 74.226085] iov_iter_revert+0x809/0x900 [ 74.227960] io_write+0x57d/0xe40 [ 74.232647] io_issue_sqe+0x4da/0x6a80 [ 74.242578] __io_queue_sqe+0x1ac/0xe60 [ 74.245358] io_submit_sqes+0x3f6e/0x76a0 [ 74.248207] __do_sys_io_uring_enter+0x90c/0x1a20 [ 74.257167] do_syscall_64+0x3b/0x90 [ 74.257984] entry_SYSCALL_64_after_hwframe+0x44/0xae old_size = iov_iter_count(); ... iov_iter_revert(old_size - iov_iter_count()); If iov_iter_revert() is done base on the initial size as above, and the iter is truncated and not reexpanded in the middle, it miscalculates borders causing problems. This trace is due to no one reexpanding after generic_write_checks(). Now iters store how many bytes has been truncated, so reexpand them to the initial state right before reverting. Cc: stable@vger.kernel.org Reported-by: Palash Oswal <oswalpalash@gmail.com> Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Reported-and-tested-by: syzbot+9671693590ef5aad8953@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-03iov_iter: track truncated sizePavel Begunkov
Remember how many bytes were truncated and reverted back. Because not reexpanded iterators don't always work well with reverting, we may need to know that to reexpand ourselves when needed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Protect nft_ct template with global mutex, from Pavel Skripkin. 2) Two recent commits switched inet rt and nexthop exception hashes from jhash to siphash. If those two spots are problematic then conntrack is affected as well, so switch voer to siphash too. While at it, add a hard upper limit on chain lengths and reject insertion if this is hit. Patches from Florian Westphal. 3) Fix use-after-scope in nf_socket_ipv6 reported by KASAN, from Benjamin Hesmans. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: netfilter: socket: icmp6: fix use-after-scope netfilter: refuse insertion if chain has grown too large netfilter: conntrack: switch to siphash netfilter: conntrack: sanitize table size default settings netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex ==================== Link: https://lore.kernel.org/r/20210903163020.13741-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-03ionic: fix a sleeping in atomic bugDan Carpenter
This code is holding spin_lock_bh(&lif->rx_filters.lock); so the allocation needs to be atomic. Fixes: 969f84394604 ("ionic: sync the filters in the work task") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Shannon Nelson <snelson@pensando.io> Link: https://lore.kernel.org/r/20210903131856.GA25934@kili Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-04mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of ↵Sebastian Andrzej Siewior
IRQ context flush_all() flushes a specific SLAB cache on each CPU (where the cache is present). The deactivate_slab()/__free_slab() invocation happens within IPI handler and is problematic for PREEMPT_RT. The flush operation is not a frequent operation or a hot path. The per-CPU flush operation can be moved to within a workqueue. Because a workqueue handler, unlike IPI handler, does not disable irqs, flush_slab() now has to disable them for working with the kmem_cache_cpu fields. deactivate_slab() is safe to call with irqs enabled. [vbabka@suse.cz: adapt to new SLUB changes] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slab: split out the cpu offline variant of flush_slab()Vlastimil Babka
flush_slab() is called either as part IPI handler on given live cpu, or as a cleanup on behalf of another cpu that went offline. The first case needs to protect updating the kmem_cache_cpu fields with disabled irqs. Currently the whole call happens with irqs disabled by the IPI handler, but the following patch will change from IPI to workqueue, and flush_slab() will have to disable irqs (to be replaced with a local lock later) in the critical part. To prepare for this change, replace the call to flush_slab() for the dead cpu handling with an opencoded variant that will not disable irqs nor take a local lock. Suggested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: don't disable irqs in slub_cpu_dead()Vlastimil Babka
slub_cpu_dead() cleans up for an offlined cpu from another cpu and calls only functions that are now irq safe, so we don't need to disable irqs anymore. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: only disable irq with spin_lock in __unfreeze_partials()Vlastimil Babka
__unfreeze_partials() no longer needs to have irqs disabled, except for making the spin_lock operations irq-safe, so convert the spin_locks operations and remove the separate irq handling. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: separate detaching of partial list in unfreeze_partials() from ↵Vlastimil Babka
unfreezing Unfreezing partial list can be split to two phases - detaching the list from struct kmem_cache_cpu, and processing the list. The whole operation does not need to be protected by disabled irqs. Restructure the code to separate the detaching (with disabled irqs) and unfreezing (with irq disabling to be reduced in the next patch). Also, unfreeze_partials() can be called from another cpu on behalf of a cpu that is being offlined, where disabling irqs on the local cpu has no sense, so restructure the code as follows: - __unfreeze_partials() is the bulk of unfreeze_partials() that processes the detached percpu partial list - unfreeze_partials() detaches list from current cpu with irqs disabled and calls __unfreeze_partials() - unfreeze_partials_cpu() is to be called for the offlined cpu so it needs no irq disabling, and is called from __flush_cpu_slab() - flush_cpu_slab() is for the local cpu thus it needs to call unfreeze_partials(). So it can't simply call __flush_cpu_slab(smp_processor_id()) anymore and we have to open-code the proper calls. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: detach whole partial list at once in unfreeze_partials()Vlastimil Babka
Instead of iterating through the live percpu partial list, detach it from the kmem_cache_cpu at once. This is simpler and will allow further optimization. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: discard slabs in unfreeze_partials() without irqs disabledVlastimil Babka
No need for disabled irqs when discarding slabs, so restore them before discarding. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: move irq control into unfreeze_partials()Vlastimil Babka
unfreeze_partials() can be optimized so that it doesn't need irqs disabled for the whole time. As the first step, move irq control into the function and remove it from the put_cpu_partial() caller. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: call deactivate_slab() without disabling irqsVlastimil Babka
The function is now safe to be called with irqs enabled, so move the calls outside of irq disabled sections. When called from ___slab_alloc() -> flush_slab() we have irqs disabled, so to reenable them before deactivate_slab() we need to open-code flush_slab() in ___slab_alloc() and reenable irqs after modifying the kmem_cache_cpu fields. But that means a IRQ handler meanwhile might have assigned a new page to kmem_cache_cpu.page so we have to retry the whole check. The remaining callers of flush_slab() are the IPI handler which has disabled irqs anyway, and slub_cpu_dead() which will be dealt with in the following patch. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: make locking in deactivate_slab() irq-safeVlastimil Babka
dectivate_slab() now no longer touches the kmem_cache_cpu structure, so it will be possible to call it with irqs enabled. Just convert the spin_lock calls to their irq saving/restoring variants to make it irq-safe. Note we now have to use cmpxchg_double_slab() for irq-safe slab_lock(), because in some situations we don't take the list_lock, which would disable irqs. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: move reset of c->page and freelist out of deactivate_slab()Vlastimil Babka
deactivate_slab() removes the cpu slab by merging the cpu freelist with slab's freelist and putting the slab on the proper node's list. It also sets the respective kmem_cache_cpu pointers to NULL. By extracting the kmem_cache_cpu operations from the function, we can make it not dependent on disabled irqs. Also if we return a single free pointer from ___slab_alloc, we no longer have to assign kmem_cache_cpu.page before deactivation or care if somebody preempted us and assigned a different page to our kmem_cache_cpu in the process. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: stop disabling irqs around get_partial()Vlastimil Babka
The function get_partial() does not need to have irqs disabled as a whole. It's sufficient to convert spin_lock operations to their irq saving/restoring versions. As a result, it's now possible to reach the page allocator from the slab allocator without disabling and re-enabling interrupts on the way. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: check new pages with restored irqsVlastimil Babka
Building on top of the previous patch, re-enable irqs before checking new pages. alloc_debug_processing() is now called with enabled irqs so we need to remove VM_BUG_ON(!irqs_disabled()); in check_slab() - there doesn't seem to be a need for it anyway. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: validate slab from partial list or page allocator before making it ↵Vlastimil Babka
cpu slab When we obtain a new slab page from node partial list or page allocator, we assign it to kmem_cache_cpu, perform some checks, and if they fail, we undo the assignment. In order to allow doing the checks without irq disabled, restructure the code so that the checks are done first, and kmem_cache_cpu.page assignment only after they pass. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: restore irqs around calling new_slab()Vlastimil Babka
allocate_slab() currently re-enables irqs before calling to the page allocator. It depends on gfpflags_allow_blocking() to determine if it's safe to do so. Now we can instead simply restore irq before calling it through new_slab(). The other caller early_kmem_cache_node_alloc() is unaffected by this. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc()Vlastimil Babka
Continue reducing the irq disabled scope. Check for per-cpu partial slabs with first with irqs enabled and then recheck with irqs disabled before grabbing the slab page. Mostly preparatory for the following patches. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: do initial checks in ___slab_alloc() with irqs enabledVlastimil Babka
As another step of shortening irq disabled sections in ___slab_alloc(), delay disabling irqs until we pass the initial checks if there is a cached percpu slab and it's suitable for our allocation. Now we have to recheck c->page after actually disabling irqs as an allocation in irq handler might have replaced it. Because we call pfmemalloc_match() as one of the checks, we might hit VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc in case we get interrupted and the page is freed. Thus introduce a pfmemalloc_match_unsafe() variant that lacks the PageSlab check. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-04mm, slub: move disabling/enabling irqs to ___slab_alloc()Vlastimil Babka
Currently __slab_alloc() disables irqs around the whole ___slab_alloc(). This includes cases where this is not needed, such as when the allocation ends up in the page allocator and has to awkwardly enable irqs back based on gfp flags. Also the whole kmem_cache_alloc_bulk() is executed with irqs disabled even when it hits the __slab_alloc() slow path, and long periods with disabled interrupts are undesirable. As a first step towards reducing irq disabled periods, move irq handling into ___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu pointer from becoming invalid via get_cpu_ptr(), thus preempt_disable(). This does not protect against modification by an irq handler, which is still done by disabled irq for most of ___slab_alloc(). As a small immediate benefit, slab_out_of_memory() from ___slab_alloc() is now called with irqs enabled. kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables them before calling ___slab_alloc(), which then disables them at its discretion. The whole kmem_cache_alloc_bulk() operation also disables preemption. When ___slab_alloc() calls new_slab() to allocate a new page, re-enable preemption, because new_slab() will re-enable interrupts in contexts that allow blocking (this will be improved by later patches). The patch itself will thus increase overhead a bit due to disabled preemption (on configs where it matters) and increased disabling/enabling irqs in kmem_cache_alloc_bulk(), but that will be gradually improved in the following patches. Note in __slab_alloc() we need to change the #ifdef CONFIG_PREEMPT guard to CONFIG_PREEMPT_COUNT to make sure preempt disable/enable is properly paired in all configurations. On configs without involuntary preemption and debugging the re-read of kmem_cache_cpu pointer is still compiled out as it was before. [ Mike Galbraith <efault@gmx.de>: Fix kmem_cache_alloc_bulk() error path ] Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: simplify kmem_cache_cpu and tid setupVlastimil Babka
In slab_alloc_node() and do_slab_free() fastpaths we need to guarantee that our kmem_cache_cpu pointer is from the same cpu as the tid value. Currently that's done by reading the tid first using this_cpu_read(), then the kmem_cache_cpu pointer and verifying we read the same tid using the pointer and plain READ_ONCE(). This can be simplified to just fetching kmem_cache_cpu pointer and then reading tid using the pointer. That guarantees they are from the same cpu. We don't need to read the tid using this_cpu_read() because the value will be validated by this_cpu_cmpxchg_double(), making sure we are on the correct cpu and the freelist didn't change by anyone preempting us since reading the tid. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-04mm, slub: restructure new page checks in ___slab_alloc()Vlastimil Babka
When we allocate slab object from a newly acquired page (from node's partial list or page allocator), we usually also retain the page as a new percpu slab. There are two exceptions - when pfmemalloc status of the page doesn't match our gfp flags, or when the cache has debugging enabled. The current code for these decisions is not easy to follow, so restructure it and add comments. The new structure will also help with the following changes. No functional change. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-04mm, slub: return slab page from get_partial() and set c->page afterwardsVlastimil Babka
The function get_partial() finds a suitable page on a partial list, acquires and returns its freelist and assigns the page pointer to kmem_cache_cpu. In later patch we will need more control over the kmem_cache_cpu.page assignment, so instead of passing a kmem_cache_cpu pointer, pass a pointer to a pointer to a page that get_partial() can fill and the caller can assign the kmem_cache_cpu.page pointer. No functional change as all of this still happens with disabled IRQs. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: dissolve new_slab_objects() into ___slab_alloc()Vlastimil Babka
The later patches will need more fine grained control over individual actions in ___slab_alloc(), the only caller of new_slab_objects(), so dissolve it there. This is a preparatory step with no functional change. The only minor change is moving WARN_ON_ONCE() for using a constructor together with __GFP_ZERO to new_slab(), which makes it somewhat less frequent, but still able to catch a development change introducing a systematic misuse. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-04mm, slub: extract get_partial() from new_slab_objects()Vlastimil Babka
The later patches will need more fine grained control over individual actions in ___slab_alloc(), the only caller of new_slab_objects(), so this is a first preparatory step with no functional change. This adds a goto label that appears unnecessary at this point, but will be useful for later changes. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com>
2021-09-03io_uring: io_uring_complete() trace should take an integerJens Axboe
It currently takes a long, and while that's normally OK, the io_uring limit is an int. Internally in io_uring it's an int, but sometimes it's passed as a long. That can yield confusing results where a completions seems to generate a huge result: ou-sqp-1297-1298 [001] ...1 788.056371: io_uring_complete: ring 000000000e98e046, user_data 0x0, result 4294967171, cflags 0 which is due to -ECANCELED being stored in an unsigned, and then passed in as a long. Using the right int type, the trace looks correct: iou-sqp-338-339 [002] ...1 15.633098: io_uring_complete: ring 00000000e0ac60cf, user_data 0x0, result -125, cflags 0 Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03Merge tag 'linux-kselftest-next-5.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull Kselftest updates from Shuah Khan: "Fixes to build and test failures: - openat2 test failure for O_LARGEFILE flag on ARM64 - x86 test build failures related to glibc 2.34 adding support for variable sized MINSIGSTKSZ and SIGSTKSZ - removing obsolete configs in sync and cpufreq config files - minor spelling and duplicate header include cleanups" * tag 'linux-kselftest-next-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: selftests/cpufreq: Rename DEBUG_PI_LIST to DEBUG_PLIST selftests/sync: Remove the deprecated config SYNC selftests: safesetid: Fix spelling mistake "cant" -> "can't" selftests/x86: Fix error: variably modified 'altstack_data' at file scope kselftest:sched: remove duplicate include in cs_prctl_test.c selftests: openat2: Fix testing failure for O_LARGEFILE flag
2021-09-03Merge tag 'kbuild-v5.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Add -s option (strict mode) to merge_config.sh to make it fail when any symbol is redefined. - Show a warning if a different compiler is used for building external modules. - Infer --target from ARCH for CC=clang to let you cross-compile the kernel without CROSS_COMPILE. - Make the integrated assembler default (LLVM_IAS=1) for CC=clang. - Add <linux/stdarg.h> to the kernel source instead of borrowing <stdarg.h> from the compiler. - Add Nick Desaulniers as a Kbuild reviewer. - Drop stale cc-option tests. - Fix the combination of CONFIG_TRIM_UNUSED_KSYMS and CONFIG_LTO_CLANG to handle symbols in inline assembly. - Show a warning if 'FORCE' is missing for if_changed rules. - Various cleanups * tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits) kbuild: redo fake deps at include/ksym/*.h kbuild: clean up objtool_args slightly modpost: get the *.mod file path more simply checkkconfigsymbols.py: Fix the '--ignore' option kbuild: merge vmlinux_link() between ARCH=um and other architectures kbuild: do not remove 'linux' link in scripts/link-vmlinux.sh kbuild: merge vmlinux_link() between the ordinary link and Clang LTO kbuild: remove stale *.symversions kbuild: remove unused quiet_cmd_update_lto_symversions gen_compile_commands: extract compiler command from a series of commands x86: remove cc-option-yn test for -mtune= arc: replace cc-option-yn uses with cc-option s390: replace cc-option-yn uses with cc-option ia64: move core-y in arch/ia64/Makefile to arch/ia64/Kbuild sparc: move the install rule to arch/sparc/Makefile security: remove unneeded subdir-$(CONFIG_...) kbuild: sh: remove unused install script kbuild: Fix 'no symbols' warning when CONFIG_TRIM_UNUSD_KSYMS=y kbuild: Switch to 'f' variants of integrated assembler flag kbuild: Shuffle blank line to improve comment meaning ...
2021-09-03Merge branch 'stable/for-linus-5.15-rc0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft Pull ibft fix from Konrad Rzeszutek Wilk: "An arm64 compile fix for the new code that fixed the iBFT KASLR handling. I missed the original 0-day build email report" * 'stable/for-linus-5.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft: iscsi_ibft: Fix isa_bus_to_virt not working under ARM
2021-09-03mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()Vlastimil Babka
Commit d6e0b7fa1186 ("slub: make dead caches discard free slabs immediately") introduced cpu partial flushing for kmemcg caches, based on setting the target cpu_partial to 0 and adding a flushing check in put_cpu_partial(). This code that sets cpu_partial to 0 was later moved by c9fc586403e7 ("slab: introduce __kmemcg_cache_deactivate()") and ultimately removed by 9855609bde03 ("mm: memcg/slab: use a single set of kmem_caches for all accounted allocations"). However the check and flush in put_cpu_partial() was never removed, although it's effectively a dead code. So this patch removes it. Note that d6e0b7fa1186 also added preempt_disable()/enable() to unfreeze_partials() which could be thus also considered unnecessary. But further patches will rely on it, so keep it. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-03mm, slub: don't disable irq for debug_check_no_locks_freed()Vlastimil Babka
In slab_free_hook() we disable irqs around the debug_check_no_locks_freed() call, which is unnecessary, as irqs are already being disabled inside the call. This seems to be leftover from the past where there were more calls inside the irq disabled sections. Remove the irq disable/enable operations. Mel noted: > Looks like it was needed for kmemcheck which went away back in 4.15 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-03mm, slub: allocate private object map for validate_slab_cache()Vlastimil Babka
validate_slab_cache() is called either to handle a sysfs write, or from a self-test context. In both situations it's straightforward to preallocate a private object bitmap instead of grabbing the shared static one meant for critical sections, so let's do that. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-03mm, slub: allocate private object map for debugfs listingsVlastimil Babka
Slub has a static spinlock protected bitmap for marking which objects are on freelist when it wants to list them, for situations where dynamically allocating such map can lead to recursion or locking issues, and on-stack bitmap would be too large. The handlers of debugfs files alloc_traces and free_traces also currently use this shared bitmap, but their syscall context makes it straightforward to allocate a private map before entering locked sections, so switch these processing paths to use a private bitmap. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-03mm, slub: don't call flush_all() from slab_debug_trace_open()Vlastimil Babka
slab_debug_trace_open() can only be called on caches with SLAB_STORE_USER flag and as with all slub debugging flags, such caches avoid cpu or percpu partial slabs altogether, so there's nothing to flush. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com>
2021-09-03docs: kernel-hacking: Remove inappropriate textAlyssa Rosenzweig
Remove inappropriate sexual (and ableist) text from the locking documentation, aligning it with the kernel code-of-conduct. As the text was unrelated to locking, this change streamlines the document and improves readability. Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io> Link: https://lore.kernel.org/r/20210903151826.6300-1-alyssa@rosenzweig.io Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2021-09-03futex: Remove unused variable 'vpid' in futex_proxy_trylock_atomic()Thomas Gleixner
The recent bug fix left the variable 'vpid' and an assignment to it around, but the variable is otherwise unused. clang dose not complain even with W=1, but gcc exposed this. Fixes: 4f07ec0d76f2 ("futex: Prevent inconsistent state and exit race") Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2021-09-03Merge tag 'powerpc-5.15-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Convert pseries & powernv to use MSI IRQ domains. - Rework the pseries CPU numbering so that CPUs that are removed, and later re-added, are given a CPU number on the same node as previously, when possible. - Add support for a new more flexible device-tree format for specifying NUMA distances. - Convert powerpc to GENERIC_PTDUMP. - Retire sbc8548 and sbc8641d board support. - Various other small features and fixes. Thanks to Alexey Kardashevskiy, Aneesh Kumar K.V, Anton Blanchard, Cédric Le Goater, Christophe Leroy, Emmanuel Gil Peyrot, Fabiano Rosas, Fangrui Song, Finn Thain, Gautham R. Shenoy, Hari Bathini, Joel Stanley, Jordan Niethe, Kajol Jain, Laurent Dufour, Leonardo Bras, Lukas Bulwahn, Marc Zyngier, Masahiro Yamada, Michal Suchanek, Nathan Chancellor, Nicholas Piggin, Parth Shah, Paul Gortmaker, Pratik R. Sampat, Randy Dunlap, Sebastian Andrzej Siewior, Srikar Dronamraju, Wan Jiabing, Xiongwei Song, and Zheng Yongjun. * tag 'powerpc-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (154 commits) powerpc/bug: Cast to unsigned long before passing to inline asm powerpc/ptdump: Fix generic ptdump for 64-bit KVM: PPC: Fix clearing never mapped TCEs in realmode powerpc/pseries/iommu: Rename "direct window" to "dma window" powerpc/pseries/iommu: Make use of DDW for indirect mapping powerpc/pseries/iommu: Find existing DDW with given property name powerpc/pseries/iommu: Update remove_dma_window() to accept property name powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw() powerpc/pseries/iommu: Allow DDW windows starting at 0x00 powerpc/pseries/iommu: Add ddw_list_new_entry() helper powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper powerpc/kernel/iommu: Add new iommu_table_in_use() helper powerpc/pseries/iommu: Replace hard-coded page shift powerpc/numa: Update cpu_cpu_map on CPU online/offline powerpc/numa: Print debug statements only when required powerpc/numa: convert printk to pr_xxx powerpc/numa: Drop dbg in favour of pr_debug powerpc/smp: Enable CACHE domain for shared processor powerpc/smp: Update cpu_core_map on all PowerPc systems ...
2021-09-03Merge tag 'for-5.15/parisc-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc architecture fixes from Helge Deller: "Fix an unaligned-access crash in the bootloader and drop asm/swab.h" * tag 'for-5.15/parisc-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Fix unaligned-access crash in bootloader parisc: Drop __arch_swab16(), arch_swab24(), _arch_swab32() and __arch_swab64() functions
2021-09-03clk: qcom: gcc-sm6350: Remove unused variableKonrad Dybcio
In the commit "clk: qcom: Add SM6350 GCC driver" (no hash yet) an unused variable has been overlooked. Remove it. Signed-off-by: Konrad Dybcio <konrad.dybcio@somainline.org> Link: https://lore.kernel.org/r/5b7edab0-4756-94d0-d601-050120cbf4cb@somainline.org Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org>