summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2023-03-14net: page_pool, skbuff: make skb_mark_for_recycle() always availableAlexander Lobakin
skb_mark_for_recycle() is guarded with CONFIG_PAGE_POOL, this creates unneeded complication when using it in the generic code. For now, it's only used in the drivers always selecting Page Pool, so this works. Move the guards so that preprocessor will cut out only the operation itself and the function will still be a noop on !PAGE_POOL systems, but available there as well. No functional changes. Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202303020342.Wi2PRFFH-lkp@intel.com Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230313215553.1045175-3-aleksander.lobakin@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-14tracing: Make tracepoint lockdep check actually test somethingSteven Rostedt (Google)
A while ago where the trace events had the following: rcu_read_lock_sched_notrace(); rcu_dereference_sched(...); rcu_read_unlock_sched_notrace(); If the tracepoint is enabled, it could trigger RCU issues if called in the wrong place. And this warning was only triggered if lockdep was enabled. If the tracepoint was never enabled with lockdep, the bug would not be caught. To handle this, the above sequence was done when lockdep was enabled regardless if the tracepoint was enabled or not (although the always enabled code really didn't do anything, it would still trigger a warning). But a lot has changed since that lockdep code was added. One is, that sequence no longer triggers any warning. Another is, the tracepoint when enabled doesn't even do that sequence anymore. The main check we care about today is whether RCU is "watching" or not. So if lockdep is enabled, always check if rcu_is_watching() which will trigger a warning if it is not (tracepoints require RCU to be watching). Note, that old sequence did add a bit of overhead when lockdep was enabled, and with the latest kernel updates, would cause the system to slow down enough to trigger kernel "stalled" warnings. Link: http://lore.kernel.org/lkml/20140806181801.GA4605@redhat.com Link: http://lore.kernel.org/lkml/20140807175204.C257CAC5@viggo.jf.intel.com Link: https://lore.kernel.org/lkml/20230307184645.521db5c9@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20230310172856.77406446@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Joel Fernandes <joel@joelfernandes.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Fixes: e6753f23d961 ("tracepoint: Make rcuidle tracepoint callers use SRCU") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-14lockd: set file_lock start and end when decoding nlm4 testargsJeff Layton
Commit 6930bcbfb6ce dropped the setting of the file_lock range when decoding a nlm_lock off the wire. This causes the client side grant callback to miss matching blocks and reject the lock, only to rerequest it 30s later. Add a helper function to set the file_lock range from the start and end values that the protocol uses, and have the nlm_lock decoder call that to set up the file_lock args properly. Fixes: 6930bcbfb6ce ("lockd: detect and reject lock arguments that overflow") Reported-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Amir Goldstein <amir73il@gmail.com> Cc: stable@vger.kernel.org #6.0 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-03-14block: do not reverse request order when flushing plug listJan Kara
Commit 26fed4ac4eab ("block: flush plug based on hardware and software queue order") changed flushing of plug list to submit requests one device at a time. However while doing that it also started using list_add_tail() instead of list_add() used previously thus effectively submitting requests in reverse order. Also when forming a rq_list with remaining requests (in case two or more devices are used), we effectively reverse the ordering of the plug list for each device we process. Submitting requests in reverse order has negative impact on performance for rotational disks (when BFQ is not in use). We observe 10-25% regression in random 4k write throughput, as well as ~20% regression in MariaDB OLTP benchmark on rotational storage on btrfs filesystem. Fix the problem by preserving ordering of the plug list when inserting requests into the queuelist as well as by appending to requeue_list instead of prepending to it. Fixes: 26fed4ac4eab ("block: flush plug based on hardware and software queue order") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230313093002.11756-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-13bpf: Disable migration when freeing stashed local kptr using obj dropDave Marchevsky
When a local kptr is stashed in a map and freed when the map goes away, currently an error like the below appears: [ 39.195695] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u32:15/2875 [ 39.196549] caller is bpf_mem_free+0x56/0xc0 [ 39.196958] CPU: 15 PID: 2875 Comm: kworker/u32:15 Tainted: G O 6.2.0-13016-g22df776a9a86 #4477 [ 39.197897] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [ 39.198949] Workqueue: events_unbound bpf_map_free_deferred [ 39.199470] Call Trace: [ 39.199703] <TASK> [ 39.199911] dump_stack_lvl+0x60/0x70 [ 39.200267] check_preemption_disabled+0xbf/0xe0 [ 39.200704] bpf_mem_free+0x56/0xc0 [ 39.201032] ? bpf_obj_new_impl+0xa0/0xa0 [ 39.201430] bpf_obj_free_fields+0x1cd/0x200 [ 39.201838] array_map_free+0xad/0x220 [ 39.202193] ? finish_task_switch+0xe5/0x3c0 [ 39.202614] bpf_map_free_deferred+0xea/0x210 [ 39.203006] ? lockdep_hardirqs_on_prepare+0xe/0x220 [ 39.203460] process_one_work+0x64f/0xbe0 [ 39.203822] ? pwq_dec_nr_in_flight+0x110/0x110 [ 39.204264] ? do_raw_spin_lock+0x107/0x1c0 [ 39.204662] ? lockdep_hardirqs_on_prepare+0xe/0x220 [ 39.205107] worker_thread+0x74/0x7a0 [ 39.205451] ? process_one_work+0xbe0/0xbe0 [ 39.205818] kthread+0x171/0x1a0 [ 39.206111] ? kthread_complete_and_exit+0x20/0x20 [ 39.206552] ret_from_fork+0x1f/0x30 [ 39.206886] </TASK> This happens because the call to __bpf_obj_drop_impl I added in the patch adding support for stashing local kptrs doesn't disable migration. Prior to that patch, __bpf_obj_drop_impl logic only ran when called by a BPF progarm, whereas now it can be called from map free path, so it's necessary to explicitly disable migration. Also, refactor a bit to just call __bpf_obj_drop_impl directly instead of bothering w/ dtor union and setting pointer-to-obj_drop. Fixes: c8e187540914 ("bpf: Support __kptr to local kptrs") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230313214641.3731908-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-13efi: earlycon: Reprobe after parsing config tablesArd Biesheuvel
Commit 732ea9db9d8a ("efi: libstub: Move screen_info handling to common code") reorganized the earlycon handling so that all architectures pass the screen_info data via a EFI config table instead of populating struct screen_info directly, as the latter is only possible when the EFI stub is baked into the kernel (and not into the decompressor). However, this means that struct screen_info may not have been populated yet by the time the earlycon probe takes place, and this results in a non-functional early console. So let's probe again right after parsing the config tables and populating struct screen_info. Note that this means that earlycon output starts a bit later than before, and so it may fail to capture issues that occur while doing the early EFI initialization. Fixes: 732ea9db9d8a ("efi: libstub: Move screen_info handling to common code") Reported-by: Shawn Guo <shawn.guo@linaro.org> Tested-by: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2023-03-13tasks: Extract rcu_users out of unionDavid Vernet
In commit 3fbd7ee285b2b ("tasks: Add a count of task RCU users"), a count on the number of RCU users was added to struct task_struct. This was done so as to enable the removal of task_rcu_dereference(), and allow tasks to be protected by RCU even after exiting and being removed from the runqueue. In this commit, the 'refcount_t rcu_users' field that keeps track of this refcount was put into a union co-located with 'struct rcu_head rcu', so as to avoid taking up any extra space in task_struct. This was possible to do safely, because the field was only ever decremented by a static set of specific callers, and then never incremented again. While this restriction of there only being a small, static set of users of this field has worked fine, it prevents us from leveraging the field to use RCU to protect tasks in other contexts. During tracing, for example, it would be useful to be able to collect some tasks that performed a certain operation, put them in a map, and then periodically summarize who they are, which cgroup they're in, how much CPU time they've utilized, etc. While this can currently be done with 'usage', it becomes tricky when a task is already in a map, or if a reference should only be taken if a task is valid and will not soon be reaped. Ideally, we could do something like pass a reference to a map value, and then try to acquire a reference to the task in an RCU read region by using refcount_inc_not_zero(). Similarly, in sched_ext, schedulers are using integer pids to remember tasks, and then looking them up with find_task_by_pid_ns(). This is slow, error prone, and adds complexity. It would be more convenient and performant if BPF schedulers could instead store tasks directly in maps, and then leverage RCU to ensure they can be safely accessed with low overhead. Finally, overloading fields like this is error prone. Someone that wants to use 'rcu_users' could easily overlook the fact that once the rcu callback is scheduled, the refcount will go back to being nonzero, thus precluding the use of refcount_inc_not_zero(). Furthermore, as described below, it's possible to extract the fields of the union without changing the size of task_struct. There are several possible ways to enable this: 1. The lightest touch approach is likely the one proposed in this patch, which is to simply extract 'rcu_users' and 'rcu' from the union, so that scheduling the 'rcu' callback doesn't overwrite the 'rcu_users' refcount. If we have a trusted task pointer, this would allow us to use refcnt_inc_not_zero() inside of an RCU region to determine if we can safely acquire a reference to the task and store it in a map. As mentioned below, this can be done without changing the size of task_struct, by moving the location of the union to another location that has padding gaps we can fill in. 2. Removing 'refcount_t rcu_users', and instead having the entire task be freed in an rcu callback. This is likely the most sound overall design, though it changes the behavioral semantics exposed to callers, who currently expect that a task that's successfully looked up in e.g. the pid_list with find_task_by_pid_ns(), can always have a 'usage' reference acquired on them, as it's guaranteed to be > 0 until after the next gp. In order for this approach to work, we'd have to audit all callers. This approach also slightly changes behavior observed by user space by not invoking trace_sched_process_free() until the whole task_struct is actually being freed, rather than just after it's exited. It also may change timings, as memory will be freed in an RCU callback rather than immediately when the final 'usage' refcount drops to 0. This also is arguably a benefit, as it provides more predictable performance to callers who are refcounting tasks. 3. There may be other solutions as well that don't require changing the layout of task_struct. For example, we could possibly do something complex from the BPF side, such as listen for task exit and remove a task from a map when the task is exiting. This would likely require significant custom handling for task_struct in the verifier, so a more generalizable solution is likely warranted. As mentioned above, this patch proposes the lightest-touch approach which allows callers elsewhere in the kernel to use 'rcu_users' to ensure the lifetime of a task, by extracting 'rcu_users' and 'rcu' from the union. There is no size change in task_struct with this patch. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: David Vernet <void@manifault.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20230215233033.889644-1-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-13Merge drm/drm-fixes into drm-misc-fixesThomas Zimmermann
Backmerging to get latest upstream. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2023-03-13PCI: s390: Fix use-after-free of PCI resources with per-function hotplugNiklas Schnelle
On s390 PCI functions may be hotplugged individually even when they belong to a multi-function device. In particular on an SR-IOV device VFs may be removed and later re-added. In commit a50297cf8235 ("s390/pci: separate zbus creation from scanning") it was missed however that struct pci_bus and struct zpci_bus's resource list retained a reference to the PCI functions MMIO resources even though those resources are released and freed on hot-unplug. These stale resources may subsequently be claimed when the PCI function re-appears resulting in use-after-free. One idea of fixing this use-after-free in s390 specific code that was investigated was to simply keep resources around from the moment a PCI function first appeared until the whole virtual PCI bus created for a multi-function device disappears. The problem with this however is that due to the requirement of artificial MMIO addreesses (address cookies) extra logic is then needed to keep the address cookies compatible on re-plug. At the same time the MMIO resources semantically belong to the PCI function so tying their lifecycle to the function seems more logical. Instead a simpler approach is to remove the resources of an individually hot-unplugged PCI function from the PCI bus's resource list while keeping the resources of other PCI functions on the PCI bus untouched. This is done by introducing pci_bus_remove_resource() to remove an individual resource. Similarly the resource also needs to be removed from the struct zpci_bus's resource list. It turns out however, that there is really no need to add the MMIO resources to the struct zpci_bus's resource list at all and instead we can simply use the zpci_bar_struct's resource pointer directly. Fixes: a50297cf8235 ("s390/pci: separate zbus creation from scanning") Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://lore.kernel.org/r/20230306151014.60913-2-schnelle@linux.ibm.com Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-03-12Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Bug fixes and regressions for ext4, the most serious of which is a potential deadlock during directory renames that was introduced during the merge window discovered by a combination of syzbot and lockdep" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: zero i_disksize when initializing the bootloader inode ext4: make sure fs error flag setted before clear journal error ext4: commit super block if fs record error when journal record without error ext4, jbd2: add an optimized bmap for the journal inode ext4: fix WARNING in ext4_update_inline_data ext4: move where set the MAY_INLINE_DATA flag is set ext4: Fix deadlock during directory rename ext4: Fix comment about the 64BIT feature docs: ext4: modify the group desc size to 64 ext4: fix another off-by-one fsmap error on 1k block filesystems ext4: fix RENAME_WHITEOUT handling for inline directories ext4: make kobj_type structures constant ext4: fix cgroup writeback accounting with fs-layer encryption
2023-03-12cpumask: relax sanity checking constraintsLinus Torvalds
The cpumask_check() was unnecessarily tight, and causes problems for the users of cpumask_next(). We have a number of users that take the previous return value of one of the bit scanning functions and subtract one to keep it in "range". But since the scanning functions end up returning up to 'small_cpumask_bits' instead of the tighter 'nr_cpumask_bits', the range really needs to be using that widened form. [ This "previous-1" behavior is also the reason we have all those comments about /* -1 is a legal arg here. */ and separate checks for that being ok. So we could have just made "small_cpumask_bits-1" be a similar special "don't check this" value. Tetsuo Handa even suggested a patch that only does that for cpumask_next(), since that seems to be the only actual case that triggers, but that all makes it even _more_ magical and special. So just relax the check ] One example of this kind of pattern being the 'c_start()' function in arch/x86/kernel/cpu/proc.c, but also duplicated in various forms on other architectures. Reported-by: syzbot+96cae094d90877641f32@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=96cae094d90877641f32 Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Link: https://lore.kernel.org/lkml/c1f4cc16-feea-b83c-82cf-1a1f007b7eb9@I-love.SAKURA.ne.jp/ Fixes: 596ff4a09b89 ("cpumask: re-introduce constant-sized cpumask optimizations") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-11Merge tag 'i2c-for-6.3-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c updates from Wolfram Sang: "This marks the end of a transition to let I2C have the same probe semantics as other subsystems. Uwe took care that no drivers in the current tree nor in -next use the deprecated .probe call. So, it is a good time to switch to the new, standard semantics now. There is also a regression fix: - regression fix for the notifier handling of the I2C core - final coversions of drivers away from deprecated .probe - make .probe_new the standard probe and convert I2C core to use it * tag 'i2c-for-6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: dev: Fix bus callback return values i2c: Convert drivers to new .probe() callback i2c: mux: Convert all drivers to new .probe() callback i2c: Switch .probe() to not take an id parameter media: i2c: ov2685: convert to i2c's .probe_new() media: i2c: ov5695: convert to i2c's .probe_new() w1: ds2482: Convert to i2c's .probe_new() serial: sc16is7xx: Convert to i2c's .probe_new() mtd: maps: pismo: Convert to i2c's .probe_new() misc: ad525x_dpot-i2c: Convert to i2c's .probe_new()
2023-03-11ext4, jbd2: add an optimized bmap for the journal inodeTheodore Ts'o
The generic bmap() function exported by the VFS takes locks and does checks that are not necessary for the journal inode. So allow the file system to set a journal-optimized bmap function in journal->j_bmap. Reported-by: syzbot+9543479984ae9e576000@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=e4aaa78795e490421c79f76ec3679006c8ff4cf0 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-10Merge tag 'pull-highmem' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull put_and_unmap_page() helper from Al Viro: "kmap_local_page() conversions in local filesystems keep running into kunmap_local_page()+put_page() combinations. We can keep inventing names for identical inline helpers, but it's getting rather inconvenient. I've added a trivial helper to linux/highmem.h instead. I would've held that back until the merge window, if not for the mess it causes in tree topology - I've several branches merging from that one, and it's only going to get worse if e.g. ext2 stuff gets picked by Jan" * tag 'pull-highmem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: new helper: put_and_unmap_page()
2023-03-10bpf: Support __kptr to local kptrsDave Marchevsky
If a PTR_TO_BTF_ID type comes from program BTF - not vmlinux or module BTF - it must have been allocated by bpf_obj_new and therefore must be free'd with bpf_obj_drop. Such a PTR_TO_BTF_ID is considered a "local kptr" and is tagged with MEM_ALLOC type tag by bpf_obj_new. This patch adds support for treating __kptr-tagged pointers to "local kptrs" as having an implicit bpf_obj_drop destructor for referenced kptr acquire / release semantics. Consider the following example: struct node_data { long key; long data; struct bpf_rb_node node; }; struct map_value { struct node_data __kptr *node; }; struct { __uint(type, BPF_MAP_TYPE_ARRAY); __type(key, int); __type(value, struct map_value); __uint(max_entries, 1); } some_nodes SEC(".maps"); If struct node_data had a matching definition in kernel BTF, the verifier would expect a destructor for the type to be registered. Since struct node_data does not match any type in kernel BTF, the verifier knows that there is no kfunc that provides a PTR_TO_BTF_ID to this type, and that such a PTR_TO_BTF_ID can only come from bpf_obj_new. So instead of searching for a registered dtor, a bpf_obj_drop dtor can be assumed. This allows the runtime to properly destruct such kptrs in bpf_obj_free_fields, which enables maps to clean up map_vals w/ such kptrs when going away. Implementation notes: * "kernel_btf" variable is renamed to "kptr_btf" in btf_parse_kptr. Before this patch, the variable would only ever point to vmlinux or module BTFs, but now it can point to some program BTF for local kptr type. It's later used to populate the (btf, btf_id) pair in kptr btf field. * It's necessary to btf_get the program BTF when populating btf_field for local kptr. btf_record_free later does a btf_put. * Behavior for non-local referenced kptrs is not modified, as bpf_find_btf_id helper only searches vmlinux and module BTFs for matching BTF type. If such a type is found, btf_field_kptr's btf will pass btf_is_kernel check, and the associated release function is some one-argument dtor. If btf_is_kernel check fails, associated release function is two-arg bpf_obj_drop_impl. Before this patch only btf_field_kptr's w/ kernel or module BTFs were created. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230310230743.2320707-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10bpf: Change btf_record_find enum parameter to field_maskDave Marchevsky
btf_record_find's 3rd parameter can be multiple enum btf_field_type's masked together. The function is called with BPF_KPTR in two places in verifier.c, so it works with masked values already. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230309180111.1618459-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10bpf: Add bpf_selem_free()Martin KaFai Lau
This patch refactors the selem freeing logic into bpf_selem_free(). It is a preparation work for a later patch using bpf_mem_cache_alloc/free. The other kfree(selem) cases are also changed to bpf_selem_free(..., reuse_now = true). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-10-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10bpf: Repurpose use_trace_rcu to reuse_now in bpf_local_storageMartin KaFai Lau
This patch re-purpose the use_trace_rcu to mean if the freed memory can be reused immediately or not. The use_trace_rcu is renamed to reuse_now. Other than the boolean test is reversed, it should be a no-op. The following explains the reason for the rename and how it will be used in a later patch. In a later patch, bpf_mem_cache_alloc/free will be used in the bpf_local_storage. The bpf mem allocator will reuse the freed memory immediately. Some of the free paths in bpf_local_storage does not support memory to be reused immediately. These paths are the "delete" elem cases from the bpf_*_storage_delete() helper and the map_delete_elem() syscall. Note that "delete" elem before the owner's (sk/task/cgrp/inode) lifetime ended is not the common usage for the local storage. The common free path, bpf_local_storage_destroy(), can reuse the memory immediately. This common path means the storage stays with its owner until the owner is destroyed. The above mentioned "delete" elem paths that cannot reuse immediately always has the 'use_trace_rcu == true'. The cases that is safe for immediate reuse always have 'use_trace_rcu == false'. Instead of adding another arg in a later patch, this patch re-purpose this arg to reuse_now and have the test logic reversed. In a later patch, 'reuse_now == true' will free to the bpf_mem_cache_free() where the memory can be reused immediately. 'reuse_now == false' will go through the call_rcu_tasks_trace(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-7-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10bpf: Remember smap in bpf_local_storageMartin KaFai Lau
This patch remembers which smap triggers the allocation of a 'struct bpf_local_storage' object. The local_storage is allocated during the very first selem added to the owner. The smap pointer is needed when using the bpf_mem_cache_free in a later patch because it needs to free to the correct smap's bpf_mem_alloc object. When a selem is being removed, it needs to check if it is the selem that triggers the creation of the local_storage. If it is, the local_storage->smap pointer will be reset to NULL. This NULL reset is done under the local_storage->lock in bpf_selem_unlink_storage_nolock() when a selem is being removed. Also note that the local_storage may not go away even local_storage->smap is NULL because there may be other selem still stored in the local_storage. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-6-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10bpf: Refactor codes into bpf_local_storage_destroyMartin KaFai Lau
This patch first renames bpf_local_storage_unlink_nolock to bpf_local_storage_destroy(). It better reflects that it is only used when the storage's owner (sk/task/cgrp/inode) is being kfree(). All bpf_local_storage_destroy's caller is taking the spin lock and then free the storage. This patch also moves these two steps into the bpf_local_storage_destroy. This is a preparation work for a later patch that uses bpf_mem_cache_alloc/free in the bpf_local_storage. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-3-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10bpf: Move a few bpf_local_storage functions to static scopeMartin KaFai Lau
This patch moves the bpf_local_storage_free_rcu() and bpf_selem_unlink_map() to static because they are not used outside of bpf_local_storage.c. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10bpf: ensure state checkpointing at iter_next() call sitesAndrii Nakryiko
State equivalence check and checkpointing performed in is_state_visited() employs certain heuristics to try to save memory by avoiding state checkpoints if not enough jumps and instructions happened since last checkpoint. This leads to unpredictability of whether a particular instruction will be checkpointed and how regularly. While normally this is not causing much problems (except inconveniences for predictable verifier tests, which we overcome with BPF_F_TEST_STATE_FREQ flag), turns out it's not the case for open-coded iterators. Checking and saving state checkpoints at iter_next() call is crucial for fast convergence of open-coded iterator loop logic, so we need to force it. If we don't do that, is_state_visited() might skip saving a checkpoint, causing unnecessarily long sequence of not checkpointed instructions and jumps, leading to exhaustion of jump history buffer, and potentially other undesired outcomes. It is expected that with correct open-coded iterators convergence will happen quickly, so we don't run a risk of exhausting memory. This patch adds, in addition to prune and jump instruction marks, also a "forced checkpoint" mark, and makes sure that any iter_next() call instruction is marked as such. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230310060149.625887-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10fbdev: Fix incorrect page mapping clearance at fb_deferred_io_release()Takashi Iwai
The recent fix for the deferred I/O by the commit 3efc61d95259 ("fbdev: Fix invalid page access after closing deferred I/O devices") caused a regression when the same fb device is opened/closed while it's being used. It resulted in a frozen screen even if something is redrawn there after the close. The breakage is because the patch was made under a wrong assumption of a single open; in the current code, fb_deferred_io_release() cleans up the page mapping of the pageref list and it calls cancel_delayed_work_sync() unconditionally, where both are no correct behavior for multiple opens. This patch adds a refcount for the opens of the device, and applies the cleanup only when all files get closed. As both fb_deferred_io_open() and _close() are called always in the fb_info lock (mutex), it's safe to use the normal int for the refcounting. Also, a useless BUG_ON() is dropped. Fixes: 3efc61d95259 ("fbdev: Fix invalid page access after closing deferred I/O devices") Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Reviewed-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20230308105012.1845-1-tiwai@suse.de
2023-03-09Merge branch 'main' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== Netfilter updates for net-next 1. nf_tables 'brouting' support, from Sriram Yagnaraman. 2. Update bridge netfilter and ovs conntrack helpers to handle IPv6 Jumbo packets properly, i.e. fetch the packet length from hop-by-hop extension header, from Xin Long. This comes with a test BIG TCP test case, added to tools/testing/selftests/net/. 3. Fix spelling and indentation in conntrack, from Jeremy Sowden. * 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nat: fix indentation of function arguments netfilter: conntrack: fix typo selftests: add a selftest for big tcp netfilter: use nf_ip6_check_hbh_len in nf_ct_skb_network_trim netfilter: move br_nf_check_hbh_len to utils netfilter: bridge: move pskb_trim_rcsum out of br_nf_check_hbh_len netfilter: bridge: check len before accessing more nh data netfilter: bridge: call pskb_may_pull in br_nf_check_hbh_len netfilter: bridge: introduce broute meta statement ==================== Link: https://lore.kernel.org/r/20230308193033.13965-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09netlink: remove unused 'compare' functionFlorian Westphal
No users in the tree. Tested with allmodconfig build. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230308142006.20879-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Documentation/bpf/bpf_devel_QA.rst b7abcd9c656b ("bpf, doc: Link to submitting-patches.rst for general patch submission info") d56b0c461d19 ("bpf, docs: Fix link to netdev-FAQ target") https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09clk: Avoid invalid function names in CLK_OF_DECLARE()Nathan Chancellor
After commit c28cd1f3433c ("clk: Mark a fwnode as initialized when using CLK_OF_DECLARE() macro"), drivers/clk/mvebu/kirkwood.c fails to build: drivers/clk/mvebu/kirkwood.c:358:1: error: expected identifier or '(' CLK_OF_DECLARE(98dx1135_clk, "marvell,mv98dx1135-core-clock", ^ include/linux/clk-provider.h:1367:21: note: expanded from macro 'CLK_OF_DECLARE' static void __init name##_of_clk_init_declare(struct device_node *np) \ ^ <scratch space>:124:1: note: expanded from here 98dx1135_clk_of_clk_init_declare ^ drivers/clk/mvebu/kirkwood.c:358:1: error: invalid digit 'd' in decimal constant include/linux/clk-provider.h:1372:34: note: expanded from macro 'CLK_OF_DECLARE' OF_DECLARE_1(clk, name, compat, name##_of_clk_init_declare) ^ <scratch space>:125:3: note: expanded from here 98dx1135_clk_of_clk_init_declare ^ drivers/clk/mvebu/kirkwood.c:358:1: error: invalid digit 'd' in decimal constant include/linux/clk-provider.h:1372:34: note: expanded from macro 'CLK_OF_DECLARE' OF_DECLARE_1(clk, name, compat, name##_of_clk_init_declare) ^ <scratch space>:125:3: note: expanded from here 98dx1135_clk_of_clk_init_declare ^ drivers/clk/mvebu/kirkwood.c:358:1: error: invalid digit 'd' in decimal constant include/linux/clk-provider.h:1372:34: note: expanded from macro 'CLK_OF_DECLARE' OF_DECLARE_1(clk, name, compat, name##_of_clk_init_declare) ^ <scratch space>:125:3: note: expanded from here 98dx1135_clk_of_clk_init_declare ^ C function names must start with either an alphabetic letter or an underscore. To avoid generating invalid function names from clock names, add two underscores to the beginning of the identifier. Fixes: c28cd1f3433c ("clk: Mark a fwnode as initialized when using CLK_OF_DECLARE() macro") Suggested-by: Saravana Kannan <saravanak@google.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20230308-clk_of_declare-fix-v1-1-317b741e2532@kernel.org Reviewed-by: Saravana Kannan <saravanak@google.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2023-03-09i2c: Switch .probe() to not take an id parameterUwe Kleine-König
Commit b8a1a4cd5a98 ("i2c: Provide a temporary .probe_new() call-back type") introduced a new probe callback to convert i2c init routines to not take an i2c_device_id parameter. Now that all in-tree drivers are converted to the temporary .probe_new() callback, .probe() can be modified to match the desired prototype. Now that .probe() and .probe_new() have the same semantic, they can be defined as members of an anonymous union to save some memory and simplify the core code a bit. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Wolfram Sang <wsa@kernel.org>
2023-03-09Merge tag 'for-linus-2023030901' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID fixes from Benjamin Tissoires: - fix potential out of bound write of zeroes in HID core with a specially crafted uhid device (Lee Jones) - fix potential use-after-free in work function in intel-ish-hid (Reka Norman) - selftests config fixes (Benjamin Tissoires) - few device small fixes and support * tag 'for-linus-2023030901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: HID: intel-ish-hid: ipc: Fix potential use-after-free in work function HID: logitech-hidpp: Add support for Logitech MX Master 3S mouse HID: cp2112: Fix driver not registering GPIO IRQ chip as threaded selftest: hid: fix hid_bpf not set in config HID: uhid: Over-ride the default maximum data buffer value with our own HID: core: Provide new max_buffer_size attribute to over-ride the default
2023-03-08bpf: implement numbers iteratorAndrii Nakryiko
Implement the first open-coded iterator type over a range of integers. It's public API consists of: - bpf_iter_num_new() constructor, which accepts [start, end) range (that is, start is inclusive, end is exclusive). - bpf_iter_num_next() which will keep returning read-only pointer to int until the range is exhausted, at which point NULL will be returned. If bpf_iter_num_next() is kept calling after this, NULL will be persistently returned. - bpf_iter_num_destroy() destructor, which needs to be called at some point to clean up iterator state. BPF verifier enforces that iterator destructor is called at some point before BPF program exits. Note that `start = end = X` is a valid combination to setup an empty iterator. bpf_iter_num_new() will return 0 (success) for any such combination. If bpf_iter_num_new() detects invalid combination of input arguments, it returns error, resets iterator state to, effectively, empty iterator, so any subsequent call to bpf_iter_num_next() will keep returning NULL. BPF verifier has no knowledge that returned integers are in the [start, end) value range, as both `start` and `end` are not statically known and enforced: they are runtime values. While the implementation is pretty trivial, some care needs to be taken to avoid overflows and underflows. Subsequent selftests will validate correctness of [start, end) semantics, especially around extremes (INT_MIN and INT_MAX). Similarly to bpf_loop(), we enforce that no more than BPF_MAX_LOOPS can be specified. bpf_iter_num_{new,next,destroy}() is a logical evolution from bounded BPF loops and bpf_loop() helper and is the basis for implementing ergonomic BPF loops with no statically known or verified bounds. Subsequent patches implement bpf_for() macro, demonstrating how this can be wrapped into something that works and feels like a normal for() loop in C language. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230308184121.1165081-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08bpf: add support for open-coded iterator loopsAndrii Nakryiko
Teach verifier about the concept of the open-coded (or inline) iterators. This patch adds generic iterator loop verification logic, new STACK_ITER stack slot type to contain iterator state, and necessary kfunc plumbing for iterator's constructor, destructor and next methods. Next patch implements first specific iterator (numbers iterator for implementing for() loop logic). Such split allows to have more focused commits for verifier logic and separate commit that we could point later to demonstrating what does it take to add a new kind of iterator. Each kind of iterator has its own associated struct bpf_iter_<type>, where <type> denotes a specific type of iterator. struct bpf_iter_<type> state is supposed to live on BPF program stack, so there will be no way to change its size later on without breaking backwards compatibility, so choose wisely! But given this struct is specific to a given <type> of iterator, this allows a lot of flexibility: simple iterators could be fine with just one stack slot (8 bytes), like numbers iterator in the next patch, while some other more complicated iterators might need way more to keep their iterator state. Either way, such design allows to avoid runtime memory allocations, which otherwise would be necessary if we fixed on-the-stack size and it turned out to be too small for a given iterator implementation. The way BPF verifier logic is implemented, there are no artificial restrictions on a number of active iterators, it should work correctly using multiple active iterators at the same time. This also means you can have multiple nested iteration loops. struct bpf_iter_<type> reference can be safely passed to subprograms as well. General flow is easiest to demonstrate with a simple example using number iterator implemented in next patch. Here's the simplest possible loop: struct bpf_iter_num it; int *v; bpf_iter_num_new(&it, 2, 5); while ((v = bpf_iter_num_next(&it))) { bpf_printk("X = %d", *v); } bpf_iter_num_destroy(&it); Above snippet should output "X = 2", "X = 3", "X = 4". Note that 5 is exclusive and is not returned. This matches similar APIs (e.g., slices in Go or Rust) that implement a range of elements, where end index is non-inclusive. In the above example, we see a trio of function: - constructor, bpf_iter_num_new(), which initializes iterator state (struct bpf_iter_num it) on the stack. If any of the input arguments are invalid, constructor should make sure to still initialize it such that subsequent bpf_iter_num_next() calls will return NULL. I.e., on error, return error and construct empty iterator. - next method, bpf_iter_num_next(), which accepts pointer to iterator state and produces an element. Next method should always return a pointer. The contract between BPF verifier is that next method will always eventually return NULL when elements are exhausted. Once NULL is returned, subsequent next calls should keep returning NULL. In the case of numbers iterator, bpf_iter_num_next() returns a pointer to an int (storage for this integer is inside the iterator state itself), which can be dereferenced after corresponding NULL check. - once done with the iterator, it's mandated that user cleans up its state with the call to destructor, bpf_iter_num_destroy() in this case. Destructor frees up any resources and marks stack space used by struct bpf_iter_num as usable for something else. Any other iterator implementation will have to implement at least these three methods. It is enforced that for any given type of iterator only applicable constructor/destructor/next are callable. I.e., verifier ensures you can't pass number iterator state into, say, cgroup iterator's next method. It is important to keep the naming pattern consistent to be able to create generic macros to help with BPF iter usability. E.g., one of the follow up patches adds generic bpf_for_each() macro to bpf_misc.h in selftests, which allows to utilize iterator "trio" nicely without having to code the above somewhat tedious loop explicitly every time. This is enforced at kfunc registration point by one of the previous patches in this series. At the implementation level, iterator state tracking for verification purposes is very similar to dynptr. We add STACK_ITER stack slot type, reserve necessary number of slots, depending on sizeof(struct bpf_iter_<type>), and keep track of necessary extra state in the "main" slot, which is marked with non-zero ref_obj_id. Other slots are also marked as STACK_ITER, but have zero ref_obj_id. This is simpler than having a separate "is_first_slot" flag. Another big distinction is that STACK_ITER is *always refcounted*, which simplifies implementation without sacrificing usability. So no need for extra "iter_id", no need to anticipate reuse of STACK_ITER slots for new constructors, etc. Keeping it simple here. As far as the verification logic goes, there are two extensive comments: in process_iter_next_call() and iter_active_depths_differ() explaining some important and sometimes subtle aspects. Please refer to them for details. But from 10,000-foot point of view, next methods are the points of forking a verification state, which are conceptually similar to what verifier is doing when validating conditional jump. We branch out at a `call bpf_iter_<type>_next` instruction and simulate two outcomes: NULL (iteration is done) and non-NULL (new element is returned). NULL is simulated first and is supposed to reach exit without looping. After that non-NULL case is validated and it either reaches exit (for trivial examples with no real loop), or reaches another `call bpf_iter_<type>_next` instruction with the state equivalent to already (partially) validated one. State equivalency at that point means we technically are going to be looping forever without "breaking out" out of established "state envelope" (i.e., subsequent iterations don't add any new knowledge or constraints to the verifier state, so running 1, 2, 10, or a million of them doesn't matter). But taking into account the contract stating that iterator next method *has to* return NULL eventually, we can conclude that loop body is safe and will eventually terminate. Given we validated logic outside of the loop (NULL case), and concluded that loop body is safe (though potentially looping many times), verifier can claim safety of the overall program logic. The rest of the patch is necessary plumbing for state tracking, marking, validation, and necessary further kfunc plumbing to allow implementing iterator constructor, destructor, and next methods. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230308184121.1165081-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08bpf: add iterator kfuncs registration and validation logicAndrii Nakryiko
Add ability to register kfuncs that implement BPF open-coded iterator contract and enforce naming and function proto convention. Enforcement happens at the time of kfunc registration and significantly simplifies the rest of iterators logic in the verifier. More details follow in subsequent patches, but we enforce the following conditions. All kfuncs (constructor, next, destructor) have to be named consistenly as bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents iterator type, and iterator state should be represented as a matching `struct bpf_iter_<type>` state type. Also, all iter kfuncs should have a pointer to this `struct bpf_iter_<type>` as the very first argument. Additionally: - Constructor, i.e., bpf_iter_<type>_new(), can have arbitrary extra number of arguments. Return type is not enforced either. - Next method, i.e., bpf_iter_<type>_next(), has to return a pointer type and should have exactly one argument: `struct bpf_iter_<type> *` (const/volatile/restrict and typedefs are ignored). - Destructor, i.e., bpf_iter_<type>_destroy(), should return void and should have exactly one argument, similar to the next method. - struct bpf_iter_<type> size is enforced to be positive and a multiple of 8 bytes (to fit stack slots correctly). Such strictness and consistency allows to build generic helpers abstracting important, but boilerplate, details to be able to use open-coded iterators effectively and ergonomically (see bpf_for_each() in subsequent patches). It also simplifies the verifier logic in some places. At the same time, this doesn't hurt generality of possible iterator implementations. Win-win. Constructor kfunc is marked with a new KF_ITER_NEW flags, next method is marked with KF_ITER_NEXT (and should also have KF_RET_NULL, of course), while destructor kfunc is marked as KF_ITER_DESTROY. Additionally, we add a trivial kfunc name validation: it should be a valid non-NULL and non-empty string. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230308184121.1165081-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Andrii Nakryiko says: ==================== pull-request: bpf-next 2023-03-08 We've added 23 non-merge commits during the last 2 day(s) which contain a total of 28 files changed, 414 insertions(+), 104 deletions(-). The main changes are: 1) Add more precise memory usage reporting for all BPF map types, from Yafang Shao. 2) Add ARM32 USDT support to libbpf, from Puranjay Mohan. 3) Fix BTF_ID_LIST size causing problems in !CONFIG_DEBUG_INFO_BTF, from Nathan Chancellor. 4) IMA selftests fix, from Roberto Sassu. 5) libbpf fix in APK support code, from Daniel Müller. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits) selftests/bpf: Fix IMA test libbpf: USDT arm arg parsing support libbpf: Refactor parse_usdt_arg() to re-use code libbpf: Fix theoretical u32 underflow in find_cd() function bpf: enforce all maps having memory usage callback bpf: offload map memory usage bpf, net: xskmap memory usage bpf, net: sock_map memory usage bpf, net: bpf_local_storage memory usage bpf: local_storage memory usage bpf: bpf_struct_ops memory usage bpf: queue_stack_maps memory usage bpf: devmap memory usage bpf: cpumap memory usage bpf: bloom_filter memory usage bpf: ringbuf memory usage bpf: reuseport_array memory usage bpf: stackmap memory usage bpf: arraymap memory usage bpf: hashtab memory usage ... ==================== Link: https://lore.kernel.org/r/20230308193533.1671597-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-08netfilter: move br_nf_check_hbh_len to utilsXin Long
Rename br_nf_check_hbh_len() to nf_ip6_check_hbh_len() and move it to netfilter utils, so that it can be used by other modules, like ovs and tc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08net: reclaim skb->scm_io_uring bitEric Dumazet
Commit 0091bfc81741 ("io_uring/af_unix: defer registered files gc to io_uring release") added one bit to struct sk_buff. This structure is critical for networking, and we try very hard to not add bloat on it, unless absolutely required. For instance, we can use a specific destructor as a wrapper around unix_destruct_scm(), to identify skbs that unix_gc() has to special case. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-07net: remove enum skb_free_reasonEric Dumazet
enum skb_drop_reason is more generic, we can adopt it instead. Provide dev_kfree_skb_irq_reason() and dev_kfree_skb_any_reason(). This means drivers can use more precise drop reasons if they want to. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Link: https://lore.kernel.org/r/20230306204313.10492-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-07net: phy: improve phy_read_poll_timeoutHeiner Kallweit
cond sometimes is (val & MASK) what may result in a false positive if val is a negative errno. We shouldn't evaluate cond if val < 0. This has no functional impact here, but it's not nice. Therefore switch order of the checks. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/6d8274ac-4344-23b4-d9a3-cad4c39517d4@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-07interconnect: fix provider registration APIJohan Hovold
The current interconnect provider interface is inherently racy as providers are expected to be added before being fully initialised. Specifically, nodes are currently not added and the provider data is not initialised until after registering the provider which can cause racing DT lookups to fail. Add a new provider API which will be used to fix up the interconnect drivers. The old API is reimplemented using the new interface and will be removed once all drivers have been fixed. Fixes: 11f1ceca7031 ("interconnect: Add generic on-chip interconnect API") Fixes: 87e3031b6fbd ("interconnect: Allow endpoints translation via DT") Cc: stable@vger.kernel.org # 5.1 Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Tested-by: Luca Ceresoli <luca.ceresoli@bootlin.com> # i.MX8MP MSC SM2-MB-EP1 Board Link: https://lore.kernel.org/r/20230306075651.2449-4-johan+linaro@kernel.org Signed-off-by: Georgi Djakov <djakov@kernel.org>
2023-03-07cpumask: be more careful with 'cpumask_setall()'Linus Torvalds
Commit 596ff4a09b89 ("cpumask: re-introduce constant-sized cpumask optimizations") changed cpumask_setall() to use "bitmap_set()" instead of "bitmap_fill()", because bitmap_fill() would explicitly set all the bits of a constant sized small bitmap, and that's exactly what we don't want: we want to only set bits up to 'nr_cpu_ids', which is what "bitmap_set()" does. However, Yury correctly points out that while "bitmap_set()" does indeed only set bits up to the required bitmap size, it doesn't _clear_ bits above that size, so the upper bits would still not have well-defined values. Now, none of this should really matter, since any bits set past 'nr_cpu_ids' should always be ignored in the first place. Yes, the bit scanning functions might return them as a result, but since users should always consider the ">= nr_cpu_ids" condition to mean "no more bits", that shouldn't have any actual effect (see previous commit 8ca09d5fa354 "cpumask: fix incorrect cpumask scanning result checks"). But let's just do it right, the way the code was _intended_ to work. We have had enough lazy code that works but bites us in the *rse later (again, see previous commit) that there's no reason to not just do this properly. It turns out that "bitmap_fill()" gets this all right for the complex case, and really only fails for the inlined optimized case that just fills the whole word. And while we could just fix bitmap_fill() to use the proper last word mask, there's two issues with that: - the cpumask case wants to do the _optimization_ based on "NR_CPUS is a small constant", but then wants to do the actual bit _fill_ based on "nr_cpu_ids" that isn't necessarily that same constant - we have lots of non-cpumask users of bitmap_fill(), and while they hopefully don't care, and probably would want the proper semantics anyway ("only set bits up to the limit"), I do not want the cpumask changes to impact other parts So this ends up just doing the single-word optimization by hand in the cpumask code. If our cpumask is fundamentally limited to a single word, just do the proper "fill in that word" exactly. And if it's the more complex multi-word case, then the generic bitmap_fill() will DTRT. This is all an example of how our bitmap function optimizations really are somewhat broken. They conflate the "this is size of the bitmap" optimizations with the actual bit(s) we want to set. In many cases we really want to have the two be separate things: sometimes we base our optimizations on the size of the whole bitmap ("I know this whole bitmap fits in a single word, so I'll just use single-word accesses"), and sometimes we base them on the bit we are looking at ("this is just acting on bits that are in the first word, so I'll use single-word accesses"). Notice how the end result of the two optimizations are the same, but the way we get to them are quite different. And all our cpumask optimization games are really about that fundamental distinction, and we'd often really want to pass in both the "this is the bit I'm working on" (which _can_ be a small constant but might be variable), and "I know it's in this range even if it's variable" (based on CONFIG_NR_CPUS). So this cpumask_setall() implementation just makes that explicit. It checks the "I statically know the size is small" using the known static size of the cpumask (which is what that 'small_cpumask_bits' is all about), but then sets the actual bits using the exact number of cpus we have (ie 'nr_cpumask_bits') Of course, in a perfect world, the compiler would have done all the range analysis (possibly with help from us just telling it that "this value is always in this range"), and would do all of this for us. But that is not the world we live in. While we dream of that perfect world, this does that manual logic to make it all work out. And this was a very long explanation for a small code change that shouldn't even matter. Reported-by: Yury Norov <yury.norov@gmail.com> Link: https://lore.kernel.org/lkml/ZAV9nGG9e1%2FrV+L%2F@yury-laptop/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-07bpf: offload map memory usageYafang Shao
A new helper is introduced to calculate offload map memory usage. But currently the memory dynamically allocated in netdev dev_ops, like nsim_map_update_elem, is not counted. Let's just put it aside now. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20230305124615.12358-18-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07bpf, net: bpf_local_storage memory usageYafang Shao
A new helper is introduced into bpf_local_storage map to calculate the memory usage. This helper is also used by other maps like bpf_cgrp_storage, bpf_inode_storage, bpf_task_storage and etc. Note that currently the dynamically allocated storage elements are not counted in the usage, since it will take extra runtime overhead in the elements update or delete path. So let's put it aside now, and implement it in the future when someone really need it. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20230305124615.12358-15-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07bpf: add new map ops ->map_mem_usageYafang Shao
Add a new map ops ->map_mem_usage to print the memory usage of a bpf map. This is a preparation for the followup change. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20230305124615.12358-2-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07bpf: Increase size of BTF_ID_LIST without CONFIG_DEBUG_INFO_BTF againNathan Chancellor
After commit 66e3a13e7c2c ("bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr"), clang builds without CONFIG_DEBUG_INFO_BTF warn: kernel/bpf/verifier.c:10298:24: warning: array index 16 is past the end of the array (that has type 'u32[16]' (aka 'unsigned int[16]')) [-Warray-bounds] meta.func_id == special_kfunc_list[KF_bpf_dynptr_slice_rdwr]) { ^ ~~~~~~~~~~~~~~~~~~~~~~~~ kernel/bpf/verifier.c:9150:1: note: array 'special_kfunc_list' declared here BTF_ID_LIST(special_kfunc_list) ^ include/linux/btf_ids.h:207:27: note: expanded from macro 'BTF_ID_LIST' #define BTF_ID_LIST(name) static u32 __maybe_unused name[16]; ^ 1 warning generated. A warning of this nature was previously addressed by commit beb3d47d1d3d ("bpf: Fix a BTF_ID_LIST bug with CONFIG_DEBUG_INFO_BTF not set") but there have been new kfuncs added since then. Quadruple the size of the CONFIG_DEBUG_INFO_BTF=n definition so that this problem is unlikely to show up for some time. Link: https://github.com/ClangBuiltLinux/linux/issues/1810 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Tom Rix <trix@redhat.com> Link: https://lore.kernel.org/r/20230307-bpf-kfuncs-warray-bounds-v1-1-00ad3191f3a6@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07new helper: put_and_unmap_page()Al Viro
kunmap_local() + put_page(), as done by e.g. ext2 directory handling. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-03-06Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-03-06 We've added 85 non-merge commits during the last 13 day(s) which contain a total of 131 files changed, 7102 insertions(+), 1792 deletions(-). The main changes are: 1) Add skb and XDP typed dynptrs which allow BPF programs for more ergonomic and less brittle iteration through data and variable-sized accesses, from Joanne Koong. 2) Bigger batch of BPF verifier improvements to prepare for upcoming BPF open-coded iterators allowing for less restrictive looping capabilities, from Andrii Nakryiko. 3) Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF programs to NULL-check before passing such pointers into kfunc, from Alexei Starovoitov. 4) Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in local storage maps, from Kumar Kartikeya Dwivedi. 5) Add BPF verifier support for ST instructions in convert_ctx_access() which will help new -mcpu=v4 clang flag to start emitting them, from Eduard Zingerman. 6) Make uprobe attachment Android APK aware by supporting attachment to functions inside ELF objects contained in APKs via function names, from Daniel Müller. 7) Add a new flag BPF_F_TIMER_ABS flag for bpf_timer_start() helper to start the timer with absolute expiration value instead of relative one, from Tero Kristo. 8) Add a new kfunc bpf_cgroup_from_id() to look up cgroups via id, from Tejun Heo. 9) Extend libbpf to support users manually attaching kprobes/uprobes in the legacy/perf/link mode, from Menglong Dong. 10) Implement workarounds in the mips BPF JIT for DADDI/R4000, from Jiaxun Yang. 11) Enable mixing bpf2bpf and tailcalls for the loongarch BPF JIT, from Hengqi Chen. 12) Extend BPF instruction set doc with describing the encoding of BPF instructions in terms of how bytes are stored under big/little endian, from Jose E. Marchesi. 13) Follow-up to enable kfunc support for riscv BPF JIT, from Pu Lehui. 14) Fix bpf_xdp_query() backwards compatibility on old kernels, from Yonghong Song. 15) Fix BPF selftest cross compilation with CLANG_CROSS_FLAGS, from Florent Revest. 16) Improve bpf_cpumask_ma to only allocate one bpf_mem_cache, from Hou Tao. 17) Fix BPF verifier's check_subprogs to not unnecessarily mark a subprogram with has_tail_call, from Ilya Leoshkevich. 18) Fix arm syscall regs spec in libbpf's bpf_tracing.h, from Puranjay Mohan. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits) selftests/bpf: Add test for legacy/perf kprobe/uprobe attach mode selftests/bpf: Split test_attach_probe into multi subtests libbpf: Add support to set kprobe/uprobe attach mode tools/resolve_btfids: Add /libsubcmd to .gitignore bpf: add support for fixed-size memory pointer returns for kfuncs bpf: generalize dynptr_get_spi to be usable for iters bpf: mark PTR_TO_MEM as non-null register type bpf: move kfunc_call_arg_meta higher in the file bpf: ensure that r0 is marked scratched after any function call bpf: fix visit_insn()'s detection of BPF_FUNC_timer_set_callback helper bpf: clean up visit_insn()'s instruction processing selftests/bpf: adjust log_fixup's buffer size for proper truncation bpf: honor env->test_state_freq flag in is_state_visited() selftests/bpf: enhance align selftest's expected log matching bpf: improve regsafe() checks for PTR_TO_{MEM,BUF,TP_BUFFER} bpf: improve stack slot state printing selftests/bpf: Disassembler tests for verifier.c:convert_ctx_access() selftests/bpf: test if pointer type is tracked for BPF_ST_MEM bpf: allow ctx writes using BPF_ST_MEM instruction bpf: Use separate RCU callbacks for freeing selem ... ==================== Link: https://lore.kernel.org/r/20230307004346.27578-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-06clk: Mark a fwnode as initialized when using CLK_OF_DECLARE() macroSaravana Kannan
We already mark fwnodes as initialized when they are registered as clock providers. We do this so that fw_devlink can tell when a clock driver doesn't use the driver core framework to probe/initialize its device. This ensures fw_devlink doesn't block the consumers of such a clock provider indefinitely. However, some users of CLK_OF_DECLARE() macros don't use the same node that matches the macro as the node for the clock provider, but they initialize the entire node. To cover these cases, also mark the nodes that match the macros as initialized when the init callback function is called. An example of this is "stericsson,u8500-clks" that's handled using CLK_OF_DECLARE() and looks something like this: clocks { compatible = "stericsson,u8500-clks"; prcmu_clk: prcmu-clock { #clock-cells = <1>; }; prcc_pclk: prcc-periph-clock { #clock-cells = <2>; }; prcc_kclk: prcc-kernel-clock { #clock-cells = <2>; }; prcc_reset: prcc-reset-controller { #reset-cells = <2>; }; ... }; This patch makes sure that "clocks" is marked as initialized so that fw_devlink knows that all nodes under it have been initialized. If the driver creates struct devices for some of the subnodes, fw_devlink is smart enough to know to wait for those devices to probe, so no special handling is required for those cases. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reported-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/lkml/CACRpkdamxDX6EBVjKX5=D3rkHp17f5pwGdBVhzFU90-0MHY6dQ@mail.gmail.com/ Fixes: 4a032827daa8 ("of: property: Simplify of_link_to_phandle()") Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20230302014639.297514-1-saravanak@google.com Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Tested-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2023-03-06cpumask: Fix typo nr_cpumask_size --> nr_cpumask_bitsAndy Shevchenko
The never used nr_cpumask_size is just a typo, hence use existing redefinition that's called nr_cpumask_bits. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-05cpumask: re-introduce constant-sized cpumask optimizationsLinus Torvalds
Commit aa47a7c215e7 ("lib/cpumask: deprecate nr_cpumask_bits") resulted in the cpumask operations potentially becoming hugely less efficient, because suddenly the cpumask was always considered to be variable-sized. The optimization was then later added back in a limited form by commit 6f9c07be9d02 ("lib/cpumask: add FORCE_NR_CPUS config option"), but that FORCE_NR_CPUS option is not useful in a generic kernel and more of a special case for embedded situations with fixed hardware. Instead, just re-introduce the optimization, with some changes. Instead of depending on CPUMASK_OFFSTACK being false, and then always using the full constant cpumask width, this introduces three different cpumask "sizes": - the exact size (nr_cpumask_bits) remains identical to nr_cpu_ids. This is used for situations where we should use the exact size. - the "small" size (small_cpumask_bits) is the NR_CPUS constant if it fits in a single word and the bitmap operations thus end up able to trigger the "small_const_nbits()" optimizations. This is used for the operations that have optimized single-word cases that get inlined, notably the bit find and scanning functions. - the "large" size (large_cpumask_bits) is the NR_CPUS constant if it is an sufficiently small constant that makes simple "copy" and "clear" operations more efficient. This is arbitrarily set at four words or less. As a an example of this situation, without this fixed size optimization, cpumask_clear() will generate code like movl nr_cpu_ids(%rip), %edx addq $63, %rdx shrq $3, %rdx andl $-8, %edx callq memset@PLT on x86-64, because it would calculate the "exact" number of longwords that need to be cleared. In contrast, with this patch, using a MAX_CPU of 64 (which is quite a reasonable value to use), the above becomes a single movq $0,cpumask instruction instead, because instead of caring to figure out exactly how many CPU's the system has, it just knows that the cpumask will be a single word and can just clear it all. Note that this does end up tightening the rules a bit from the original version in another way: operations that set bits in the cpumask are now limited to the actual nr_cpu_ids limit, whereas we used to do the nr_cpumask_bits thing almost everywhere in the cpumask code. But if you just clear bits, or scan for bits, we can use the simpler compile-time constants. In the process, remove 'cpumask_complement()' and 'for_each_cpu_not()' which were not useful, and which fundamentally have to be limited to 'nr_cpu_ids'. Better remove them now than have somebody introduce use of them later. Of course, on x86-64 with MAXSMP there is no sane small compile-time constant for the cpumask sizes, and we end up using the actual CPU bits, and will generate the above kind of horrors regardless. Please don't use MAXSMP unless you really expect to have machines with thousands of cores. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-05Merge tag 'irq-urgent-2023-03-05' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "A set of updates for the interrupt susbsystem: - Prevent possible NULL pointer derefences in irq_data_get_affinity_mask() and irq_domain_create_hierarchy() - Take the per device MSI lock before invoking code which relies on it being hold - Make sure that MSI descriptors are unreferenced before freeing them. This was overlooked when the platform MSI code was converted to use core infrastructure and results in a fals positive warning - Remove dead code in the MSI subsystem - Clarify the documentation for pci_msix_free_irq() - More kobj_type constification" * tag 'irq-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/msi, platform-msi: Ensure that MSI descriptors are unreferenced genirq/msi: Drop dead domain name assignment irqdomain: Add missing NULL pointer check in irq_domain_create_hierarchy() genirq/irqdesc: Make kobj_type structures constant PCI/MSI: Clarify usage of pci_msix_free_irq() genirq/msi: Take the per-device MSI lock before validating the control structure genirq/ipi: Fix NULL pointer deref in irq_data_get_affinity_mask()
2023-03-05Remove Intel compiler supportMasahiro Yamada
include/linux/compiler-intel.h had no update in the past 3 years. We often forget about the third C compiler to build the kernel. For example, commit a0a12c3ed057 ("asm goto: eradicate CC_HAS_ASM_GOTO") only mentioned GCC and Clang. init/Kconfig defines CC_IS_GCC and CC_IS_CLANG but not CC_IS_ICC, and nobody has reported any issue. I guess the Intel Compiler support is broken, and nobody is caring about it. Harald Arnesen pointed out ICC (classic Intel C/C++ compiler) is deprecated: $ icc -v icc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message. icc version 2021.7.0 (gcc version 12.1.0 compatibility) Arnd Bergmann provided a link to the article, "Intel C/C++ compilers complete adoption of LLVM". lib/zstd/common/compiler.h and lib/zstd/compress/zstd_fast.c were kept untouched for better sync with https://github.com/facebook/zstd Link: https://www.intel.com/content/www/us/en/developer/articles/technical/adoption-of-llvm-complete-icx.html Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>