summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2023-03-14lockd: set file_lock start and end when decoding nlm4 testargsJeff Layton
Commit 6930bcbfb6ce dropped the setting of the file_lock range when decoding a nlm_lock off the wire. This causes the client side grant callback to miss matching blocks and reject the lock, only to rerequest it 30s later. Add a helper function to set the file_lock range from the start and end values that the protocol uses, and have the nlm_lock decoder call that to set up the file_lock args properly. Fixes: 6930bcbfb6ce ("lockd: detect and reject lock arguments that overflow") Reported-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Amir Goldstein <amir73il@gmail.com> Cc: stable@vger.kernel.org #6.0 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-03-14block: do not reverse request order when flushing plug listJan Kara
Commit 26fed4ac4eab ("block: flush plug based on hardware and software queue order") changed flushing of plug list to submit requests one device at a time. However while doing that it also started using list_add_tail() instead of list_add() used previously thus effectively submitting requests in reverse order. Also when forming a rq_list with remaining requests (in case two or more devices are used), we effectively reverse the ordering of the plug list for each device we process. Submitting requests in reverse order has negative impact on performance for rotational disks (when BFQ is not in use). We observe 10-25% regression in random 4k write throughput, as well as ~20% regression in MariaDB OLTP benchmark on rotational storage on btrfs filesystem. Fix the problem by preserving ordering of the plug list when inserting requests into the queuelist as well as by appending to requeue_list instead of prepending to it. Fixes: 26fed4ac4eab ("block: flush plug based on hardware and software queue order") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230313093002.11756-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-14x86/PVH: obtain VGA console info in Dom0Jan Beulich
A new platform-op was added to Xen to allow obtaining the same VGA console information PV Dom0 is handed. Invoke the new function and have the output data processed by xen_init_vga(). Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/8f315e92-7bda-c124-71cc-478ab9c5e610@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-03-14Merge tag 'kvmarm-fixes-6.3-1' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.3, part #1 A single patch to address a rather annoying bug w.r.t. guest timer offsetting. Effectively the synchronization of timer offsets between vCPUs was broken, leading to inconsistent timer reads within the VM.
2023-03-13bpf: use canonical ftrace pathRoss Zwisler
The canonical location for the tracefs filesystem is at /sys/kernel/tracing. But, from Documentation/trace/ftrace.rst: Before 4.1, all ftrace tracing control files were within the debugfs file system, which is typically located at /sys/kernel/debug/tracing. For backward compatibility, when mounting the debugfs file system, the tracefs file system will be automatically mounted at: /sys/kernel/debug/tracing Many comments and samples in the bpf code still refer to this older debugfs path, so let's update them to avoid confusion. There are a few spots where the bpf code explicitly checks both tracefs and debugfs (tools/bpf/bpftool/tracelog.c and tools/lib/api/fs/fs.c) and I've left those alone so that the tools can continue to work with both paths. Signed-off-by: Ross Zwisler <zwisler@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230313205628.1058720-2-zwisler@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-13bpf: Disable migration when freeing stashed local kptr using obj dropDave Marchevsky
When a local kptr is stashed in a map and freed when the map goes away, currently an error like the below appears: [ 39.195695] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u32:15/2875 [ 39.196549] caller is bpf_mem_free+0x56/0xc0 [ 39.196958] CPU: 15 PID: 2875 Comm: kworker/u32:15 Tainted: G O 6.2.0-13016-g22df776a9a86 #4477 [ 39.197897] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [ 39.198949] Workqueue: events_unbound bpf_map_free_deferred [ 39.199470] Call Trace: [ 39.199703] <TASK> [ 39.199911] dump_stack_lvl+0x60/0x70 [ 39.200267] check_preemption_disabled+0xbf/0xe0 [ 39.200704] bpf_mem_free+0x56/0xc0 [ 39.201032] ? bpf_obj_new_impl+0xa0/0xa0 [ 39.201430] bpf_obj_free_fields+0x1cd/0x200 [ 39.201838] array_map_free+0xad/0x220 [ 39.202193] ? finish_task_switch+0xe5/0x3c0 [ 39.202614] bpf_map_free_deferred+0xea/0x210 [ 39.203006] ? lockdep_hardirqs_on_prepare+0xe/0x220 [ 39.203460] process_one_work+0x64f/0xbe0 [ 39.203822] ? pwq_dec_nr_in_flight+0x110/0x110 [ 39.204264] ? do_raw_spin_lock+0x107/0x1c0 [ 39.204662] ? lockdep_hardirqs_on_prepare+0xe/0x220 [ 39.205107] worker_thread+0x74/0x7a0 [ 39.205451] ? process_one_work+0xbe0/0xbe0 [ 39.205818] kthread+0x171/0x1a0 [ 39.206111] ? kthread_complete_and_exit+0x20/0x20 [ 39.206552] ret_from_fork+0x1f/0x30 [ 39.206886] </TASK> This happens because the call to __bpf_obj_drop_impl I added in the patch adding support for stashing local kptrs doesn't disable migration. Prior to that patch, __bpf_obj_drop_impl logic only ran when called by a BPF progarm, whereas now it can be called from map free path, so it's necessary to explicitly disable migration. Also, refactor a bit to just call __bpf_obj_drop_impl directly instead of bothering w/ dtor union and setting pointer-to-obj_drop. Fixes: c8e187540914 ("bpf: Support __kptr to local kptrs") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230313214641.3731908-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-13net: virtio_net: implement exact header length guest featureJiri Pirko
Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when set implicates that device benefits from knowing the exact size of the header. For compatibility, to signal to the device that the header is reliable driver also needs to set this feature. Without this feature set by driver, device has to figure out the header size itself. Quoting the original virtio spec: "hdr_len is a hint to the device as to how much of the header needs to be kept to copy into each packet" "a hint" might not be clear for the reader what does it mean, if it is "maybe like that" of "exactly like that". This feature just makes it crystal clear and let the device count on the hdr_len being filled up by the exact length of header. Also note the spec already has following note about hdr_len: "Due to various bugs in implementations, this field is not useful as a guarantee of the transport header size." Without this feature the device needs to parse the header in core data path handling. Accurate information helps the device to eliminate such header parsing and directly use the hardware accelerators for GSO operation. virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb). The driver already complies to fill the correct value. Introduce the feature and advertise it. Note that virtio spec also includes following note for device implementation: "Caution should be taken by the implementation so as to prevent a malicious driver from attacking the device by setting an incorrect hdr_len." There is a plan to support this feature in our emulated device. A device of SolidRun offers this feature bit. They claim this feature will save the device a few cycles for every GSO packet. Link: https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-230006x3 Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Alvaro Karsz <alvaro.karsz@solid-run.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20230309094559.917857-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-13efi: earlycon: Reprobe after parsing config tablesArd Biesheuvel
Commit 732ea9db9d8a ("efi: libstub: Move screen_info handling to common code") reorganized the earlycon handling so that all architectures pass the screen_info data via a EFI config table instead of populating struct screen_info directly, as the latter is only possible when the EFI stub is baked into the kernel (and not into the decompressor). However, this means that struct screen_info may not have been populated yet by the time the earlycon probe takes place, and this results in a non-functional early console. So let's probe again right after parsing the config tables and populating struct screen_info. Note that this means that earlycon output starts a bit later than before, and so it may fail to capture issues that occur while doing the early EFI initialization. Fixes: 732ea9db9d8a ("efi: libstub: Move screen_info handling to common code") Reported-by: Shawn Guo <shawn.guo@linaro.org> Tested-by: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2023-03-13tasks: Extract rcu_users out of unionDavid Vernet
In commit 3fbd7ee285b2b ("tasks: Add a count of task RCU users"), a count on the number of RCU users was added to struct task_struct. This was done so as to enable the removal of task_rcu_dereference(), and allow tasks to be protected by RCU even after exiting and being removed from the runqueue. In this commit, the 'refcount_t rcu_users' field that keeps track of this refcount was put into a union co-located with 'struct rcu_head rcu', so as to avoid taking up any extra space in task_struct. This was possible to do safely, because the field was only ever decremented by a static set of specific callers, and then never incremented again. While this restriction of there only being a small, static set of users of this field has worked fine, it prevents us from leveraging the field to use RCU to protect tasks in other contexts. During tracing, for example, it would be useful to be able to collect some tasks that performed a certain operation, put them in a map, and then periodically summarize who they are, which cgroup they're in, how much CPU time they've utilized, etc. While this can currently be done with 'usage', it becomes tricky when a task is already in a map, or if a reference should only be taken if a task is valid and will not soon be reaped. Ideally, we could do something like pass a reference to a map value, and then try to acquire a reference to the task in an RCU read region by using refcount_inc_not_zero(). Similarly, in sched_ext, schedulers are using integer pids to remember tasks, and then looking them up with find_task_by_pid_ns(). This is slow, error prone, and adds complexity. It would be more convenient and performant if BPF schedulers could instead store tasks directly in maps, and then leverage RCU to ensure they can be safely accessed with low overhead. Finally, overloading fields like this is error prone. Someone that wants to use 'rcu_users' could easily overlook the fact that once the rcu callback is scheduled, the refcount will go back to being nonzero, thus precluding the use of refcount_inc_not_zero(). Furthermore, as described below, it's possible to extract the fields of the union without changing the size of task_struct. There are several possible ways to enable this: 1. The lightest touch approach is likely the one proposed in this patch, which is to simply extract 'rcu_users' and 'rcu' from the union, so that scheduling the 'rcu' callback doesn't overwrite the 'rcu_users' refcount. If we have a trusted task pointer, this would allow us to use refcnt_inc_not_zero() inside of an RCU region to determine if we can safely acquire a reference to the task and store it in a map. As mentioned below, this can be done without changing the size of task_struct, by moving the location of the union to another location that has padding gaps we can fill in. 2. Removing 'refcount_t rcu_users', and instead having the entire task be freed in an rcu callback. This is likely the most sound overall design, though it changes the behavioral semantics exposed to callers, who currently expect that a task that's successfully looked up in e.g. the pid_list with find_task_by_pid_ns(), can always have a 'usage' reference acquired on them, as it's guaranteed to be > 0 until after the next gp. In order for this approach to work, we'd have to audit all callers. This approach also slightly changes behavior observed by user space by not invoking trace_sched_process_free() until the whole task_struct is actually being freed, rather than just after it's exited. It also may change timings, as memory will be freed in an RCU callback rather than immediately when the final 'usage' refcount drops to 0. This also is arguably a benefit, as it provides more predictable performance to callers who are refcounting tasks. 3. There may be other solutions as well that don't require changing the layout of task_struct. For example, we could possibly do something complex from the BPF side, such as listen for task exit and remove a task from a map when the task is exiting. This would likely require significant custom handling for task_struct in the verifier, so a more generalizable solution is likely warranted. As mentioned above, this patch proposes the lightest-touch approach which allows callers elsewhere in the kernel to use 'rcu_users' to ensure the lifetime of a task, by extracting 'rcu_users' and 'rcu' from the union. There is no size change in task_struct with this patch. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: David Vernet <void@manifault.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20230215233033.889644-1-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-13Merge drm/drm-fixes into drm-misc-fixesThomas Zimmermann
Backmerging to get latest upstream. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2023-03-13PCI: s390: Fix use-after-free of PCI resources with per-function hotplugNiklas Schnelle
On s390 PCI functions may be hotplugged individually even when they belong to a multi-function device. In particular on an SR-IOV device VFs may be removed and later re-added. In commit a50297cf8235 ("s390/pci: separate zbus creation from scanning") it was missed however that struct pci_bus and struct zpci_bus's resource list retained a reference to the PCI functions MMIO resources even though those resources are released and freed on hot-unplug. These stale resources may subsequently be claimed when the PCI function re-appears resulting in use-after-free. One idea of fixing this use-after-free in s390 specific code that was investigated was to simply keep resources around from the moment a PCI function first appeared until the whole virtual PCI bus created for a multi-function device disappears. The problem with this however is that due to the requirement of artificial MMIO addreesses (address cookies) extra logic is then needed to keep the address cookies compatible on re-plug. At the same time the MMIO resources semantically belong to the PCI function so tying their lifecycle to the function seems more logical. Instead a simpler approach is to remove the resources of an individually hot-unplugged PCI function from the PCI bus's resource list while keeping the resources of other PCI functions on the PCI bus untouched. This is done by introducing pci_bus_remove_resource() to remove an individual resource. Similarly the resource also needs to be removed from the struct zpci_bus's resource list. It turns out however, that there is really no need to add the MMIO resources to the struct zpci_bus's resource list at all and instead we can simply use the zpci_bar_struct's resource pointer directly. Fixes: a50297cf8235 ("s390/pci: separate zbus creation from scanning") Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://lore.kernel.org/r/20230306151014.60913-2-schnelle@linux.ibm.com Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-03-12Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Bug fixes and regressions for ext4, the most serious of which is a potential deadlock during directory renames that was introduced during the merge window discovered by a combination of syzbot and lockdep" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: zero i_disksize when initializing the bootloader inode ext4: make sure fs error flag setted before clear journal error ext4: commit super block if fs record error when journal record without error ext4, jbd2: add an optimized bmap for the journal inode ext4: fix WARNING in ext4_update_inline_data ext4: move where set the MAY_INLINE_DATA flag is set ext4: Fix deadlock during directory rename ext4: Fix comment about the 64BIT feature docs: ext4: modify the group desc size to 64 ext4: fix another off-by-one fsmap error on 1k block filesystems ext4: fix RENAME_WHITEOUT handling for inline directories ext4: make kobj_type structures constant ext4: fix cgroup writeback accounting with fs-layer encryption
2023-03-12cpumask: relax sanity checking constraintsLinus Torvalds
The cpumask_check() was unnecessarily tight, and causes problems for the users of cpumask_next(). We have a number of users that take the previous return value of one of the bit scanning functions and subtract one to keep it in "range". But since the scanning functions end up returning up to 'small_cpumask_bits' instead of the tighter 'nr_cpumask_bits', the range really needs to be using that widened form. [ This "previous-1" behavior is also the reason we have all those comments about /* -1 is a legal arg here. */ and separate checks for that being ok. So we could have just made "small_cpumask_bits-1" be a similar special "don't check this" value. Tetsuo Handa even suggested a patch that only does that for cpumask_next(), since that seems to be the only actual case that triggers, but that all makes it even _more_ magical and special. So just relax the check ] One example of this kind of pattern being the 'c_start()' function in arch/x86/kernel/cpu/proc.c, but also duplicated in various forms on other architectures. Reported-by: syzbot+96cae094d90877641f32@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=96cae094d90877641f32 Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Link: https://lore.kernel.org/lkml/c1f4cc16-feea-b83c-82cf-1a1f007b7eb9@I-love.SAKURA.ne.jp/ Fixes: 596ff4a09b89 ("cpumask: re-introduce constant-sized cpumask optimizations") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-11Merge tag 'i2c-for-6.3-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c updates from Wolfram Sang: "This marks the end of a transition to let I2C have the same probe semantics as other subsystems. Uwe took care that no drivers in the current tree nor in -next use the deprecated .probe call. So, it is a good time to switch to the new, standard semantics now. There is also a regression fix: - regression fix for the notifier handling of the I2C core - final coversions of drivers away from deprecated .probe - make .probe_new the standard probe and convert I2C core to use it * tag 'i2c-for-6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: dev: Fix bus callback return values i2c: Convert drivers to new .probe() callback i2c: mux: Convert all drivers to new .probe() callback i2c: Switch .probe() to not take an id parameter media: i2c: ov2685: convert to i2c's .probe_new() media: i2c: ov5695: convert to i2c's .probe_new() w1: ds2482: Convert to i2c's .probe_new() serial: sc16is7xx: Convert to i2c's .probe_new() mtd: maps: pismo: Convert to i2c's .probe_new() misc: ad525x_dpot-i2c: Convert to i2c's .probe_new()
2023-03-11KVM: arm64: timers: Convert per-vcpu virtual offset to a global valueMarc Zyngier
Having a per-vcpu virtual offset is a pain. It needs to be synchronized on each update, and expands badly to a setup where different timers can have different offsets, or have composite offsets (as with NV). So let's start by replacing the use of the CNTVOFF_EL2 shadow register (which we want to reclaim for NV anyway), and make the virtual timer carry a pointer to a VM-wide offset. This simplifies the code significantly. It also addresses two terrible bugs: - The use of CNTVOFF_EL2 leads to some nice offset corruption when the sysreg gets reset, as reported by Joey. - The kvm mutex is taken from a vcpu ioctl, which goes against the locking rules... Reported-by: Joey Gouly <joey.gouly@arm.com> Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230224173915.GA17407@e124191.cambridge.arm.com Tested-by: Joey Gouly <joey.gouly@arm.com> Link: https://lore.kernel.org/r/20230224191640.3396734-1-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-03-11ext4, jbd2: add an optimized bmap for the journal inodeTheodore Ts'o
The generic bmap() function exported by the VFS takes locks and does checks that are not necessary for the journal inode. So allow the file system to set a journal-optimized bmap function in journal->j_bmap. Reported-by: syzbot+9543479984ae9e576000@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=e4aaa78795e490421c79f76ec3679006c8ff4cf0 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-10xdp: add xdp_set_features_flag utility routineLorenzo Bianconi
Introduce xdp_set_features_flag utility routine in order to update dynamically xdp_features according to the dynamic hw configuration via ethtool (e.g. changing number of hw rx/tx queues). Add xdp_clear_features_flag() in order to clear all xdp_feature flag. Reviewed-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-10Merge tag 'pull-highmem' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull put_and_unmap_page() helper from Al Viro: "kmap_local_page() conversions in local filesystems keep running into kunmap_local_page()+put_page() combinations. We can keep inventing names for identical inline helpers, but it's getting rather inconvenient. I've added a trivial helper to linux/highmem.h instead. I would've held that back until the merge window, if not for the mess it causes in tree topology - I've several branches merging from that one, and it's only going to get worse if e.g. ext2 stuff gets picked by Jan" * tag 'pull-highmem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: new helper: put_and_unmap_page()
2023-03-10Merge tag 'wireless-next-2023-03-10' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== wireless-next patches for 6.4 Major changes: cfg80211 * 6 GHz improvements * HW timestamping support * support for randomized auth/deauth TA for PASN privacy (also for mac80211) mac80211 * radiotap TLV and EHT support for the iwlwifi sniffer * HW timestamping support * per-link debugfs for multi-link brcmfmac * support for Apple (M1 Pro/Max) devices iwlwifi * support for a few new devices * EHT sniffer support rtw88 * better support for some SDIO devices (e.g. MAC address from efuse) rtw89 * HW scan support for 8852b * better support for 6 GHz scanning * tag 'wireless-next-2023-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (84 commits) wifi: iwlwifi: mvm: fix EOF bit reporting wifi: iwlwifi: Do not include radiotap EHT user info if not needed wifi: iwlwifi: mvm: add EHT RU allocation to radiotap wifi: iwlwifi: Update logs for yoyo reset sw changes wifi: iwlwifi: mvm: clean up duplicated defines wifi: iwlwifi: rs-fw: break out for unsupported bandwidth wifi: iwlwifi: Add support for B step of BnJ-Fm4 wifi: iwlwifi: mvm: make flush code a bit clearer wifi: iwlwifi: mvm: avoid UB shift of snif_queue wifi: iwlwifi: mvm: add primary 80 known for EHT radiotap wifi: iwlwifi: mvm: parse FW frame metadata for EHT sniffer mode wifi: iwlwifi: mvm: decode USIG_B1_B7 RU to nl80211 RU width wifi: iwlwifi: mvm: rename define to generic name wifi: iwlwifi: mvm: allow Microsoft to use TAS wifi: iwlwifi: mvm: add all EHT based on data0 info from HW wifi: iwlwifi: mvm: add EHT radiotap info based on rate_n_flags wifi: iwlwifi: mvm: add an helper function radiotap TLVs wifi: radiotap: separate vendor TLV into header/content wifi: iwlwifi: reduce verbosity of some logging events wifi: iwlwifi: Adding the code to get RF name for MsP device ... ==================== Link: https://lore.kernel.org/r/20230310120159.36518-1-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-10bpf: Support __kptr to local kptrsDave Marchevsky
If a PTR_TO_BTF_ID type comes from program BTF - not vmlinux or module BTF - it must have been allocated by bpf_obj_new and therefore must be free'd with bpf_obj_drop. Such a PTR_TO_BTF_ID is considered a "local kptr" and is tagged with MEM_ALLOC type tag by bpf_obj_new. This patch adds support for treating __kptr-tagged pointers to "local kptrs" as having an implicit bpf_obj_drop destructor for referenced kptr acquire / release semantics. Consider the following example: struct node_data { long key; long data; struct bpf_rb_node node; }; struct map_value { struct node_data __kptr *node; }; struct { __uint(type, BPF_MAP_TYPE_ARRAY); __type(key, int); __type(value, struct map_value); __uint(max_entries, 1); } some_nodes SEC(".maps"); If struct node_data had a matching definition in kernel BTF, the verifier would expect a destructor for the type to be registered. Since struct node_data does not match any type in kernel BTF, the verifier knows that there is no kfunc that provides a PTR_TO_BTF_ID to this type, and that such a PTR_TO_BTF_ID can only come from bpf_obj_new. So instead of searching for a registered dtor, a bpf_obj_drop dtor can be assumed. This allows the runtime to properly destruct such kptrs in bpf_obj_free_fields, which enables maps to clean up map_vals w/ such kptrs when going away. Implementation notes: * "kernel_btf" variable is renamed to "kptr_btf" in btf_parse_kptr. Before this patch, the variable would only ever point to vmlinux or module BTFs, but now it can point to some program BTF for local kptr type. It's later used to populate the (btf, btf_id) pair in kptr btf field. * It's necessary to btf_get the program BTF when populating btf_field for local kptr. btf_record_free later does a btf_put. * Behavior for non-local referenced kptrs is not modified, as bpf_find_btf_id helper only searches vmlinux and module BTFs for matching BTF type. If such a type is found, btf_field_kptr's btf will pass btf_is_kernel check, and the associated release function is some one-argument dtor. If btf_is_kernel check fails, associated release function is two-arg bpf_obj_drop_impl. Before this patch only btf_field_kptr's w/ kernel or module BTFs were created. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230310230743.2320707-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10bpf: Change btf_record_find enum parameter to field_maskDave Marchevsky
btf_record_find's 3rd parameter can be multiple enum btf_field_type's masked together. The function is called with BPF_KPTR in two places in verifier.c, so it works with masked values already. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230309180111.1618459-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10bpf: Add bpf_selem_free()Martin KaFai Lau
This patch refactors the selem freeing logic into bpf_selem_free(). It is a preparation work for a later patch using bpf_mem_cache_alloc/free. The other kfree(selem) cases are also changed to bpf_selem_free(..., reuse_now = true). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-10-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10bpf: Repurpose use_trace_rcu to reuse_now in bpf_local_storageMartin KaFai Lau
This patch re-purpose the use_trace_rcu to mean if the freed memory can be reused immediately or not. The use_trace_rcu is renamed to reuse_now. Other than the boolean test is reversed, it should be a no-op. The following explains the reason for the rename and how it will be used in a later patch. In a later patch, bpf_mem_cache_alloc/free will be used in the bpf_local_storage. The bpf mem allocator will reuse the freed memory immediately. Some of the free paths in bpf_local_storage does not support memory to be reused immediately. These paths are the "delete" elem cases from the bpf_*_storage_delete() helper and the map_delete_elem() syscall. Note that "delete" elem before the owner's (sk/task/cgrp/inode) lifetime ended is not the common usage for the local storage. The common free path, bpf_local_storage_destroy(), can reuse the memory immediately. This common path means the storage stays with its owner until the owner is destroyed. The above mentioned "delete" elem paths that cannot reuse immediately always has the 'use_trace_rcu == true'. The cases that is safe for immediate reuse always have 'use_trace_rcu == false'. Instead of adding another arg in a later patch, this patch re-purpose this arg to reuse_now and have the test logic reversed. In a later patch, 'reuse_now == true' will free to the bpf_mem_cache_free() where the memory can be reused immediately. 'reuse_now == false' will go through the call_rcu_tasks_trace(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-7-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10bpf: Remember smap in bpf_local_storageMartin KaFai Lau
This patch remembers which smap triggers the allocation of a 'struct bpf_local_storage' object. The local_storage is allocated during the very first selem added to the owner. The smap pointer is needed when using the bpf_mem_cache_free in a later patch because it needs to free to the correct smap's bpf_mem_alloc object. When a selem is being removed, it needs to check if it is the selem that triggers the creation of the local_storage. If it is, the local_storage->smap pointer will be reset to NULL. This NULL reset is done under the local_storage->lock in bpf_selem_unlink_storage_nolock() when a selem is being removed. Also note that the local_storage may not go away even local_storage->smap is NULL because there may be other selem still stored in the local_storage. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-6-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10bpf: Refactor codes into bpf_local_storage_destroyMartin KaFai Lau
This patch first renames bpf_local_storage_unlink_nolock to bpf_local_storage_destroy(). It better reflects that it is only used when the storage's owner (sk/task/cgrp/inode) is being kfree(). All bpf_local_storage_destroy's caller is taking the spin lock and then free the storage. This patch also moves these two steps into the bpf_local_storage_destroy. This is a preparation work for a later patch that uses bpf_mem_cache_alloc/free in the bpf_local_storage. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-3-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10bpf: Move a few bpf_local_storage functions to static scopeMartin KaFai Lau
This patch moves the bpf_local_storage_free_rcu() and bpf_selem_unlink_map() to static because they are not used outside of bpf_local_storage.c. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10f2fs: Fix f2fs_truncate_partial_nodes ftrace eventDouglas Raillard
Fix the nid_t field so that its size is correctly reported in the text format embedded in trace.dat files. As it stands, it is reported as being of size 4: field:nid_t nid[3]; offset:24; size:4; signed:0; Instead of 12: field:nid_t nid[3]; offset:24; size:12; signed:0; This also fixes the reported offset of subsequent fields so that they match with the actual struct layout. Signed-off-by: Douglas Raillard <douglas.raillard@arm.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-03-10Merge tag 'for-6.3-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "First batch of fixes. Among them there are two updates to sysfs and ioctl which are not strictly fixes but are used for testing so there's no reason to delay them. - fix block group item corruption after inserting new block group - fix extent map logging bit not cleared for split maps after dropping range - fix calculation of unusable block group space reporting bogus values due to 32/64b division - fix unnecessary increment of read error stat on write error - improve error handling in inode update - export per-device fsid in DEV_INFO ioctl to distinguish seeding devices, needed for testing - allocator size classes: - fix potential dead lock in size class loading logic - print sysfs stats for the allocation classes" * tag 'for-6.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix block group item corruption after inserting new block group btrfs: fix extent map logging bit not cleared for split maps after dropping range btrfs: fix percent calculation for bg reclaim message btrfs: fix unnecessary increment of read error stat on write error btrfs: handle btrfs_del_item errors in __btrfs_update_delayed_inode btrfs: ioctl: return device fsid from DEV_INFO ioctl btrfs: fix potential dead lock in size class loading logic btrfs: sysfs: add size class stats
2023-03-10bpf: ensure state checkpointing at iter_next() call sitesAndrii Nakryiko
State equivalence check and checkpointing performed in is_state_visited() employs certain heuristics to try to save memory by avoiding state checkpoints if not enough jumps and instructions happened since last checkpoint. This leads to unpredictability of whether a particular instruction will be checkpointed and how regularly. While normally this is not causing much problems (except inconveniences for predictable verifier tests, which we overcome with BPF_F_TEST_STATE_FREQ flag), turns out it's not the case for open-coded iterators. Checking and saving state checkpoints at iter_next() call is crucial for fast convergence of open-coded iterator loop logic, so we need to force it. If we don't do that, is_state_visited() might skip saving a checkpoint, causing unnecessarily long sequence of not checkpointed instructions and jumps, leading to exhaustion of jump history buffer, and potentially other undesired outcomes. It is expected that with correct open-coded iterators convergence will happen quickly, so we don't run a risk of exhausting memory. This patch adds, in addition to prune and jump instruction marks, also a "forced checkpoint" mark, and makes sure that any iter_next() call instruction is marked as such. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230310060149.625887-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10fbdev: Fix incorrect page mapping clearance at fb_deferred_io_release()Takashi Iwai
The recent fix for the deferred I/O by the commit 3efc61d95259 ("fbdev: Fix invalid page access after closing deferred I/O devices") caused a regression when the same fb device is opened/closed while it's being used. It resulted in a frozen screen even if something is redrawn there after the close. The breakage is because the patch was made under a wrong assumption of a single open; in the current code, fb_deferred_io_release() cleans up the page mapping of the pageref list and it calls cancel_delayed_work_sync() unconditionally, where both are no correct behavior for multiple opens. This patch adds a refcount for the opens of the device, and applies the cleanup only when all files get closed. As both fb_deferred_io_open() and _close() are called always in the fb_info lock (mutex), it's safe to use the normal int for the refcounting. Also, a useless BUG_ON() is dropped. Fixes: 3efc61d95259 ("fbdev: Fix invalid page access after closing deferred I/O devices") Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Reviewed-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20230308105012.1845-1-tiwai@suse.de
2023-03-09Merge branch 'main' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== Netfilter updates for net-next 1. nf_tables 'brouting' support, from Sriram Yagnaraman. 2. Update bridge netfilter and ovs conntrack helpers to handle IPv6 Jumbo packets properly, i.e. fetch the packet length from hop-by-hop extension header, from Xin Long. This comes with a test BIG TCP test case, added to tools/testing/selftests/net/. 3. Fix spelling and indentation in conntrack, from Jeremy Sowden. * 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nat: fix indentation of function arguments netfilter: conntrack: fix typo selftests: add a selftest for big tcp netfilter: use nf_ip6_check_hbh_len in nf_ct_skb_network_trim netfilter: move br_nf_check_hbh_len to utils netfilter: bridge: move pskb_trim_rcsum out of br_nf_check_hbh_len netfilter: bridge: check len before accessing more nh data netfilter: bridge: call pskb_may_pull in br_nf_check_hbh_len netfilter: bridge: introduce broute meta statement ==================== Link: https://lore.kernel.org/r/20230308193033.13965-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09neighbour: delete neigh_lookup_nodev as not usedLeon Romanovsky
neigh_lookup_nodev isn't used in the kernel after removal of DECnet. So let's remove it. Fixes: 1202cdd66531 ("Remove DECnet support from kernel") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/eb5656200d7964b2d177a36b77efa3c597d6d72d.1678267343.git.leonro@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09net: sched: remove qdisc_watchdog->last_expiresEric Dumazet
This field mirrors hrtimer softexpires, we can instead use the existing helpers. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230308182648.1150762-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09netlink: remove unused 'compare' functionFlorian Westphal
No users in the tree. Tested with allmodconfig build. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230308142006.20879-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Documentation/bpf/bpf_devel_QA.rst b7abcd9c656b ("bpf, doc: Link to submitting-patches.rst for general patch submission info") d56b0c461d19 ("bpf, docs: Fix link to netdev-FAQ target") https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09scsi: core: Add BLIST_NO_VPD_SIZE for some VDASDLee Duncan
Some storage, such as AIX VDASD (virtual storage) and IBM 2076 (front end), fail as a result of commit c92a6b5d6335 ("scsi: core: Query VPD size before getting full page"). That commit changed getting SCSI VPD pages so that we now read just enough of the page to get the actual page size, then read the whole page in a second read. The problem is that the above mentioned hardware returns zero for the page size, because of a firmware error. In such cases, until the firmware is fixed, this new blacklist flag says to revert to the original method of reading the VPD pages, i.e. try to read a whole buffer's worth on the first try. [mkp: reworked somewhat] Fixes: c92a6b5d6335 ("scsi: core: Query VPD size before getting full page") Reported-by: Martin Wilck <mwilck@suse.com> Suggested-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Lee Duncan <lduncan@suse.com> Link: https://lore.kernel.org/r/20220928181350.9948-1-leeman.duncan@gmail.com Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-03-09clk: Avoid invalid function names in CLK_OF_DECLARE()Nathan Chancellor
After commit c28cd1f3433c ("clk: Mark a fwnode as initialized when using CLK_OF_DECLARE() macro"), drivers/clk/mvebu/kirkwood.c fails to build: drivers/clk/mvebu/kirkwood.c:358:1: error: expected identifier or '(' CLK_OF_DECLARE(98dx1135_clk, "marvell,mv98dx1135-core-clock", ^ include/linux/clk-provider.h:1367:21: note: expanded from macro 'CLK_OF_DECLARE' static void __init name##_of_clk_init_declare(struct device_node *np) \ ^ <scratch space>:124:1: note: expanded from here 98dx1135_clk_of_clk_init_declare ^ drivers/clk/mvebu/kirkwood.c:358:1: error: invalid digit 'd' in decimal constant include/linux/clk-provider.h:1372:34: note: expanded from macro 'CLK_OF_DECLARE' OF_DECLARE_1(clk, name, compat, name##_of_clk_init_declare) ^ <scratch space>:125:3: note: expanded from here 98dx1135_clk_of_clk_init_declare ^ drivers/clk/mvebu/kirkwood.c:358:1: error: invalid digit 'd' in decimal constant include/linux/clk-provider.h:1372:34: note: expanded from macro 'CLK_OF_DECLARE' OF_DECLARE_1(clk, name, compat, name##_of_clk_init_declare) ^ <scratch space>:125:3: note: expanded from here 98dx1135_clk_of_clk_init_declare ^ drivers/clk/mvebu/kirkwood.c:358:1: error: invalid digit 'd' in decimal constant include/linux/clk-provider.h:1372:34: note: expanded from macro 'CLK_OF_DECLARE' OF_DECLARE_1(clk, name, compat, name##_of_clk_init_declare) ^ <scratch space>:125:3: note: expanded from here 98dx1135_clk_of_clk_init_declare ^ C function names must start with either an alphabetic letter or an underscore. To avoid generating invalid function names from clock names, add two underscores to the beginning of the identifier. Fixes: c28cd1f3433c ("clk: Mark a fwnode as initialized when using CLK_OF_DECLARE() macro") Suggested-by: Saravana Kannan <saravanak@google.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20230308-clk_of_declare-fix-v1-1-317b741e2532@kernel.org Reviewed-by: Saravana Kannan <saravanak@google.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2023-03-09i2c: Switch .probe() to not take an id parameterUwe Kleine-König
Commit b8a1a4cd5a98 ("i2c: Provide a temporary .probe_new() call-back type") introduced a new probe callback to convert i2c init routines to not take an i2c_device_id parameter. Now that all in-tree drivers are converted to the temporary .probe_new() callback, .probe() can be modified to match the desired prototype. Now that .probe() and .probe_new() have the same semantic, they can be defined as members of an anonymous union to save some memory and simplify the core code a bit. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Wolfram Sang <wsa@kernel.org>
2023-03-09bpf: Fix a typo for BPF_F_ANY_ALIGNMENT in bpf.hMichael Weiß
Fix s/BPF_PROF_LOAD/BPF_PROG_LOAD/ typo in the documentation comment for BPF_F_ANY_ALIGNMENT in bpf.h. Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230309133823.944097-1-michael.weiss@aisec.fraunhofer.de
2023-03-09Merge tag 'net-6.3-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter and bpf. Current release - regressions: - core: avoid skb end_offset change in __skb_unclone_keeptruesize() - sched: - act_connmark: handle errno on tcf_idr_check_alloc - flower: fix fl_change() error recovery path - ieee802154: prevent user from crashing the host Current release - new code bugs: - eth: bnxt_en: fix the double free during device removal - tools: ynl: - fix enum-as-flags in the generic CLI - fully inherit attrs in subsets - re-license uniformly under GPL-2.0 or BSD-3-clause Previous releases - regressions: - core: use indirect calls helpers for sk_exit_memory_pressure() - tls: - fix return value for async crypto - avoid hanging tasks on the tx_lock - eth: ice: copy last block omitted in ice_get_module_eeprom() Previous releases - always broken: - core: avoid double iput when sock_alloc_file fails - af_unix: fix struct pid leaks in OOB support - tls: - fix possible race condition - fix device-offloaded sendpage straddling records - bpf: - sockmap: fix an infinite loop error - test_run: fix &xdp_frame misplacement for LIVE_FRAMES - fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR - netfilter: tproxy: fix deadlock due to missing BH disable - phylib: get rid of unnecessary locking - eth: bgmac: fix *initial* chip reset to support BCM5358 - eth: nfp: fix csum for ipsec offload - eth: mtk_eth_soc: fix RX data corruption issue Misc: - usb: qmi_wwan: add telit 0x1080 composition" * tag 'net-6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits) tools: ynl: fix enum-as-flags in the generic CLI tools: ynl: move the enum classes to shared code net: avoid double iput when sock_alloc_file fails af_unix: fix struct pid leaks in OOB support eth: fealnx: bring back this old driver net: dsa: mt7530: permit port 5 to work without port 6 on MT7621 SoC net: microchip: sparx5: fix deletion of existing DSCP mappings octeontx2-af: Unlock contexts in the queue context cache in case of fault detection net/smc: fix fallback failed while sendmsg with fastopen ynl: re-license uniformly under GPL-2.0 OR BSD-3-Clause mailmap: update entries for Stephen Hemminger mailmap: add entry for Maxim Mikityanskiy nfc: change order inside nfc_se_io error path ethernet: ice: avoid gcc-9 integer overflow warning ice: don't ignore return codes in VSI related code ice: Fix DSCP PFC TLV creation net: usb: qmi_wwan: add Telit 0x1080 composition net: usb: cdc_mbim: avoid altsetting toggling for Telit FE990 netfilter: conntrack: adopt safer max chain length net: tls: fix device-offloaded sendpage straddling records ...
2023-03-09Merge tag 'for-linus-2023030901' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID fixes from Benjamin Tissoires: - fix potential out of bound write of zeroes in HID core with a specially crafted uhid device (Lee Jones) - fix potential use-after-free in work function in intel-ish-hid (Reka Norman) - selftests config fixes (Benjamin Tissoires) - few device small fixes and support * tag 'for-linus-2023030901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: HID: intel-ish-hid: ipc: Fix potential use-after-free in work function HID: logitech-hidpp: Add support for Logitech MX Master 3S mouse HID: cp2112: Fix driver not registering GPIO IRQ chip as threaded selftest: hid: fix hid_bpf not set in config HID: uhid: Over-ride the default maximum data buffer value with our own HID: core: Provide new max_buffer_size attribute to over-ride the default
2023-03-09sctp: add weighted fair queueing stream schedulerXin Long
As it says in rfc8260#section-3.6 about the weighted fair queueing scheduler: A Weighted Fair Queueing scheduler between the streams is used. The weight is configurable per outgoing SCTP stream. This scheduler considers the lengths of the messages of each stream and schedules them in a specific way to use the capacity according to the given weights. If the weight of stream S1 is n times the weight of stream S2, the scheduler should assign to stream S1 n times the capacity it assigns to stream S2. The details are implementation dependent. Interleaving user messages allows for a better realization of the capacity usage according to the given weights. This patch adds Weighted Fair Queueing Scheduler actually based on the code of Fair Capacity Scheduler by adding fc_weight into struct sctp_stream_out_ext and taking it into account when sorting stream-> fc_list in sctp_sched_fc_sched() and sctp_sched_fc_dequeue_done(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-09sctp: add fair capacity stream schedulerXin Long
As it says in rfc8260#section-3.5 about the fair capacity scheduler: A fair capacity distribution between the streams is used. This scheduler considers the lengths of the messages of each stream and schedules them in a specific way to maintain an equal capacity for all streams. The details are implementation dependent. interleaving user messages allows for a better realization of the fair capacity usage. This patch adds Fair Capacity Scheduler based on the foundations added by commit 5bbbbe32a431 ("sctp: introduce stream scheduler foundations"): A fc_list and a fc_length are added into struct sctp_stream_out_ext and a fc_list is added into struct sctp_stream. In .enqueue, when there are chunks enqueued into a stream, this stream will be linked into stream-> fc_list by its fc_list ordered by its fc_length. In .dequeue, it always picks up the 1st skb from stream->fc_list. In .dequeue_done, fc_length is increased by chunk's len and update its location in stream->fc_list according to the its new fc_length. Note that when the new fc_length overflows in .dequeue_done, instead of resetting all fc_lengths to 0, we only reduced them by U32_MAX / 4 to avoid a moment of imbalance in the scheduling, as Marcelo suggested. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-08bpf: implement numbers iteratorAndrii Nakryiko
Implement the first open-coded iterator type over a range of integers. It's public API consists of: - bpf_iter_num_new() constructor, which accepts [start, end) range (that is, start is inclusive, end is exclusive). - bpf_iter_num_next() which will keep returning read-only pointer to int until the range is exhausted, at which point NULL will be returned. If bpf_iter_num_next() is kept calling after this, NULL will be persistently returned. - bpf_iter_num_destroy() destructor, which needs to be called at some point to clean up iterator state. BPF verifier enforces that iterator destructor is called at some point before BPF program exits. Note that `start = end = X` is a valid combination to setup an empty iterator. bpf_iter_num_new() will return 0 (success) for any such combination. If bpf_iter_num_new() detects invalid combination of input arguments, it returns error, resets iterator state to, effectively, empty iterator, so any subsequent call to bpf_iter_num_next() will keep returning NULL. BPF verifier has no knowledge that returned integers are in the [start, end) value range, as both `start` and `end` are not statically known and enforced: they are runtime values. While the implementation is pretty trivial, some care needs to be taken to avoid overflows and underflows. Subsequent selftests will validate correctness of [start, end) semantics, especially around extremes (INT_MIN and INT_MAX). Similarly to bpf_loop(), we enforce that no more than BPF_MAX_LOOPS can be specified. bpf_iter_num_{new,next,destroy}() is a logical evolution from bounded BPF loops and bpf_loop() helper and is the basis for implementing ergonomic BPF loops with no statically known or verified bounds. Subsequent patches implement bpf_for() macro, demonstrating how this can be wrapped into something that works and feels like a normal for() loop in C language. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230308184121.1165081-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08bpf: add support for open-coded iterator loopsAndrii Nakryiko
Teach verifier about the concept of the open-coded (or inline) iterators. This patch adds generic iterator loop verification logic, new STACK_ITER stack slot type to contain iterator state, and necessary kfunc plumbing for iterator's constructor, destructor and next methods. Next patch implements first specific iterator (numbers iterator for implementing for() loop logic). Such split allows to have more focused commits for verifier logic and separate commit that we could point later to demonstrating what does it take to add a new kind of iterator. Each kind of iterator has its own associated struct bpf_iter_<type>, where <type> denotes a specific type of iterator. struct bpf_iter_<type> state is supposed to live on BPF program stack, so there will be no way to change its size later on without breaking backwards compatibility, so choose wisely! But given this struct is specific to a given <type> of iterator, this allows a lot of flexibility: simple iterators could be fine with just one stack slot (8 bytes), like numbers iterator in the next patch, while some other more complicated iterators might need way more to keep their iterator state. Either way, such design allows to avoid runtime memory allocations, which otherwise would be necessary if we fixed on-the-stack size and it turned out to be too small for a given iterator implementation. The way BPF verifier logic is implemented, there are no artificial restrictions on a number of active iterators, it should work correctly using multiple active iterators at the same time. This also means you can have multiple nested iteration loops. struct bpf_iter_<type> reference can be safely passed to subprograms as well. General flow is easiest to demonstrate with a simple example using number iterator implemented in next patch. Here's the simplest possible loop: struct bpf_iter_num it; int *v; bpf_iter_num_new(&it, 2, 5); while ((v = bpf_iter_num_next(&it))) { bpf_printk("X = %d", *v); } bpf_iter_num_destroy(&it); Above snippet should output "X = 2", "X = 3", "X = 4". Note that 5 is exclusive and is not returned. This matches similar APIs (e.g., slices in Go or Rust) that implement a range of elements, where end index is non-inclusive. In the above example, we see a trio of function: - constructor, bpf_iter_num_new(), which initializes iterator state (struct bpf_iter_num it) on the stack. If any of the input arguments are invalid, constructor should make sure to still initialize it such that subsequent bpf_iter_num_next() calls will return NULL. I.e., on error, return error and construct empty iterator. - next method, bpf_iter_num_next(), which accepts pointer to iterator state and produces an element. Next method should always return a pointer. The contract between BPF verifier is that next method will always eventually return NULL when elements are exhausted. Once NULL is returned, subsequent next calls should keep returning NULL. In the case of numbers iterator, bpf_iter_num_next() returns a pointer to an int (storage for this integer is inside the iterator state itself), which can be dereferenced after corresponding NULL check. - once done with the iterator, it's mandated that user cleans up its state with the call to destructor, bpf_iter_num_destroy() in this case. Destructor frees up any resources and marks stack space used by struct bpf_iter_num as usable for something else. Any other iterator implementation will have to implement at least these three methods. It is enforced that for any given type of iterator only applicable constructor/destructor/next are callable. I.e., verifier ensures you can't pass number iterator state into, say, cgroup iterator's next method. It is important to keep the naming pattern consistent to be able to create generic macros to help with BPF iter usability. E.g., one of the follow up patches adds generic bpf_for_each() macro to bpf_misc.h in selftests, which allows to utilize iterator "trio" nicely without having to code the above somewhat tedious loop explicitly every time. This is enforced at kfunc registration point by one of the previous patches in this series. At the implementation level, iterator state tracking for verification purposes is very similar to dynptr. We add STACK_ITER stack slot type, reserve necessary number of slots, depending on sizeof(struct bpf_iter_<type>), and keep track of necessary extra state in the "main" slot, which is marked with non-zero ref_obj_id. Other slots are also marked as STACK_ITER, but have zero ref_obj_id. This is simpler than having a separate "is_first_slot" flag. Another big distinction is that STACK_ITER is *always refcounted*, which simplifies implementation without sacrificing usability. So no need for extra "iter_id", no need to anticipate reuse of STACK_ITER slots for new constructors, etc. Keeping it simple here. As far as the verification logic goes, there are two extensive comments: in process_iter_next_call() and iter_active_depths_differ() explaining some important and sometimes subtle aspects. Please refer to them for details. But from 10,000-foot point of view, next methods are the points of forking a verification state, which are conceptually similar to what verifier is doing when validating conditional jump. We branch out at a `call bpf_iter_<type>_next` instruction and simulate two outcomes: NULL (iteration is done) and non-NULL (new element is returned). NULL is simulated first and is supposed to reach exit without looping. After that non-NULL case is validated and it either reaches exit (for trivial examples with no real loop), or reaches another `call bpf_iter_<type>_next` instruction with the state equivalent to already (partially) validated one. State equivalency at that point means we technically are going to be looping forever without "breaking out" out of established "state envelope" (i.e., subsequent iterations don't add any new knowledge or constraints to the verifier state, so running 1, 2, 10, or a million of them doesn't matter). But taking into account the contract stating that iterator next method *has to* return NULL eventually, we can conclude that loop body is safe and will eventually terminate. Given we validated logic outside of the loop (NULL case), and concluded that loop body is safe (though potentially looping many times), verifier can claim safety of the overall program logic. The rest of the patch is necessary plumbing for state tracking, marking, validation, and necessary further kfunc plumbing to allow implementing iterator constructor, destructor, and next methods. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230308184121.1165081-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08bpf: add iterator kfuncs registration and validation logicAndrii Nakryiko
Add ability to register kfuncs that implement BPF open-coded iterator contract and enforce naming and function proto convention. Enforcement happens at the time of kfunc registration and significantly simplifies the rest of iterators logic in the verifier. More details follow in subsequent patches, but we enforce the following conditions. All kfuncs (constructor, next, destructor) have to be named consistenly as bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents iterator type, and iterator state should be represented as a matching `struct bpf_iter_<type>` state type. Also, all iter kfuncs should have a pointer to this `struct bpf_iter_<type>` as the very first argument. Additionally: - Constructor, i.e., bpf_iter_<type>_new(), can have arbitrary extra number of arguments. Return type is not enforced either. - Next method, i.e., bpf_iter_<type>_next(), has to return a pointer type and should have exactly one argument: `struct bpf_iter_<type> *` (const/volatile/restrict and typedefs are ignored). - Destructor, i.e., bpf_iter_<type>_destroy(), should return void and should have exactly one argument, similar to the next method. - struct bpf_iter_<type> size is enforced to be positive and a multiple of 8 bytes (to fit stack slots correctly). Such strictness and consistency allows to build generic helpers abstracting important, but boilerplate, details to be able to use open-coded iterators effectively and ergonomically (see bpf_for_each() in subsequent patches). It also simplifies the verifier logic in some places. At the same time, this doesn't hurt generality of possible iterator implementations. Win-win. Constructor kfunc is marked with a new KF_ITER_NEW flags, next method is marked with KF_ITER_NEXT (and should also have KF_RET_NULL, of course), while destructor kfunc is marked as KF_ITER_DESTROY. Additionally, we add a trivial kfunc name validation: it should be a valid non-NULL and non-empty string. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230308184121.1165081-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Andrii Nakryiko says: ==================== pull-request: bpf-next 2023-03-08 We've added 23 non-merge commits during the last 2 day(s) which contain a total of 28 files changed, 414 insertions(+), 104 deletions(-). The main changes are: 1) Add more precise memory usage reporting for all BPF map types, from Yafang Shao. 2) Add ARM32 USDT support to libbpf, from Puranjay Mohan. 3) Fix BTF_ID_LIST size causing problems in !CONFIG_DEBUG_INFO_BTF, from Nathan Chancellor. 4) IMA selftests fix, from Roberto Sassu. 5) libbpf fix in APK support code, from Daniel Müller. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits) selftests/bpf: Fix IMA test libbpf: USDT arm arg parsing support libbpf: Refactor parse_usdt_arg() to re-use code libbpf: Fix theoretical u32 underflow in find_cd() function bpf: enforce all maps having memory usage callback bpf: offload map memory usage bpf, net: xskmap memory usage bpf, net: sock_map memory usage bpf, net: bpf_local_storage memory usage bpf: local_storage memory usage bpf: bpf_struct_ops memory usage bpf: queue_stack_maps memory usage bpf: devmap memory usage bpf: cpumap memory usage bpf: bloom_filter memory usage bpf: ringbuf memory usage bpf: reuseport_array memory usage bpf: stackmap memory usage bpf: arraymap memory usage bpf: hashtab memory usage ... ==================== Link: https://lore.kernel.org/r/20230308193533.1671597-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-08netfilter: move br_nf_check_hbh_len to utilsXin Long
Rename br_nf_check_hbh_len() to nf_ip6_check_hbh_len() and move it to netfilter utils, so that it can be used by other modules, like ovs and tc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08net: reclaim skb->scm_io_uring bitEric Dumazet
Commit 0091bfc81741 ("io_uring/af_unix: defer registered files gc to io_uring release") added one bit to struct sk_buff. This structure is critical for networking, and we try very hard to not add bloat on it, unless absolutely required. For instance, we can use a specific destructor as a wrapper around unix_destruct_scm(), to identify skbs that unix_gc() has to special case. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-08netfilter: bridge: introduce broute meta statementSriram Yagnaraman
nftables equivalent for ebtables -t broute. Implement broute meta statement to set br_netfilter_broute flag in skb to force a packet to be routed instead of being bridged. Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech> Signed-off-by: Florian Westphal <fw@strlen.de>