summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-08-23netfilter: nft_tproxy: restrict to prerouting hookFlorian Westphal
TPROXY is only allowed from prerouting, but nft_tproxy doesn't check this. This fixes a crash (null dereference) when using tproxy from e.g. output. Fixes: 4ed8eb6570a4 ("netfilter: nf_tables: Add native tproxy support") Reported-by: Shell Chen <xierch@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-08-23cgroup: Fix race condition at rebind_subsystems()Jing-Ting Wu
Root cause: The rebind_subsystems() is no lock held when move css object from A list to B list,then let B's head be treated as css node at list_for_each_entry_rcu(). Solution: Add grace period before invalidating the removed rstat_css_node. Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Suggested-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c011283c41.camel@mediatek.com/T/ Acked-by: Mukesh Ojha <quic_mojha@quicinc.com> Fixes: a7df69b81aac ("cgroup: rstat: support cgroup1") Cc: stable@vger.kernel.org # v5.13+ Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-23cpufreq: check only freq_table in __resolve_freq()Lukasz Luba
There is no need to check if the cpufreq driver implements callback cpufreq_driver::target_index. The logic in the __resolve_freq uses the frequency table available in the policy. It doesn't matter if the driver provides 'target_index' or 'target' callback. It just has to populate the 'policy->freq_table'. Thus, check only frequency table during the frequency resolving call. Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-08-23x86/cpu: Add new Raptor Lake CPU model numberTony Luck
Note1: Model 0xB7 already claimed the "no suffix" #define for a regular client part, so add (yet another) suffix "S" to distinguish this new part from the earlier one. Note2: the RAPTORLAKE* and ALDERLAKE* processors are very similar from a software enabling point of view. There are no known features that have model-specific enabling and also differ between the two. In other words, every single place that list *one* or more RAPTORLAKE* or ALDERLAKE* processors should list all of them. Note3: This is being merged before there is an in-tree user. Merging this provides an "anchor" so that the different folks can update their subsystems (like perf) in parallel to use this define and test it. [ dhansen: add a note about why this has no in-tree users yet ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220823174819.223941-1-tony.luck@intel.com
2022-08-23thermal/int340x_thermal: handle data_vault when the value is ZERO_SIZE_PTRLee, Chun-Yi
In some case, the GDDV returns a package with a buffer which has zero length. It causes that kmemdup() returns ZERO_SIZE_PTR (0x10). Then the data_vault_read() got NULL point dereference problem when accessing the 0x10 value in data_vault. [ 71.024560] BUG: kernel NULL pointer dereference, address: 0000000000000010 This patch uses ZERO_OR_NULL_PTR() for checking ZERO_SIZE_PTR or NULL value in data_vault. Signed-off-by: "Lee, Chun-Yi" <jlee@suse.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-08-23netfilter: conntrack: work around exceeded receive windowFlorian Westphal
When a TCP sends more bytes than allowed by the receive window, all future packets can be marked as invalid. This can clog up the conntrack table because of 5-day default timeout. Sequence of packets: 01 initiator > responder: [S], seq 171, win 5840, options [mss 1330,sackOK,TS val 63 ecr 0,nop,wscale 1] 02 responder > initiator: [S.], seq 33211, ack 172, win 65535, options [mss 1460,sackOK,TS val 010 ecr 63,nop,wscale 8] 03 initiator > responder: [.], ack 33212, win 2920, options [nop,nop,TS val 068 ecr 010], length 0 04 initiator > responder: [P.], seq 172:240, ack 33212, win 2920, options [nop,nop,TS val 279 ecr 010], length 68 Window is 5840 starting from 33212 -> 39052. 05 responder > initiator: [.], ack 240, win 256, options [nop,nop,TS val 872 ecr 279], length 0 06 responder > initiator: [.], seq 33212:34530, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 This is fine, conntrack will flag the connection as having outstanding data (UNACKED), which lowers the conntrack timeout to 300s. 07 responder > initiator: [.], seq 34530:35848, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 08 responder > initiator: [.], seq 35848:37166, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 09 responder > initiator: [.], seq 37166:38484, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 10 responder > initiator: [.], seq 38484:39802, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 Packet 10 is already sending more than permitted, but conntrack doesn't validate this (only seq is tested vs. maxend, not 'seq+len'). 38484 is acceptable, but only up to 39052, so this packet should not have been sent (or only 568 bytes, not 1318). At this point, connection is still in '300s' mode. Next packet however will get flagged: 11 responder > initiator: [P.], seq 39802:40128, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 326 nf_ct_proto_6: SEQ is over the upper bound (over the window of the receiver) .. LEN=378 .. SEQ=39802 ACK=240 ACK PSH .. Now, a couple of replies/acks comes in: 12 initiator > responder: [.], ack 34530, win 4368, [.. irrelevant acks removed ] 16 initiator > responder: [.], ack 39802, win 8712, options [nop,nop,TS val 296201291 ecr 2982371892], length 0 This ack is significant -- this acks the last packet send by the responder that conntrack considered valid. This means that ack == td_end. This will withdraw the 'unacked data' flag, the connection moves back to the 5-day timeout of established conntracks. 17 initiator > responder: ack 40128, win 10030, ... This packet is also flagged as invalid. Because conntrack only updates state based on packets that are considered valid, packet 11 'did not exist' and that gets us: nf_ct_proto_6: ACK is over upper bound 39803 (ACKed data not seen yet) .. SEQ=240 ACK=40128 WINDOW=10030 RES=0x00 ACK URG Because this received and processed by the endpoints, the conntrack entry remains in a bad state, no packets will ever be considered valid again: 30 responder > initiator: [F.], seq 40432, ack 2045, win 391, .. 31 initiator > responder: [.], ack 40433, win 11348, .. 32 initiator > responder: [F.], seq 2045, ack 40433, win 11348 .. ... all trigger 'ACK is over bound' test and we end up with non-early-evictable 5-day default timeout. NB: This patch triggers a bunch of checkpatch warnings because of silly indent. I will resend the cleanup series linked below to reduce the indent level once this change has propagated to net-next. I could route the cleanup via nf but that causes extra backport work for stable maintainers. Link: https://lore.kernel.org/netfilter-devel/20220720175228.17880-1-fw@strlen.de/T/#mb1d7147d36294573cc4f81d00f9f8dadfdd06cd8 Signed-off-by: Florian Westphal <fw@strlen.de>
2022-08-23netfilter: ebtables: reject blobs that don't provide all entry pointsFlorian Westphal
Harshit Mogalapalli says: In ebt_do_table() function dereferencing 'private->hook_entry[hook]' can lead to NULL pointer dereference. [..] Kernel panic: general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f] [..] RIP: 0010:ebt_do_table+0x1dc/0x1ce0 Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 5c 16 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 6c df 08 48 8d 7d 2c 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 88 [..] Call Trace: nf_hook_slow+0xb1/0x170 __br_forward+0x289/0x730 maybe_deliver+0x24b/0x380 br_flood+0xc6/0x390 br_dev_xmit+0xa2e/0x12c0 For some reason ebtables rejects blobs that provide entry points that are not supported by the table, but what it should instead reject is the opposite: blobs that DO NOT provide an entry point supported by the table. t->valid_hooks is the bitmask of hooks (input, forward ...) that will see packets. Providing an entry point that is not support is harmless (never called/used), but the inverse isn't: it results in a crash because the ebtables traverser doesn't expect a NULL blob for a location its receiving packets for. Instead of fixing all the individual checks, do what iptables is doing and reject all blobs that differ from the expected hooks. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-08-23ACPI: processor: Remove freq Qos request for all CPUsRiwen Lu
The freq Qos request would be removed repeatedly if the cpufreq policy relates to more than one CPU. Then, it would cause the "called for unknown object" warning. Remove the freq Qos request for each CPU relates to the cpufreq policy, instead of removing repeatedly for the last CPU of it. Fixes: a1bb46c36ce3 ("ACPI: processor: Add QoS requests for all CPUs") Reported-by: Jeremy Linton <Jeremy.Linton@arm.com> Tested-by: Jeremy Linton <jeremy.linton@arm.com> Signed-off-by: Riwen Lu <luriwen@kylinos.cn> Cc: 5.4+ <stable@vger.kernel.org> # 5.4+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-08-23nouveau: explicitly wait on the fence in nouveau_bo_move_m2mfKarol Herbst
It is a bit unlcear to us why that's helping, but it does and unbreaks suspend/resume on a lot of GPUs without any known drawbacks. Cc: stable@vger.kernel.org # v5.15+ Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/156 Signed-off-by: Karol Herbst <kherbst@redhat.com> Reviewed-by: Lyude Paul <lyude@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220819200928.401416-1-kherbst@redhat.com
2022-08-23io_uring: fix submission-failure handling for uring-cmdKanchan Joshi
If ->uring_cmd returned an error value different from -EAGAIN or -EIOCBQUEUED, it gets overridden with IOU_OK. This invites trouble as caller (io_uring core code) handles IOU_OK differently than other error codes. Fix this by returning the actual error code. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-23ALSA: seq: oss: Fix data-race for max_midi_devs accessTakashi Iwai
ALSA OSS sequencer refers to a global variable max_midi_devs at creating a new port, storing it to its own field. Meanwhile this variable may be changed by other sequencer events at snd_seq_oss_midi_check_exit_port() in parallel, which may cause a data race. OTOH, this data race itself is almost harmless, as the access to the MIDI device is done via get_mdev() and it's protected with a refcount, hence its presence is guaranteed. Though, it's sill better to address the data-race from the code sanity POV, and this patch adds the proper spinlock for the protection. Reported-by: Abhishek Shah <abhishek.shah@columbia.edu> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/CAEHB2493pZRXs863w58QWnUTtv3HHfg85aYhLn5HJHCwxqtHQg@mail.gmail.com Link: https://lore.kernel.org/r/20220823072717.1706-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2022-08-23net: dsa: don't dereference NULL extack in dsa_slave_changeupper()Vladimir Oltean
When a driver returns -EOPNOTSUPP in dsa_port_bridge_join() but failed to provide a reason for it, DSA attempts to set the extack to say that software fallback will kick in. The problem is, when we use brctl and the legacy bridge ioctls, the extack will be NULL, and DSA dereferences it in the process of setting it. Sergei Antonov proves this using the following stack trace: Unable to handle kernel NULL pointer dereference at virtual address 00000000 PC is at dsa_slave_changeupper+0x5c/0x158 dsa_slave_changeupper from raw_notifier_call_chain+0x38/0x6c raw_notifier_call_chain from __netdev_upper_dev_link+0x198/0x3b4 __netdev_upper_dev_link from netdev_master_upper_dev_link+0x50/0x78 netdev_master_upper_dev_link from br_add_if+0x430/0x7f4 br_add_if from br_ioctl_stub+0x170/0x530 br_ioctl_stub from br_ioctl_call+0x54/0x7c br_ioctl_call from dev_ifsioc+0x4e0/0x6bc dev_ifsioc from dev_ioctl+0x2f8/0x758 dev_ioctl from sock_ioctl+0x5f0/0x674 sock_ioctl from sys_ioctl+0x518/0xe40 sys_ioctl from ret_fast_syscall+0x0/0x1c Fix the problem by only overriding the extack if non-NULL. Fixes: 1c6e8088d9a7 ("net: dsa: allow port_bridge_join() to override extack message") Link: https://lore.kernel.org/netdev/CABikg9wx7vB5eRDAYtvAm7fprJ09Ta27a4ZazC=NX5K4wn6pWA@mail.gmail.com/ Reported-by: Sergei Antonov <saproj@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Sergei Antonov <saproj@gmail.com> Link: https://lore.kernel.org/r/20220819173925.3581871-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-23net: ipvtap - add __init/__exit annotations to module init/exit funcsMaciej Żenczykowski
Looks to have been left out in an oversight. Cc: Mahesh Bandewar <maheshb@google.com> Cc: Sainath Grandhi <sainath.grandhi@intel.com> Fixes: 235a9d89da97 ('ipvtap: IP-VLAN based tap driver') Signed-off-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20220821130808.12143-1-zenczykowski@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23io_uring: fix off-by-one in sync cancelation file checkJens Axboe
The passed in index should be validated against the number of registered files we have, it needs to be smaller than the index value to avoid going one beyond the end. Fixes: 78a861b94959 ("io_uring: add sync cancelation API through io_uring_register()") Reported-by: Luo Likang <luolikang@nsfocus.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-23io_uring: uapi: Add `extern "C"` in io_uring.h for liburingAmmar Faizi
Make it easy for liburing to integrate uapi header with the kernel. Previously, when this header changes, the liburing side can't directly copy this header file due to some small differences. Sync them. Link: https://lore.kernel.org/io-uring/f1feef16-6ea2-0653-238f-4aaee35060b6@kernel.dk Cc: Bart Van Assche <bvanassche@acm.org> Cc: Dylan Yudaken <dylany@fb.com> Cc: Facebook Kernel Team <kernel-team@fb.com> Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-23MAINTAINERS: Add `include/linux/io_uring_types.h`Ammar Faizi
File include/linux/io_uring_types.h doesn't have a maintainer, add it to the io_uring section. Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-23arm64/sme: Don't flush SVE register state when handling SME trapsMark Brown
Currently as part of handling a SME access trap we flush the SVE register state. This is not needed and would corrupt register state if the task has access to the SVE registers already. For non-streaming mode accesses the required flushing will be done in the SVE access trap. For streaming mode SVE register accesses the architecture guarantees that the register state will be flushed when streaming mode is entered or exited so there is no need for us to do so. Simply remove the register initialisation. Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME") Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20220817182324.638214-5-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2022-08-23arm64/sme: Don't flush SVE register state when allocating SME storageMark Brown
Currently when taking a SME access trap we allocate storage for the SVE register state in order to be able to handle storage of streaming mode SVE. Due to the original usage in a purely SVE context the SVE register state allocation this also flushes the register state for SVE if storage was already allocated but in the SME context this is not desirable. For a SME access trap to be taken the task must not be in streaming mode so either there already is SVE register state present for regular SVE mode which would be corrupted or the task does not have TIF_SVE and the flush is redundant. Fix this by adding a flag to sve_alloc() indicating if we are in a SVE context and need to flush the state. Freshly allocated storage is always zeroed either way. Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME") Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20220817182324.638214-4-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2022-08-23arm64/signal: Flush FPSIMD register state when disabling streaming modeMark Brown
When handling a signal delivered to a context with streaming mode enabled we will disable streaming mode for the signal handler, when doing so we should also flush the saved FPSIMD register state like exiting streaming mode in the hardware would do so that if that state is reloaded we get the same behaviour. Without this we will reload whatever the last FPSIMD state that was saved for the task was. Fixes: 40a8e87bb328 ("arm64/sme: Disable ZA and streaming mode when handling signals") Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20220817182324.638214-3-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2022-08-23arm64/signal: Raise limit on stack framesMark Brown
The signal code has a limit of 64K on the size of a stack frame that it will generate, if this limit is exceeded then a process will be killed if it receives a signal. Unfortunately with the advent of SME this limit is too small - the maximum possible size of the ZA register alone is 64K. This is not an issue for practical systems at present but is easily seen using virtual platforms. Raise the limit to 256K, this is substantially more than could be used by any current architecture extension. Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20220817182324.638214-2-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2022-08-23arm64/cache: Fix cache_type_cwg() for register generationMark Brown
Ard noticed that since we converted CTR_EL0 to automatic generation we have been seeing errors on some systems handling the value of cache_type_cwg() such as CPU features: No Cache Writeback Granule information, assuming 128 This is because the manual definition of CTR_EL0_CWG_MASK was done without a shift while our convention is to define the mask after shifting. This means that the user in cache_type_cwg() was broken as it was written for the manually written shift then mask. Fix this by converting to use SYS_FIELD_GET(). The only other field where the _MASK for this register is used is IminLine which is at offset 0 so unaffected. Fixes: 9a3634d02301 ("arm64/sysreg: Convert CTR_EL0 to automatic generation") Reported-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20220818213613.733091-4-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2022-08-23arm64/sysreg: Guard SYS_FIELD_ macros for asmMark Brown
The SYS_FIELD_ macros are not safe for assembly contexts, move them inside the guarded section. Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20220818213613.733091-3-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2022-08-23arm64/sysreg: Directly include bitfield.hMark Brown
The SYS_FIELD_ macros in sysreg.h use definitions from bitfield.h but there is no direct inclusion of it, add one to ensure that sysreg.h is directly usable. Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20220818213613.733091-2-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2022-08-23arm64: cacheinfo: Fix incorrect assignment of signed error value to unsigned ↵Sudeep Holla
fw_level Though acpi_find_last_cache_level() always returned signed value and the document states it will return any errors caused by lack of a PPTT table, it never returned negative values before. Commit 0c80f9e165f8 ("ACPI: PPTT: Leave the table mapped for the runtime usage") however changed it by returning -ENOENT if no PPTT was found. The value returned from acpi_find_last_cache_level() is then assigned to unsigned fw_level. It will result in the number of cache leaves calculated incorrectly as a huge value which will then cause the following warning from __alloc_pages as the order would be great than MAX_ORDER because of incorrect and huge cache leaves value. | WARNING: CPU: 0 PID: 1 at mm/page_alloc.c:5407 __alloc_pages+0x74/0x314 | Modules linked in: | CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.19.0-10393-g7c2a8d3ac4c0 #73 | pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : __alloc_pages+0x74/0x314 | lr : alloc_pages+0xe8/0x318 | Call trace: | __alloc_pages+0x74/0x314 | alloc_pages+0xe8/0x318 | kmalloc_order_trace+0x68/0x1dc | __kmalloc+0x240/0x338 | detect_cache_attributes+0xe0/0x56c | update_siblings_masks+0x38/0x284 | store_cpu_topology+0x78/0x84 | smp_prepare_cpus+0x48/0x134 | kernel_init_freeable+0xc4/0x14c | kernel_init+0x2c/0x1b4 | ret_from_fork+0x10/0x20 Fix the same by changing fw_level to be signed integer and return the error from init_cache_level() early in case of error. Reported-and-Tested-by: Bruno Goncalves <bgoncalv@redhat.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20220808084640.3165368-1-sudeep.holla@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-08-23arm64: errata: add detection for AMEVCNTR01 incrementing incorrectlyIonela Voinescu
The AMU counter AMEVCNTR01 (constant counter) should increment at the same rate as the system counter. On affected Cortex-A510 cores, AMEVCNTR01 increments incorrectly giving a significantly higher output value. This results in inaccurate task scheduler utilization tracking and incorrect feedback on CPU frequency. Work around this problem by returning 0 when reading the affected counter in key locations that results in disabling all users of this counter from using it either for frequency invariance or as FFH reference counter. This effect is the same to firmware disabling affected counters. Details on how the two features are affected by this erratum: - AMU counters will not be used for frequency invariance for affected CPUs and CPUs in the same cpufreq policy. AMUs can still be used for frequency invariance for unaffected CPUs in the system. Although unlikely, if no alternative method can be found to support frequency invariance for affected CPUs (cpufreq based or solution based on platform counters) frequency invariance will be disabled. Please check the chapter on frequency invariance at Documentation/scheduler/sched-capacity.rst for details of its effect. - Given that FFH can be used to fetch either the core or constant counter values, restrictions are lifted regarding any of these counters returning a valid (!0) value. Therefore FFH is considered supported if there is a least one CPU that support AMUs, independent of any counters being disabled or affected by this erratum. Clarifying comments are now added to the cpc_ffh_supported(), cpu_read_constcnt() and cpu_read_corecnt() functions. The above is achieved through adding a new erratum: ARM64_ERRATUM_2457168. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: James Morse <james.morse@arm.com> Link: https://lore.kernel.org/r/20220819103050.24211-1-ionela.voinescu@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-08-23arm64: fix rodata=fullMark Rutland
On arm64, "rodata=full" has been suppored (but not documented) since commit: c55191e96caa9d78 ("arm64: mm: apply r/o permissions of VM areas to its linear alias as well") As it's necessary to determine the rodata configuration early during boot, arm64 has an early_param() handler for this, whereas init/main.c has a __setup() handler which is run later. Unfortunately, this split meant that since commit: f9a40b0890658330 ("init/main.c: return 1 from handled __setup() functions") ... passing "rodata=full" would result in a spurious warning from the __setup() handler (though RO permissions would be configured appropriately). Further, "rodata=full" has been broken since commit: 0d6ea3ac94ca77c5 ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") ... which caused strtobool() to parse "full" as false (in addition to many other values not documented for the "rodata=" kernel parameter. This patch fixes this breakage by: * Moving the core parameter parser to an __early_param(), such that it is available early. * Adding an (optional) arch hook which arm64 can use to parse "full". * Updating the documentation to mention that "full" is valid for arm64. * Having the core parameter parser handle "on" and "off" explicitly, such that any undocumented values (e.g. typos such as "ful") are reported as errors rather than being silently accepted. Note that __setup() and early_param() have opposite conventions for their return values, where __setup() uses 1 to indicate a parameter was handled and early_param() uses 0 to indicate a parameter was handled. Fixes: f9a40b089065 ("init/main.c: return 1 from handled __setup() functions") Fixes: 0d6ea3ac94ca ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jagdish Gediya <jvgediya@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20220817154022.3974645-1-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-08-23arm64: Fix comment typoKuan-Ying Lee
Replace wrong 'FIQ EL1h' comment with 'FIQ EL1t'. Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Link: https://lore.kernel.org/r/20220721030531.21234-1-Kuan-Ying.Lee@mediatek.com Signed-off-by: Will Deacon <will@kernel.org>
2022-08-23Merge branch 'dsa-changes-for-multiple-cpu-ports-part-3'Paolo Abeni
Vladimir Oltean says: ==================== DSA changes for multiple CPU ports (part 3) Those who have been following part 1: https://patchwork.kernel.org/project/netdevbpf/cover/20220511095020.562461-1-vladimir.oltean@nxp.com/ and part 2: https://patchwork.kernel.org/project/netdevbpf/cover/20220521213743.2735445-1-vladimir.oltean@nxp.com/ will know that I am trying to enable the second internal port pair from the NXP LS1028A Felix switch for DSA-tagged traffic via "ocelot-8021q". This series represents part 3 of that effort. Covered here are some preparations in DSA for handling multiple DSA masters: - when changing the tagging protocol via sysfs - when the masters go down as well as preparation for monitoring the upper devices of a DSA master (to support DSA masters under a LAG). There are also 2 small preparations for the ocelot driver, for the case where multiple tag_8021q CPU ports are used in a LAG. Both those changes have to do with PGID forwarding domains. Compared to v1, the patches were trimmed down to just another preparation stage, and the UAPI changes were pushed further out to part 4. https://patchwork.kernel.org/project/netdevbpf/cover/20220523104256.3556016-1-olteanv@gmail.com/ Compared to v2, I had to export a symbol I forgot to (ocelot_port_teardown_dsa_8021q_cpu), to avoid a build breakage when the felix and seville drivers are built as modules. ==================== Link: https://lore.kernel.org/r/20220819174820.3585002-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23net: mscc: ocelot: adjust forwarding domain for CPU ports in a LAGVladimir Oltean
Currently when we have 2 CPU ports configured for DSA tag_8021q mode and we put them in a LAG, a PGID dump looks like this: PGID_SRC[0] = ports 4, PGID_SRC[1] = ports 4, PGID_SRC[2] = ports 4, PGID_SRC[3] = ports 4, PGID_SRC[4] = ports 0, 1, 2, 3, 4, 5, PGID_SRC[5] = no ports (ports 0-3 are user ports, ports 4 and 5 are CPU ports) There are 2 problems with the configuration above: - user ports should enable forwarding towards both CPU ports, not just 4, and the aggregation PGIDs should prune one CPU port or the other from the destination port mask, based on a hash computed from packet headers. - CPU ports should not be allowed to forward towards themselves and also not towards other ports in the same LAG as themselves The first problem requires fixing up the PGID_SRC of user ports, when ocelot_port_assigned_dsa_8021q_cpu_mask() is called. We need to say that when a user port is assigned to a tag_8021q CPU port and that port is in a LAG, it should forward towards all ports in that LAG. The second problem requires fixing up the PGID_SRC of port 4, to remove ports 4 and 5 (in a LAG) from the allowed destinations. After this change, the PGID source masks look as follows: PGID_SRC[0] = ports 4, 5, PGID_SRC[1] = ports 4, 5, PGID_SRC[2] = ports 4, 5, PGID_SRC[3] = ports 4, 5, PGID_SRC[4] = ports 0, 1, 2, 3, PGID_SRC[5] = no ports Note that PGID_SRC[5] still looks weird (it should say "0, 1, 2, 3" just like PGID_SRC[4] does), but I've tested forwarding through this CPU port and it doesn't seem like anything is affected (it appears that PGID_SRC[4] is being looked up on forwarding from the CPU, since both ports 4 and 5 have logical port ID 4). The reason why it looks weird is because we've never called ocelot_port_assign_dsa_8021q_cpu() for any user port towards port 5 (all user ports are assigned to port 4 which is in a LAG with 5). Since things aren't broken, I'm willing to leave it like that for now and just document the oddity. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23net: mscc: ocelot: set up tag_8021q CPU ports independent of user port affinityVladimir Oltean
This is a partial revert of commit c295f9831f1d ("net: mscc: ocelot: switch from {,un}set to {,un}assign for tag_8021q CPU ports"), because as it turns out, this isn't how tag_8021q CPU ports under a LAG are supposed to work. Under that scenario, all user ports are "assigned" to the single tag_8021q CPU port represented by the logical port corresponding to the bonding interface. So one CPU port in a LAG would have is_dsa_8021q_cpu set to true (the one whose physical port ID is equal to the logical port ID), and the other one to false. In turn, this makes 2 undesirable things happen: (1) PGID_CPU contains only the first physical CPU port, rather than both (2) only the first CPU port will be added to the private VLANs used by ocelot for VLAN-unaware bridging To make the driver behave in the same way for both bonded CPU ports, we need to bring back the old concept of setting up a port as a tag_8021q CPU port, and this is what deals with VLAN membership and PGID_CPU updating. But we also need the CPU port "assignment" (the user to CPU port affinity), and this is what updates the PGID_SRC forwarding rules. All DSA CPU ports are statically configured for tag_8021q mode when the tagging protocol is changed to ocelot-8021q. User ports are "assigned" to one CPU port or the other dynamically (this will be handled by a future change). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23net: dsa: use dsa_tree_for_each_cpu_port in dsa_tree_{setup,teardown}_masterVladimir Oltean
More logic will be added to dsa_tree_setup_master() and dsa_tree_teardown_master() in upcoming changes. Reduce the indentation by one level in these functions by introducing and using a dedicated iterator for CPU ports of a tree. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23net: dsa: all DSA masters must be down when changing the tagging protocolVladimir Oltean
The fact that the tagging protocol is set and queried from the /sys/class/net/<dsa-master>/dsa/tagging file is a bit of a quirk from the single CPU port days which isn't aging very well now that DSA can have more than a single CPU port. This is because the tagging protocol is a switch property, yet in the presence of multiple CPU ports it can be queried and set from multiple sysfs files, all of which are handled by the same implementation. The current logic ensures that the net device whose sysfs file we're changing the tagging protocol through must be down. That net device is the DSA master, and this is fine for single DSA master / CPU port setups. But exactly because the tagging protocol is per switch [ tree, in fact ] and not per DSA master, this isn't fine any longer with multiple CPU ports, and we must iterate through the tree and find all DSA masters, and make sure that all of them are down. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23net: dsa: only bring down user ports assigned to a given DSA masterVladimir Oltean
This is an adaptation of commit c0a8a9c27493 ("net: dsa: automatically bring user ports down when master goes down") for multiple DSA masters. When a DSA master goes down, only the user ports under its control should go down too, the others can still send/receive traffic. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23net: dsa: existing DSA masters cannot join upper interfacesVladimir Oltean
All the traffic to/from a DSA master is supposed to be distributed among its DSA switch upper interfaces, so we should not allow other upper device kinds. An exception to this is DSA_TAG_PROTO_NONE (switches with no DSA tags), and in that case it is actually expected to create e.g. VLAN interfaces on the master. But for those, netdev_uses_dsa(master) returns false, so the restriction doesn't apply. The motivation for this change is to allow LAG interfaces of DSA masters to be DSA masters themselves. We want to restrict the user's degrees of freedom by 1: the LAG should already have all DSA masters as lowers, and while lower ports of the LAG can be removed, none can be added after the fact. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23net: bridge: move DSA master bridging restriction to DSAVladimir Oltean
When DSA gains support for multiple CPU ports in a LAG, it will become mandatory to monitor the changeupper events for the DSA master. In fact, there are already some restrictions to be imposed in that area, namely that a DSA master cannot be a bridge port except in some special circumstances. Centralize the restrictions at the level of the DSA layer as a preliminary step. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23net: dsa: don't stop at NOTIFY_OK when calling ds->ops->port_prechangeupperVladimir Oltean
dsa_slave_prechangeupper_sanity_check() is supposed to enforce some adjacency restrictions, and calls ds->ops->port_prechangeupper if the driver implements it. We convert the error code from the port_prechangeupper() call to a notifier code, and 0 is converted to NOTIFY_OK, but the caller of dsa_slave_prechangeupper_sanity_check() stops at any notifier code different from NOTIFY_DONE. Avoid this by converting back the notifier code to an error code, so that both NOTIFY_OK and NOTIFY_DONE will be seen as 0. This allows more parallel sanity check functions to be added. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23net: dsa: walk through all changeupper notifier functionsVladimir Oltean
Traditionally, DSA has had a single netdev notifier handling function for each device type. For the sake of code cleanliness, we would like to introduce more handling functions which do one thing, but the conditions for entering these functions start to overlap. Example: a handling function which tracks whether any bridges contain both DSA and non-DSA interfaces. Either this is placed before dsa_slave_changeupper(), case in which it will prevent that function from executing, or we place it after dsa_slave_changeupper(), case in which we will prevent it from executing. The other alternative is to ignore errors from the new handling function (not ideal). To support this usage, we need to change the pattern. In the new model, we enter all notifier handling sub-functions, and exit with NOTIFY_DONE if there is nothing to do. This allows the sub-functions to be relatively free-form and independent from each other. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23docs/arm64: elf_hwcaps: unify newlines in HWCAP listsMartin Liška
Unify horizontal spacing (remove extra newlines) which are sensitive to visual presentation by Sphinx. Fixes: 5e64b862c482 ("arm64/sme: Basic enumeration support") Signed-off-by: Martin Liska <mliska@suse.cz> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/84e3d6cc-75cf-d6f3-9bb8-be02075aaf6d@suse.cz Signed-off-by: Will Deacon <will@kernel.org>
2022-08-23Merge branch 'vsock-updates-for-so_rcvlowat-handling'Paolo Abeni
Arseniy Krasnov says: ==================== vsock: updates for SO_RCVLOWAT handling This patchset includes some updates for SO_RCVLOWAT: 1) af_vsock: During my experiments with zerocopy receive, i found, that in some cases, poll() implementation violates POSIX: when socket has non- default SO_RCVLOWAT(e.g. not 1), poll() will always set POLLIN and POLLRDNORM bits in 'revents' even number of bytes available to read on socket is smaller than SO_RCVLOWAT value. In this case,user sees POLLIN flag and then tries to read data(for example using 'read()' call), but read call will be blocked, because SO_RCVLOWAT logic is supported in dequeue loop in af_vsock.c. But the same time, POSIX requires that: "POLLIN Data other than high-priority data may be read without blocking. POLLRDNORM Normal data may be read without blocking." See https://www.open-std.org/jtc1/sc22/open/n4217.pdf, page 293. So, we have, that poll() syscall returns POLLIN, but read call will be blocked. Also in man page socket(7) i found that: "Since Linux 2.6.28, select(2), poll(2), and epoll(7) indicate a socket as readable only if at least SO_RCVLOWAT bytes are available." I checked TCP callback for poll()(net/ipv4/tcp.c, tcp_poll()), it uses SO_RCVLOWAT value to set POLLIN bit, also i've tested TCP with this case for TCP socket, it works as POSIX required. I've added some fixes to af_vsock.c and virtio_transport_common.c, test is also implemented. 2) virtio/vsock: It adds some optimization to wake ups, when new data arrived. Now, SO_RCVLOWAT is considered before wake up sleepers who wait new data. There is no sense, to kick waiter, when number of available bytes in socket's queue < SO_RCVLOWAT, because if we wake up reader in this case, it will wait for SO_RCVLOWAT data anyway during dequeue, or in poll() case, POLLIN/POLLRDNORM bits won't be set, so such exit from poll() will be "spurious". This logic is also used in TCP sockets. 3) vmci/vsock: Same as 2), but i'm not sure about this changes. Will be very good, to get comments from someone who knows this code. 4) Hyper-V: As Dexuan Cui mentioned, for Hyper-V transport it is difficult to support SO_RCVLOWAT, so he suggested to disable this feature for Hyper-V. ==================== Link: https://lore.kernel.org/r/de41de4c-0345-34d7-7c36-4345258b7ba8@sberdevices.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23vsock_test: POLLIN + SO_RCVLOWAT testArseniy Krasnov
This adds test to check, that when poll() returns POLLIN, POLLRDNORM bits, next read call won't block. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23vmci/vsock: check SO_RCVLOWAT before wake up readerArseniy Krasnov
This adds extra condition to wake up data reader: do it only when number of readable bytes >= SO_RCVLOWAT. Otherwise, there is no sense to kick user, because it will wait until SO_RCVLOWAT bytes will be dequeued. This check is performed in vsock_data_ready(). Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Vishnu Dasa <vdasa@vmware.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23virtio/vsock: check SO_RCVLOWAT before wake up readerArseniy Krasnov
This adds extra condition to wake up data reader: do it only when number of readable bytes >= SO_RCVLOWAT. Otherwise, there is no sense to kick user,because it will wait until SO_RCVLOWAT bytes will be dequeued. This check is performed in vsock_data_ready(). Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23vsock: add API call for data readyArseniy Krasnov
This adds 'vsock_data_ready()' which must be called by transport to kick sleeping data readers. It checks for SO_RCVLOWAT value before waking user, thus preventing spurious wake ups. Based on 'tcp_data_ready()' logic. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23vsock: pass sock_rcvlowat to notify_poll_in as targetArseniy Krasnov
Passing 1 as the target to notify_poll_in(), we don't honor what the user has set via SO_RCVLOWAT, going to set POLLIN and POLLRDNORM, even if we don't have the amount of bytes expected by the user. Let's use sock_rcvlowat() to get the right target to pass to notify_poll_in(); Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23vmci/vsock: use 'target' in notify_poll_in callbackArseniy Krasnov
This callback controls setting of POLLIN, POLLRDNORM output bits of poll() syscall, but in some cases, it is incorrectly to set it, when socket has at least 1 bytes of available data. Use 'target' which is already exists. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Vishnu Dasa <vdasa@vmware.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23virtio/vsock: use 'target' in notify_poll_in callbackArseniy Krasnov
This callback controls setting of POLLIN, POLLRDNORM output bits of poll() syscall, but in some cases, it is incorrectly to set it, when socket has at least 1 bytes of available data. Use 'target' which is already exists. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23hv_sock: disable SO_RCVLOWAT supportArseniy Krasnov
For Hyper-V it is quiet difficult to support this socket option,due to transport internals, so disable it. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23vsock: SO_RCVLOWAT transport set callbackArseniy Krasnov
This adds transport specific callback for SO_RCVLOWAT, because in some transports it may be difficult to know current available number of bytes ready to read. Thus, when SO_RCVLOWAT is set, transport may reject it. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23net: sched: remove duplicate check of user rights in qdiscZhengchao Shao
In rtnetlink_rcv_msg function, the permission for all user operations is checked except the GET operation, which is the same as the checking in qdisc. Therefore, remove the user rights check in qdisc. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Message-Id: <20220819041854.83372-1-shaozhengchao@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23smb3: missing inode locks in punch holeDavid Howells
smb3 fallocate punch hole was not grabbing the inode or filemap_invalidate locks so could have race with pagemap reinstantiating the page. Cc: stable@vger.kernel.org Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>