Age | Commit message | Author |
|
Adjust EMLSR activation to account for traffic levels. By
tracking the number of RX/TX MPDUs, EMLSR will be activated only when
traffic volume meets the required threshold.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240505091420.9480f99ac8fc.If9eb946e929a39e10fe5f4638bc8bc3f8976edf1@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
When an event occurs to unblock EMLSR, the code attempts to re-enable
EMLSR. However, the current implementation always tries to activate
EMLSR, regardless of whether the blocker was set before the unblocking
event or not. If EMLSR was already unblocked, there is no need to
re-activate it.
Fixes: 6cf7df9f013f ("wifi: iwlwifi: mvm: Add helper functions to update EMLSR status")
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240505091420.eb861402dac9.I6a1d9f774f5551cfab60ea37b71a62640496af9b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
EMLSR can't be activated from mac80211, except via debugfs, which is
intended for testing purposes. Currently we don't allow entering EMLSR
from debugfs if EMLSR is blocked, i.e. if mvmvif::esr_disable_reason is
not 0. But for testing we need a way to activate EMLSR regardless of the
vif being blocked. Remove the check of esr_disable_reason.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240505091420.bc3c24d9e0e6.Iad60e22a0d7e2b2b989051e1140b6dc98bef7bcc@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This is needed for testing purposes.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240505091420.eba2b6f0664c.I5f058e02abda11bf2eccfd2bcb59ca26bae87a3a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
If the reason for exiting EMLSR was a blocking reason, wait for the
corresponding unblocking event:
- If there is an ongoing scan - do nothing. Link selection will be
triggered at the end of it.
- If more than 30 seconds have passed since the exit, trigger an MLO scan,
which will trigger link selection.
- If less than 30 seconds have passed since the exit, reuse the latest link
selection result.
If the reason for exiting EMLSR was an exit reason (IWL_MVM_EXIT_*),
schedule an MLO scan in 30 seconds.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Link: https://msgid.link/20240505091420.6a808c4ae8f5.Ia79605838eb6deee9358bec633ef537f2653db92@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
BT Coex disables EMLSR only for a 2.4 GHz link, but doesn't block the
vif from using EMLSR with a different link pair. In addition, storing it
in mvmvif::disable_esr_reason requires extracting the BT Coex bit before
checking whether EMLSR is blocked or not for a specific vif.
Therefore, change the BT Coex bit to be an exit reason and not a
blocker. On link selection, EMLSR mode will be re-calculated for the 2.4
GHz link instead of checking that bit.
While at it, move the relevant function declarations to the EMLSR
functions area in mvm.h.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240505091420.a2e93b67c895.I183a0039ef076613144648cc46fbe9ab3d47c574@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Given how late we are in the cycle, merge the two fixes from
wireless into wireless-next as they don't seem that urgent.
This way, the wireless tree won't need rebasing later.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
It's the only remaining call site, so there is no need for this to
be separated anymore.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The existing code uses iter->type to figure out what data is needed: the
live copy (READ) or the clone (UPDATE).
Without pending updates, priv->clone and priv->match will point to
different memory locations, but they have identical content.
A future patch will make priv->clone == NULL if there are no pending
changes; in that case we must copy the live data for the UPDATE case.
Currently this would require GFP_ATOMIC allocation. Split the walk
function in two parts: one that does the walk and one that decides which
data is needed.
In the UPDATE case, callers hold the transaction mutex, so we do not need
the rcu read lock. This allows using GFP_KERNEL allocation while
cloning.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Once priv->clone can be NULL (when no insertions/removals occurred
in the last transaction), we need to drop set elements from priv->match
if priv->clone is NULL.
While at it, condense this function by reusing the pipapo_free_match
helper instead of the open-coded version.
The rcu_barrier() is removed, it's not needed: old call_rcu instances
for pipapo_reclaim_match do not access struct nft_set.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Currently it returns an error pointer, but the only possible failure
is ENOMEM.
After a follow-up patch, we'd need to discard the errno code, i.e.

	x = pipapo_clone()
	if (IS_ERR(x))
		return NULL

or make more changes to fix up callers to expect IS_ERR() code
from set->ops->deactivate().
So simplify this and make it return ptr-or-null.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Preparation patch: the helper will soon get called from the insert
function too.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Not sure why this special case exists. Early drop logic
(which kicks in when the conntrack table is full) should be independent
of flowtable offload and only consider the assured bit (i.e., two-way
traffic was seen).
flowtable entries hold a reference to the conntrack entry (struct
nf_conn) that has been offloaded. The conntrack use count is not
decremented until after the entry is freed.
This change therefore will not result in exceeding the conntrack table
limit. It does allow early-drop of tcp flows even when they've been
offloaded, but only if they have been offloaded before syn-ack was
received or after at least one peer has sent a fin.
Currently 'fin' packet reception already stops offloading, so this
should not impact offloading either.
Cc: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The referenced sysctl doesn't exist anymore.
Fixes: 4592ee7f525c ("netfilter: conntrack: remove offload_pickup sysctl again")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Back in 2009, one patch [1] introduced collecting the drop counter
in nf_conntrack_in() by returning -NF_DROP. Later, another
patch [2] changed the return value of tcp_packet(), which has since been
renamed to nf_conntrack_tcp_packet(), from -NF_DROP to NF_DROP. As
we can see, the remaining -NF_DROP should be corrected.
Similarly, there are two other places where -NF_DROP is used.
As NF_DROP is equal to 0, negating NF_DROP makes no sense,
as patch [2] pointed out many years ago.
[1]
commit 7d1e04598e5e ("netfilter: nf_conntrack: account packets drop by tcp_packet()")
[2]
commit ec8d540969da ("netfilter: conntrack: fix dropping packet after l4proto->packet()")
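For clarity, the verdict values make the negation a no-op (NF_DROP is
defined as 0 in include/uapi/linux/netfilter.h):

	return -NF_DROP;	/* evaluates to 0, i.e. the same as NF_DROP */
	return NF_DROP;		/* identical value, conventional spelling */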
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
We're using mutex_lock() inside a wait_event() conditional -
prepare_to_wait() has already flipped task state, so potentially
blocking ops need annotation.
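For context, the problematic shape is roughly the following (illustrative
only, not the actual bcachefs code):

	/*
	 * The condition expression runs after prepare_to_wait() has set the
	 * task state, so a sleeping call such as mutex_lock() inside it
	 * trips the might-sleep checks.
	 */
	wait_event(wq, ({
		bool done;

		mutex_lock(&lock);
		done = condition_holds();
		mutex_unlock(&lock);
		done;
	}));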
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Arnd Bergmann sent a patch to fsdevel, he says:
"orangefs_statfs() copies two consecutive fields of the superblock into
the statfs structure, which triggers a warning from the string fortification
helpers"
Jan Kara suggested an alternate way to do the patch to make it more readable.
I ran both ideas through xfstests and both seem fine. This patch
is based on Jan Kara's suggestion.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Add MMIO trace event support; currently the generic MMIO events
KVM_TRACE_MMIO_WRITE/xxx_READ/xx_READ_UNSATISFIED are added here.
Also, a vcpu id field is added to all KVM trace events, since the perf
KVM tool parses vcpu id information for the kvm entry event.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
When a VM runs in KVM mode, the system will not exit to host mode when
executing a general software breakpoint instruction such as INSN_BREAK;
the trap exception happens in guest mode rather than host mode. In order
to debug the guest kernel on the host side, some mechanism is needed to
let the VM exit to host mode.
Here a hypercall instruction with a special code is used for software
breakpoints. The VM exits to host mode, the KVM hypervisor identifies
the special hypercall code and sets exit_reason to KVM_EXIT_DEBUG, and
then lets qemu handle it.
The idea comes from ppc kvm: one API, KVM_REG_LOONGARCH_DEBUG_INST, is
added to get the hypercall code. The VMM needs to get the sw breakpoint
instruction with this API and set the corresponding sw breakpoint for the
guest kernel.
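A rough sketch of the VMM side, using the standard KVM_GET_ONE_REG ioctl
(variable names illustrative, error handling omitted):

	__u64 brk_insn;
	struct kvm_one_reg reg = {
		.id   = KVM_REG_LOONGARCH_DEBUG_INST,
		.addr = (__u64)(unsigned long)&brk_insn,
	};

	/* brk_insn now holds the instruction to patch in as a sw breakpoint */
	ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);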
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
A PARAVIRT config option and PV IPI support are added for the guest side;
the function pv_ipi_init() is used to add the IPI sending and IPI receiving
hooks. This function first checks whether the system runs in VM mode, and
if the kernel runs in VM mode, it calls kvm_para_available() to detect
the current hypervisor type (for now only KVM detection is supported).
The paravirt functions can work only if the current hypervisor type is KVM,
since only KVM is supported on LoongArch now.
PV IPI uses virtual IPI sender and virtual IPI receiver functions. With the
virtual IPI sender, the IPI message is stored in memory rather than in
emulated HW. IPI multicast is also supported, and 128 vcpus can receive
IPIs at the same time, like the x86 KVM method. The hypercall method is
used for IPI sending.
With the virtual IPI receiver, HW SWI0 is used rather than the real IPI HW.
Since each VCPU has a separate HW SWI0, like the HW timer, there is no trap
in IPI interrupt acknowledge. Since the IPI message is stored in memory,
there is no trap in getting the IPI message.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
On LoongArch systems, the IPI HW uses iocsr registers. There is one iocsr
register access on IPI sending, and two iocsr accesses on IPI receiving
in the IPI interrupt handler. In VM mode, every iocsr access causes the
VM to trap into the hypervisor, so one IPI HW notification results in
three traps.
In this patch, PV IPI is added for the VM: a hypercall instruction is used
on the IPI sender side, and the hypervisor injects an SWI to the
destination vcpu. In the SWI interrupt handler, only the CSR.ESTAT register
is written to clear the irq. CSR.ESTAT register access does not trap into
the hypervisor, so with PV IPI supported there is one trap on the IPI
sender side and no trap on the IPI receiver side; there is only one trap
per IPI notification.
This patch also adds IPI multicast support; the method is similar to x86.
With IPI multicast support, an IPI notification can be sent to at most
128 vcpus at one time. This greatly reduces the number of traps into the
hypervisor.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
The physical CPUID is used for interrupt routing for irqchips such as the
ipi, msgint and eiointc interrupt controllers. The physical CPUID is stored
in the CSR register LOONGARCH_CSR_CPUID; it can not be changed once the
vcpu is created, and the physical CPUIDs of two vcpus cannot be the same.
Different irqchips have different size declarations for the physical CPUID:
the max CPUID value for CSR LOONGARCH_CSR_CPUID on Loongson-3A5000 is
512, the max CPUID supported by the IPI hardware is 1024, while for the
eiointc irqchip it is 256, and for the msgint irqchip it is 65536.
The smallest value from all interrupt controllers is selected now, and
the max cpuid size is defined as 256 by KVM, which comes from the eiointc
irqchip.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
The cpucfg instruction can be used to get processor features. It triggers
a trap exception when executed in VM mode, so it can also be used to
provide CPU features to the VM. On real hardware, cpucfg areas 0 - 20
are used by now. Here one dedicated area, 0x40000000 - 0x400000ff, is used
for the KVM hypervisor to provide PV features, and the area can be extended
for other hypervisors in the future. This area will never be used by real
HW; it is only used by software.
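A guest could probe that range roughly as follows (the leaf constant below
is illustrative; read_cpucfg() is the existing LoongArch accessor):

	#define CPUCFG_KVM_SIG	0x40000000	/* first leaf of the PV range */

	unsigned int sig = read_cpucfg(CPUCFG_KVM_SIG);
	/* a hypervisor such as KVM reports its signature here; the range is
	 * never implemented by real hardware */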
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
On LoongArch systems, there is a hypercall instruction dedicated to
virtualization. When this instruction is executed on the host side,
an illegal instruction exception is reported; however, it traps into
the host when it is executed in VM mode.
When the hypercall is emulated, the A0 register is set to
KVM_HCALL_INVALID_CODE rather than injecting an EXCCODE_INE invalid
instruction exception, so the VM can continue executing the next code.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Refine the IPI handling on the LoongArch platform; there are three
modifications:
1. Add a generic function get_percpu_irq(), replacing percpu irq
functions such as get_ipi_irq()/get_pmc_irq()/get_timer_irq() with
get_percpu_irq().
2. Change the definition of the action parameter taken by
loongson_send_ipi_single() and loongson_send_ipi_mask(): it is now defined
in decimal encoding on the IPI sender side. Normal decimal encoding is
used rather than binary bitmap encoding for the IPI action; the IPI HW
sender uses the decimal encoding, and the IPI receiver still gets a binary
bitmap encoding, because the IPI HW converts it into a bitmap in the IPI
message buffer.
3. Add a struct smp_ops on the LoongArch platform so that PV IPI can be
used later.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
No functional changes intended.
Fixes: f2298c0403b0 ("null_blk: multi queue aware block test driver")
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240506075538.6064-1-yanjun.zhu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It is unused.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240506122848.20326-1-bp@kernel.org
|
|
The race condition around the ECCCLR register access happens in the IRQ
disable method called in the device remove() procedure and in the ECC IRQ
handler:
1. Enable IRQ:
a. ECCCLR = EN_CE | EN_UE
2. Disable IRQ:
a. ECCCLR = 0
3. IRQ handler:
a. ECCCLR = CLR_CE | CLR_CE_CNT | CLR_UE | CLR_UE_CNT
b. ECCCLR = 0
c. ECCCLR = EN_CE | EN_UE
So if the IRQ disabling procedure is called concurrently with the IRQ
handler method, the IRQ might actually be left enabled due to
statement 3c.
The root cause of the problem is that the ECCCLR register (which since
v3.10a has been called ECCCTL) intermixes the ECC status data clear
flags and the IRQ enable/disable flags. Thus the IRQ disabling (clear EN
flags) and handling (write 1 to clear ECC status data) procedures must
be serialised around the ECCCTL register modification to prevent the
race.
So fix the problem described above by adding a spin-lock around the
ECCCLR modifications and preventing the IRQ handler from modifying the
IRQ enable flags (there is no point in disabling the IRQ and then
re-enabling it again within a single IRQ handler call, see
statements 3a/3b and 3c above).
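The serialisation described above boils down to a pattern like this (the
lock and register names are illustrative, not necessarily the driver's;
regval and flags come from the surrounding function):

	spin_lock_irqsave(&priv->reglock, flags);
	writel(regval, priv->baseaddr + ECC_CLR_OFST);
	spin_unlock_irqrestore(&priv->reglock, flags);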
Fixes: f7824ded4149 ("EDAC/synopsys: Add support for version 3 of the Synopsys EDAC DDR")
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240222181324.28242-2-fancer.lancer@gmail.com
|
|
There is code that builds with calls to IO accessors even when
CONFIG_PCI=n, but the actual calls are guarded by runtime checks.
If they were not, those calls would fault, because the page at virtual
address zero is (usually) not mapped into the kernel. As Arnd pointed
out, it is possible that a large port value could cause the address to be
above mmap_min_addr, which would then access userspace, which would be
a bug.
To avoid any such issues, set _IO_BASE to POISON_POINTER_DELTA. That
is a value chosen to point into unmapped space between the kernel and
userspace, so any access will always fault.
Note that on 32-bit POISON_POINTER_DELTA is 0, so the patch only has an
effect on 64-bit.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240503075619.394467-2-mpe@ellerman.id.au
|
|
With -Wextra clang warns about pointer arithmetic using a null pointer.
When building with CONFIG_PCI=n, that triggers a warning in the IO
accessors, eg:
In file included from linux/arch/powerpc/include/asm/io.h:672:
linux/arch/powerpc/include/asm/io-defs.h:23:1: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
23 | DEF_PCI_AC_RET(inb, u8, (unsigned long port), (port), pio, port)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
linux/arch/powerpc/include/asm/io.h:591:53: note: expanded from macro '__do_inb'
591 | #define __do_inb(port) readb((PCI_IO_ADDR)_IO_BASE + port);
| ~~~~~~~~~~~~~~~~~~~~~ ^
That is because when CONFIG_PCI=n, _IO_BASE is defined as 0.
Although _IO_BASE is defined as plain 0, the cast (PCI_IO_ADDR) converts
it to void * before the addition with port happens.
Instead the addition can be done first, and then the cast. The resulting
value will be the same, but avoids the warning, and also avoids void
pointer arithmetic which is apparently non-standard.
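In other words, the change amounts to moving the cast; a sketch based on
the macro quoted above (not the exact diff):

	/* before: void-pointer arithmetic, warns when _IO_BASE is 0 */
	#define __do_inb(port)	readb((PCI_IO_ADDR)_IO_BASE + port);

	/* after: do the integer addition first, then cast the result */
	#define __do_inb(port)	readb((PCI_IO_ADDR)(_IO_BASE + port));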
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Closes: https://lore.kernel.org/all/CA+G9fYtEh8zmq8k8wE-8RZwW-Qr927RLTn+KqGnq1F=ptaaNsA@mail.gmail.com
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240503075619.394467-1-mpe@ellerman.id.au
|
|
Currently, bpf jit code on powerpc assumes all the bpf functions and
helpers to be part of core kernel text. This is false for kfunc case,
as function addresses may not be part of core kernel text area. So,
add support for addresses that are not within core kernel text area
too, to enable kfunc support. Emit instructions based on whether the
function address is within core kernel text address or not, to retain
optimized instruction sequence where possible.
In case of PCREL, as a bpf function that is not within core kernel
text area is likely to go out of range with relative addressing on
kernel base, use PC relative addressing. If that goes out of range,
load the full address with PPC_LI64().
With addresses that are not within core kernel text area supported,
override bpf_jit_supports_kfunc_call() to enable kfunc support. Also,
override bpf_jit_supports_far_kfunc_call() to enable 64-bit pointers,
as an address offset can be more than 32-bit long on PPC64.
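The emit-time decision described above looks roughly like this (pseudocode,
not the actual JIT source; core_kernel_text() is an existing kernel helper
and the config symbol named below is the PCREL build option):

	if (core_kernel_text(func_addr)) {
		/* keep the optimized kernel-base-relative sequence */
	} else if (IS_ENABLED(CONFIG_PPC_KERNEL_PCREL)) {
		/* try PC-relative addressing; if the offset is out of
		 * range, load the full 64-bit address with PPC_LI64() */
	} else {
		/* load the full 64-bit address, e.g. with PPC_LI64() */
	}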
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240502173205.142794-2-hbathini@linux.ibm.com
|
|
With PCREL addressing, there is no kernel TOC. So, it is not setup in
prologue when PCREL addressing is used. But the number of instructions
to skip on a tail call was not adjusted accordingly. That resulted in
not so obvious failures while using tailcalls. 'tailcalls' selftest
crashed the system with the below call trace:
bpf_test_run+0xe8/0x3cc (unreliable)
bpf_prog_test_run_skb+0x348/0x778
__sys_bpf+0xb04/0x2b00
sys_bpf+0x28/0x38
system_call_exception+0x168/0x340
system_call_vectored_common+0x15c/0x2ec
Also, as bpf programs are always module addresses and a bpf helper in
general is a core kernel text address, using PC relative addressing
often fails with "out of range of pcrel address" error. Switch to
using kernel base for relative addressing to handle this better.
Fixes: 7e3a68be42e1 ("powerpc/64: vmlinux support building with PCREL addresing")
Cc: stable@vger.kernel.org # v6.4+
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240502173205.142794-1-hbathini@linux.ibm.com
|
|
This list was moved many years ago.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240503121012.3ba5000b@canb.auug.org.au
|
|
Documents how to use the PR_PPC_GET_DEXCR and PR_PPC_SET_DEXCR prctl()'s
for changing a process's DEXCR or its process tree default value.
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240417112325.728010-10-bgray@linux.ibm.com
|
|
Adds a utility to exercise the prctl DEXCR inheritance in the shell.
Supports setting and clearing each aspect.
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
[mpe: Use correct SPDX license, use execvp() for usability, print errors]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240417112325.728010-9-bgray@linux.ibm.com
|
|
Now that the DEXCR can be configured with prctl, add a section in
lsdexcr that explains why each aspect is set the way it is.
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240417112325.728010-8-bgray@linux.ibm.com
|
|
Now that a process can control its DEXCR to some extent, make the
hashchk tests more reliable by explicitly setting the local and onexec
NPHIE aspect.
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240417112325.728010-7-bgray@linux.ibm.com
|
|
Some basic tests of the prctl interface of the DEXCR.
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
[mpe: Add missing SPDX tag]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240417112325.728010-6-bgray@linux.ibm.com
|
|
Now that we track a DEXCR on a per-task basis, individual tasks are free
to configure it as they like.
The interface is a pair of getter/setter prctl's that work on a single
aspect at a time (multiple aspects at once is more difficult if there
are different rules applied for each aspect, now or in future). The
getter shows the current state of the process config, and the setter
allows setting/clearing the aspect.
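A minimal usage sketch; the aspect/control constant names below are written
from memory of the uapi header and should be checked against <linux/prctl.h>:

	#include <sys/prctl.h>

	/* query the NPHIE (hashst/hashchk) aspect for the current process */
	int status = prctl(PR_PPC_GET_DEXCR, PR_PPC_DEXCR_NPHIE, 0, 0, 0);

	/* set it for this process if the getter did not report an error */
	if (status >= 0)
		prctl(PR_PPC_SET_DEXCR, PR_PPC_DEXCR_NPHIE,
		      PR_PPC_DEXCR_CTRL_SET, 0, 0);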
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
[mpe: Account for PR_RISCV_SET_ICACHE_FLUSH_CTX, shrink some longs lines]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240417112325.728010-5-bgray@linux.ibm.com
|
|
net_alloc_generic is called by net_alloc, which is called without any
locking. It reads max_gen_ptrs, which is changed under pernet_ops_rwsem. It
is read twice, first to allocate an array, then to set s.len, which is
later used to limit the bounds of the array access.
It is possible that the array is allocated while another thread registering
a new pernet ops increments max_gen_ptrs, which is then used to set s.len
with a length larger than the one allocated for the variable array.
Fix it by reading max_gen_ptrs only once in net_alloc_generic. If
max_gen_ptrs is later incremented, it will be caught in net_assign_generic.
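A sketch of the described approach (close to, but not necessarily identical
to, the actual patch):

	static struct net_generic *net_alloc_generic(void)
	{
		unsigned int gen_ptrs = READ_ONCE(max_gen_ptrs); /* single read */
		unsigned int generic_size;
		struct net_generic *ng;

		generic_size = offsetof(struct net_generic, ptr[gen_ptrs]);

		ng = kzalloc(generic_size, GFP_KERNEL);
		if (ng)
			ng->s.len = gen_ptrs;	/* bounded by the same snapshot */

		return ng;
	}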
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Fixes: 073862ba5d24 ("netns: fix net_alloc_generic()")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240502132006.3430840-1-cascardo@igalia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This includes a major rework of thermal governors and part of the
thermal core interacting with them as well as some fixes and cleanups
of the thermal debug code:
- Redesign the thermal governor interface to allow the governors to
work in a more straightforward way.
- Make thermal governors take the current trip point thresholds into
account in their computations which allows trip hysteresis to be
observed more accurately.
- Clean up thermal governors.
- Make the thermal core manage passive polling for thermal zones and
remove passive polling management from thermal governors.
- Improve the handling of cooling device states and thermal mitigation
episodes in progress in the thermal debug code.
- Avoid excessive updates of trip point statistics and clean up the
printing of thermal mitigation episode information.
* thermal-core: (27 commits)
thermal: core: Move passive polling management to the core
thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid
thermal: trip: Add missing empty code line
thermal/debugfs: Avoid printing zero duration for mitigation events in progress
thermal/debugfs: Pass cooling device state to thermal_debug_cdev_add()
thermal/debugfs: Create records for cdev states as they get used
thermal: core: Introduce thermal_governor_trip_crossed()
thermal/debugfs: Make tze_seq_show() skip invalid trips and trips with no stats
thermal/debugfs: Rename thermal_debug_update_temp() to thermal_debug_update_trip_stats()
thermal/debugfs: Clean up thermal_debug_update_temp()
thermal/debugfs: Avoid excessive updates of trip point statistics
thermal: core: Relocate critical and hot trip handling
thermal: core: Drop the .throttle() governor callback
thermal: gov_user_space: Use .trip_crossed() instead of .throttle()
thermal: gov_fair_share: Eliminate unnecessary integer divisions
thermal: gov_fair_share: Use trip thresholds instead of trip temperatures
thermal: gov_fair_share: Use .manage() callback instead of .throttle()
thermal: gov_step_wise: Clean up thermal_zone_trip_update()
thermal: gov_step_wise: Use trip thresholds instead of trip temperatures
thermal: gov_step_wise: Use .manage() callback instead of .throttle()
...
|
|
All EV4 machines are already gone, and the remaining EV5 based machines
all support the slightly more modern EV56 generation as well.
Debian only supports EV56 and later.
Drop both of these and build kernels optimized for EV56 and higher
when the "generic" options is selected, tuning for an out-of-order
EV6 pipeline, same as Debian userspace.
Since this was the only supported architecture without 8-bit and
16-bit stores, common kernel code no longer has to worry about
aligning struct members, and existing workarounds from the block
and tty layers can be removed.
The alpha memory management code no longer needs an abstraction
for the differences between EV4 and EV5+.
Link: https://lists.debian.org/debian-alpha/2023/05/msg00009.html
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Felix Fietkau says:
====================
Add TCP fraglist GRO support
When forwarding TCP after GRO, software segmentation is very expensive,
especially when the checksum needs to be recalculated.
One case where that's currently unavoidable is when routing packets over
PPPoE. Performance improves significantly when using fraglist GRO
implemented in the same way as for UDP.
When NETIF_F_GRO_FRAGLIST is enabled, perform a lookup for an established
socket in the same netns as the receiving device. While this may not
cover all relevant use cases in multi-netns configurations, it should be
good enough for most configurations that need this.
Here's a measurement of running 2 TCP streams through a MediaTek MT7622
device (2-core Cortex-A53), which runs NAT with flow offload enabled from
one ethernet port to PPPoE on another ethernet port + cake qdisc set to
1Gbps.
rx-gro-list off: 630 Mbit/s, CPU 35% idle
rx-gro-list on: 770 Mbit/s, CPU 40% idle
Changes since v4:
- add likely() to prefer the non-fraglist path in check
Changes since v3:
- optimize __tcpv4_gso_segment_csum
- add unlikely()
- reorder dev_net/skb_gro_network_header calls after NETIF_F_GRO_FRAGLIST
check
- add support for ipv6 nat
- drop redundant pskb_may_pull check
Changes since v2:
- create tcp_gro_header_pull helper function to pull tcp header only once
- optimize __tcpv4_gso_segment_list_csum, drop obsolete flags check
Changes since v1:
- revert bogus tcp flags overwrite on segmentation
- fix kbuild issue with !CONFIG_IPV6
- only perform socket lookup for the first skb in the GRO train
Changes since RFC:
- split up patches
- handle TCP flags mutations
====================
Link: https://lore.kernel.org/r/20240502084450.44009-1-nbd@nbd.name
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When forwarding TCP after GRO, software segmentation is very expensive,
especially when the checksum needs to be recalculated.
One case where that's currently unavoidable is when routing packets over
PPPoE. Performance improves significantly when using fraglist GRO
implemented in the same way as for UDP.
When NETIF_F_GRO_FRAGLIST is enabled, perform a lookup for an established
socket in the same netns as the receiving device. While this may not
cover all relevant use cases in multi-netns configurations, it should be
good enough for most configurations that need this.
Here's a measurement of running 2 TCP streams through a MediaTek MT7622
device (2-core Cortex-A53), which runs NAT with flow offload enabled from
one ethernet port to PPPoE on another ethernet port + cake qdisc set to
1Gbps.
rx-gro-list off: 630 Mbit/s, CPU 35% idle
rx-gro-list on: 770 Mbit/s, CPU 40% idle
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Pull the code out of tcp_gro_receive in order to access the tcp header
from tcp4/6_gro_receive.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This pulls the flow port matching out of tcp_gro_receive, so that it can be
reused for the next change, which adds the TCP fraglist GRO heuristic.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This implements fraglist GRO similar to how it's handled in UDP, however
no functional changes are added yet. The next change adds a heuristic for
using fraglist GRO instead of regular GRO.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Preparation for adding TCP fraglist GRO support. It expects packets to be
combined in a similar way as UDP fraglist GSO packets.
For IPv4 packets, NAT is handled in the same way as UDP fraglist GSO.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This helper function will be used for TCP fraglist GRO support
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|