Age | Commit message (Collapse) | Author |
|
copy_fd_bitmaps(new, old, count) is expected to copy the first
count/BITS_PER_LONG bits from old->full_fds_bits[] and fill
the rest with zeroes. What it does is copying enough words
(BITS_TO_LONGS(count/BITS_PER_LONG)), then memsets the rest.
That works fine, *if* all bits past the cutoff point are
clear. Otherwise we are risking garbage from the last word
we'd copied.
For most of the callers that is true - expand_fdtable() has
count equal to old->max_fds, so there's no open descriptors
past count, let alone fully occupied words in ->open_fds[],
which is what bits in ->full_fds_bits[] correspond to.
The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds),
which is the smallest multiple of BITS_PER_LONG that covers all
opened descriptors below max_fds. In the common case (copying on
fork()) max_fds is ~0U, so all opened descriptors will be below
it and we are fine, by the same reasons why the call in expand_fdtable()
is safe.
Unfortunately, there is a case where max_fds is less than that
and where we might, indeed, end up with junk in ->full_fds_bits[] -
close_range(from, to, CLOSE_RANGE_UNSHARE) with
* descriptor table being currently shared
* 'to' being above the current capacity of descriptor table
* 'from' being just under some chunk of opened descriptors.
In that case we end up with observably wrong behaviour - e.g. spawn
a child with CLONE_FILES, get all descriptors in range 0..127 open,
then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending
up with descriptor #128, despite #64 being observably not open.
The minimally invasive fix would be to deal with that in dup_fd().
If this proves to add measurable overhead, we can go that way, but
let's try to fix copy_fd_bitmaps() first.
* new helper: bitmap_copy_and_expand(to, from, bits_to_copy, size).
* make copy_fd_bitmaps() take the bitmap size in words, rather than
bits; it's 'count' argument is always a multiple of BITS_PER_LONG,
so we are not losing any information, and that way we can use the
same helper for all three bitmaps - compiler will see that count
is a multiple of BITS_PER_LONG for the large ones, so it'll generate
plain memcpy()+memset().
Reproducer added to tools/testing/selftests/core/close_range_test.c
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Following helpers do not touch their struct net argument:
- bpf_sk_lookup_run_v4()
- inet_lookup_reuseport()
- inet_lhash2_lookup()
- inet_lookup_run_sk_lookup()
- __inet_lookup_listener()
- __inet_lookup_established()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240802134029.3748005-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When a reference counter is at zero and refcount_sub_and_test() is invoked
to subtract zero, the function accepts this request without any warning and
returns true. This behavior does not seem ideal because the counter being
already at zero indicates a use-after-free. Furthermore, returning true by
refcount_sub_and_test() in this case potentially results in a double-free
done by its caller.
Modify the underlying function __refcount_sub_and_test() to warn about this
case as a use-after-free and have it return false to avoid the potential
double-free.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240717130023.5675-1-petr.pavlu@suse.com
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
On almost all ASUS ROG series laptops the MCU used for the USB keyboard
also has a HID packet used for setting the brightness. This is usually
the same as the WMI method. But in some laptops the WMI method either
is missing or doesn't work, so we should default to the HID control.
Signed-off-by: Luke D. Jones <luke@ljones.dev>
Acked-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240713074733.77334-2-luke@ljones.dev
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
To allow a genpd provider for a CPU PM domain to enter a domain-idle-state
during s2idle on a PREEMPT_RT based configuration, we can't use the regular
spinlock, as they are turned into sleepable locks on PREEMPT_RT.
To address this problem, let's convert into using the raw spinlock, but
only for genpd providers that have the GENPD_FLAG_CPU_DOMAIN bit set. In
this way, the lock can still be acquired/released in atomic context, which
is needed in the idle-path for PREEMPT_RT.
Do note that the genpd power-on/off notifiers may also be fired during
s2idle, but these are already prepared for PREEMPT_RT as they are based on
the raw notifiers. However, consumers of them may need to adopt accordingly
to work properly on PREEMPT_RT.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Raghavendra Kakarla <quic_rkakarla@quicinc.com> # qcm6490 with PREEMPT_RT set
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20240527142557.321610-2-ulf.hansson@linaro.org
|
|
After the commit 66665ad2f102 ("tracing/kprobe: bpf: Compare instruction
pointer with original one"), "bpf_kprobe_override" is not used anywhere
anymore, and we can remove it now.
Link: https://lore.kernel.org/all/20240710085939.11520-1-dongml2@chinatelecom.cn/
Fixes: 66665ad2f102 ("tracing/kprobe: bpf: Compare instruction pointer with original one")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
The kernel sleep profile is no longer working due to a recursive locking
bug introduced by commit 42a20f86dc19 ("sched: Add wrapper for get_wchan()
to keep task blocked")
Booting with the 'profile=sleep' kernel command line option added or
executing
# echo -n sleep > /sys/kernel/profiling
after boot causes the system to lock up.
Lockdep reports
kthreadd/3 is trying to acquire lock:
ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: get_wchan+0x32/0x70
but task is already holding lock:
ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: try_to_wake_up+0x53/0x370
with the call trace being
lock_acquire+0xc8/0x2f0
get_wchan+0x32/0x70
__update_stats_enqueue_sleeper+0x151/0x430
enqueue_entity+0x4b0/0x520
enqueue_task_fair+0x92/0x6b0
ttwu_do_activate+0x73/0x140
try_to_wake_up+0x213/0x370
swake_up_locked+0x20/0x50
complete+0x2f/0x40
kthread+0xfb/0x180
However, since nobody noticed this regression for more than two years,
let's remove 'profile=sleep' support based on the assumption that nobody
needs this functionality.
Fixes: 42a20f86dc19 ("sched: Add wrapper for get_wchan() to keep task blocked")
Cc: stable@vger.kernel.org # v5.16+
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add IS_ENABLED(CONFIG_I2C) to the conditional around a bunch of ACPI
functions.
The conditional around these functions depended only on CONFIG_ACPI.
But the functions are implemented in I2C core, so are only present if
CONFIG_I2C is enabled.
Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
|
|
linkmode_andnot() simply returns the result of bitmap_andnot().
And the return type of bitmap_andnot() is bool.
So it makes sense for the return type of linkmode_andnot()
to also be bool.
I checked all call-sites and they either ignore the return
value or treat it as a bool.
Compile tested only.
Link: https://lore.kernel.org/netdev/68088998-4486-4930-90a4-96a32f08c490@lunn.ch/
Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240801-linkfield-bowl-v1-1-d58f68967802@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We re-define values of enum netdev_priv_flags as preprocessor
macros with the same name. I guess this was done to avoid breaking
out of tree modules which may use #ifdef X for kernel compatibility?
Commit 7aa98047df95 ("net: move net_device priv_flags out from UAPI")
which added the enum doesn't say. In any case, the flags with defines
are quite old now, and defines for new flags don't get added.
OOT drivers have to resort to code greps for compat detection, anyway.
Let's delete these defines, save LoC, help LXR link to the right place.
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20240801163401.378723-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Merge series from Richard Fitzgerald <rf@opensource.cirrus.com>:
Provide empty versions of acpi_spi_count_resources(),
acpi_spi_device_alloc() and acpi_spi_find_controller_by_adev()
if the real functions are not being built.
This commit fixes two problems with the original definitions:
1) There wasn't an empty version of these functions
2) The #if only depended on CONFIG_ACPI. But the functions are implemented
in the core spi.c so CONFIG_SPI_MASTER must also be enabled for the real
functions to exist.
|
|
Provide empty versions of acpi_spi_count_resources(),
acpi_spi_device_alloc() and acpi_spi_find_controller_by_adev()
if the real functions are not being built.
This commit fixes two problems with the original definitions:
1) There wasn't an empty version of these functions
2) The #if only depended on CONFIG_ACPI. But the functions are implemented
in the core spi.c so CONFIG_SPI_MASTER must also be enabled for the real
functions to exist.
Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com>
Link: https://patch.msgid.link/20240802152215.20831-2-rf@opensource.cirrus.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Pull kvm updates from Paolo Bonzini:
"The bulk of the changes here is a largish change to guest_memfd,
delaying the clearing and encryption of guest-private pages until they
are actually added to guest page tables. This started as "let's make
it impossible to misuse the API" for SEV-SNP; but then it ballooned a
bit.
The new logic is generally simpler and more ready for hugepage support
in guest_memfd.
Summary:
- fix latent bug in how usage of large pages is determined for
confidential VMs
- fix "underline too short" in docs
- eliminate log spam from limited APIC timer periods
- disallow pre-faulting of memory before SEV-SNP VMs are initialized
- delay clearing and encrypting private memory until it is added to
guest page tables
- this change also enables another small cleanup: the checks in
SNP_LAUNCH_UPDATE that limit it to non-populated, private pages can
now be moved in the common kvm_gmem_populate() function
- fix compilation error that the RISC-V merge introduced in selftests"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/mmu: fix determination of max NPT mapping level for private pages
KVM: riscv: selftests: Fix compile error
KVM: guest_memfd: abstract how prepared folios are recorded
KVM: guest_memfd: let kvm_gmem_populate() operate only on private gfns
KVM: extend kvm_range_has_memory_attributes() to check subset of attributes
KVM: cleanup and add shortcuts to kvm_range_has_memory_attributes()
KVM: guest_memfd: move check for already-populated page to common code
KVM: remove kvm_arch_gmem_prepare_needed()
KVM: guest_memfd: make kvm_gmem_prepare_folio() operate on a single struct kvm
KVM: guest_memfd: delay kvm_gmem_prepare_folio() until the memory is passed to the guest
KVM: guest_memfd: return locked folio from __kvm_gmem_get_pfn
KVM: rename CONFIG_HAVE_KVM_GMEM_* to CONFIG_HAVE_KVM_ARCH_GMEM_*
KVM: guest_memfd: do not go through struct page
KVM: guest_memfd: delay folio_mark_uptodate() until after successful preparation
KVM: guest_memfd: return folio from __kvm_gmem_get_pfn()
KVM: x86: disallow pre-fault for SNP VMs before initialization
KVM: Documentation: Fix title underline too short warning
KVM: x86: Eliminate log spam from limited APIC timer periods
|
|
* fix latent bug in how usage of large pages is determined for
confidential VMs
* fix "underline too short" in docs
* eliminate log spam from limited APIC timer periods
* disallow pre-faulting of memory before SEV-SNP VMs are initialized
* delay clearing and encrypting private memory until it is added to
guest page tables
* this change also enables another small cleanup: the checks in
SNP_LAUNCH_UPDATE that limit it to non-populated, private pages
can now be moved in the common kvm_gmem_populate() function
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix to avoid dropping some of the internal pseudo-extensions, which
breaks *envcfg dependency parsing
- The kernel entry address is now aligned in purgatory, which avoids a
misaligned load that can lead to crash on systems that don't support
misaligned accesses early in boot
- The FW_SFENCE_VMA_RECEIVED perf event was duplicated in a handful of
perf JSON configurations, one of them been updated to
FW_SFENCE_VMA_ASID_SENT
- The starfive cache driver is now restricted to 64-bit systems, as it
isn't 32-bit clean
- A fix for to avoid aliasing legacy-mode perf counters with software
perf counters
- VM_FAULT_SIGSEGV is now handled in the page fault code
- A fix for stalls during CPU hotplug due to IPIs being disabled
- A fix for memblock bounds checking. This manifests as a crash on
systems with discontinuous memory maps that have regions that don't
fit in the linear map
* tag 'riscv-for-linus-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Fix linear mapping checks for non-contiguous memory regions
RISC-V: Enable the IPI before workqueue_online_cpu()
riscv/mm: Add handling for VM_FAULT_SIGSEGV in mm_fault_error()
perf: riscv: Fix selecting counters in legacy mode
cache: StarFive: Require a 64-bit system
perf arch events: Fix duplicate RISC-V SBI firmware event name
riscv/purgatory: align riscv_kernel_entry
riscv: cpufeature: Do not drop Linux-internal extensions
|
|
According to the data sheet, writing the MODE register should stop the
counter (and thus the interrupts). This appears to work on real hardware,
at least modern Intel and AMD systems. It should also work on Hyper-V.
However, on some buggy virtual machines the mode change doesn't have any
effect until the counter is subsequently loaded (or perhaps when the IRQ
next fires).
So, set MODE 0 and then load the counter, to ensure that those buggy VMs
do the right thing and the interrupts stop. And then write MODE 0 *again*
to stop the counter on compliant implementations too.
Apparently, Hyper-V keeps firing the IRQ *repeatedly* even in mode zero
when it should only happen once, but the second MODE write stops that too.
Userspace test program (mostly written by tglx):
=====
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/io.h>
static __always_inline void __out##bwl(type value, uint16_t port) \
{ \
asm volatile("out" #bwl " %" #bw "0, %w1" \
: : "a"(value), "Nd"(port)); \
} \
\
static __always_inline type __in##bwl(uint16_t port) \
{ \
type value; \
asm volatile("in" #bwl " %w1, %" #bw "0" \
: "=a"(value) : "Nd"(port)); \
return value; \
}
BUILDIO(b, b, uint8_t)
#define inb __inb
#define outb __outb
#define PIT_MODE 0x43
#define PIT_CH0 0x40
#define PIT_CH2 0x42
static int is8254;
static void dump_pit(void)
{
if (is8254) {
// Latch and output counter and status
outb(0xC2, PIT_MODE);
printf("%02x %02x %02x\n", inb(PIT_CH0), inb(PIT_CH0), inb(PIT_CH0));
} else {
// Latch and output counter
outb(0x0, PIT_MODE);
printf("%02x %02x\n", inb(PIT_CH0), inb(PIT_CH0));
}
}
int main(int argc, char* argv[])
{
int nr_counts = 2;
if (argc > 1)
nr_counts = atoi(argv[1]);
if (argc > 2)
is8254 = 1;
if (ioperm(0x40, 4, 1) != 0)
return 1;
dump_pit();
printf("Set oneshot\n");
outb(0x38, PIT_MODE);
outb(0x00, PIT_CH0);
outb(0x0F, PIT_CH0);
dump_pit();
usleep(1000);
dump_pit();
printf("Set periodic\n");
outb(0x34, PIT_MODE);
outb(0x00, PIT_CH0);
outb(0x0F, PIT_CH0);
dump_pit();
usleep(1000);
dump_pit();
dump_pit();
usleep(100000);
dump_pit();
usleep(100000);
dump_pit();
printf("Set stop (%d counter writes)\n", nr_counts);
outb(0x30, PIT_MODE);
while (nr_counts--)
outb(0xFF, PIT_CH0);
dump_pit();
usleep(100000);
dump_pit();
usleep(100000);
dump_pit();
printf("Set MODE 0\n");
outb(0x30, PIT_MODE);
dump_pit();
usleep(100000);
dump_pit();
usleep(100000);
dump_pit();
return 0;
}
=====
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhkelley@outlook.com>
Link: https://lore.kernel.org/all/20240802135555.564941-2-dwmw2@infradead.org
|
|
Leaving the PIT interrupt running can cause noticeable steal time for
virtual guests. The VMM generally has a timer which toggles the IRQ input
to the PIC and I/O APIC, which takes CPU time away from the guest. Even
on real hardware, running the counter may use power needlessly (albeit
not much).
Make sure it's turned off if it isn't going to be used.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhkelley@outlook.com>
Link: https://lore.kernel.org/all/20240802135555.564941-1-dwmw2@infradead.org
|
|
The current LATENCY_MULTIPLIER which has been around for nearly 20 years
causes rate_limit_us to be always in ms range.
On M1 mac mini I get 50 and 56us transition latency, but due to the 1000
multiplier we end up setting rate_limit_us to 50 and 56ms, which gets
capped into 2ms and was 10ms before e13aa799c2a6 ("cpufreq: Change
default transition delay to 2ms")
On Intel I5 system transition latency is 20us but due to the multiplier
we end up with 20ms that again is capped to 2ms.
Given how good modern hardware and how modern workloads require systems
to be more responsive to cater for sudden changes in workload (tasks
sleeping/wakeup/migrating, uclamp causing a sudden boost or cap) and
that 2ms is quarter of the time of 120Hz refresh rate system, drop the
old logic in favour of providing 50% headroom.
rate_limit_us = 1.5 * latency.
I considered not adding any headroom which could mean that we can end up
with infinite back-to-back requests.
I also considered providing a constant headroom (e.g: 100us) assuming
that any h/w or f/w dealing with the request shouldn't require a large
headroom when transition_latency is actually high.
But for both cases I wasn't sure if h/w or f/w can end up being
overwhelmed dealing with the freq requests in a potentially busy system.
So I opted for providing 50% breathing room.
This is expected to impact schedutil only as the other user,
dbs_governor, takes the max(2*tick, transition_delay_us) and the former
was at least 2ms on 1ms TICK, which is equivalent to the max_delay_us
before applying this patch. For systems with TICK of 4ms, this value
would have almost always ended up with 8ms sampling rate.
For systems that report 0 transition latency, we still default to
returning 1ms as transition delay.
This helps in eliminating a source of latency for applying requests as
mentioned in [1]. For example if we have a 1ms tick, most systems will
miss sending an update at tick when updating the util_avg for a task/CPU
(rate_limit_us will be 2ms for most systems).
Link: https://lore.kernel.org/lkml/20240724212255.mfr2ybiv2j2uqek7@airbuntu/ # [1]
Link: https://lore.kernel.org/lkml/20240205022500.2232124-1-qyousef@layalina.io/
Signed-off-by: Qais Yousef <qyousef@layalina.io>
Link: https://patch.msgid.link/20240728192659.58115-1-qyousef@layalina.io
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
There are no more callers of thermal_zone_get_trip() in the tree, so
drop it.
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/2220301.Mh6RI2rZIc@rjwysocki.net
|
|
The only existing caller of thermal_zone_get_num_trips(), which is
rcar_gen3_thermal_probe(), uses this function to put the number of
trip points into a kernel log message, but this information is also
available from the thermal sysfs interface.
For this reason, remove the thermal_zone_get_num_trips() call from
rcar_gen3_thermal_probe() and drop the former altogether.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/2636988.Lt9SDvczpP@rjwysocki.net
|
|
This helper will allow HID drivers to easily determine if they should
bind to a hid_device by checking for the prescence of a certain field
when its ID is not enough, which can be the case on USB devices with
multiple interfaces and/or configurations.
Convert google-hammer driver to use it, and remove now superfluous
hammer_has_usage().
[jkosina@suse.com: expand changelog with the information about
google-hammer being added as user of this API ]
Signed-off-by: Kerem Karabay <kekrby@gmail.com>
Signed-off-by: Aditya Garg <gargaditya08@live.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
|
|
This way uprobe_unregister() and uprobe_apply() can use "struct uprobe *"
rather than inode + offset. This simplifies the code and allows to avoid
the unnecessary find_uprobe() + put_uprobe() in these functions.
TODO: uprobe_unregister() still needs get_uprobe/put_uprobe to ensure that
this uprobe can't be freed before up_write(&uprobe->register_rwsem).
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240801132734.GA8803@redhat.com
|
|
It doesn't make any sense to have 2 versions of _register(). Note that
trace_uprobe_enable(), the only user of uprobe_register(), doesn't need
to check tu->ref_ctr_offset to decide which one should be used, it could
safely pass ref_ctr_offset == 0 to uprobe_register_refctr().
Add this argument to uprobe_register(), update the callers, and kill
uprobe_register_refctr().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240801132728.GA8800@redhat.com
|
|
When tracing user functions with uprobe functionality, it's common to
install the probe (e.g., a BPF program) at the first instruction of the
function. This is often going to be `push %rbp` instruction in function
preamble, which means that within that function frame pointer hasn't
been established yet. This leads to consistently missing an actual
caller of the traced function, because perf_callchain_user() only
records current IP (capturing traced function) and then following frame
pointer chain (which would be caller's frame, containing the address of
caller's caller).
So when we have target_1 -> target_2 -> target_3 call chain and we are
tracing an entry to target_3, captured stack trace will report
target_1 -> target_3 call chain, which is wrong and confusing.
This patch proposes a x86-64-specific heuristic to detect `push %rbp`
(`push %ebp` on 32-bit architecture) instruction being traced. Given
entire kernel implementation of user space stack trace capturing works
under assumption that user space code was compiled with frame pointer
register (%rbp/%ebp) preservation, it seems pretty reasonable to use
this instruction as a strong indicator that this is the entry to the
function. In that case, return address is still pointed to by %rsp/%esp,
so we fetch it and add to stack trace before proceeding to unwind the
rest using frame pointer-based logic.
We also check for `endbr64` (for 64-bit modes) as another common pattern
for function entry, as suggested by Josh Poimboeuf. Even if we get this
wrong sometimes for uprobes attached not at the function entry, it's OK
because stack trace will still be overall meaningful, just with one
extra bogus entry. If we don't detect this, we end up with guaranteed to
be missing caller function entry in the stack trace, which is worse
overall.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240729175223.23914-1-andrii@kernel.org
|
|
This change allows events to use PERF_SAMPLE_READ with inherit
so long as PERF_SAMPLE_TID is also set. This enables sample based
profiling of a group of counters over a hierarchy of processes or
threads. This is useful, for example, for collecting per-thread
counters/metrics, event based sampling of multiple counters as a unit,
access to the enabled and running time when using multiplexing and so
on.
Prior to this, users were restricted to either collecting aggregate
statistics for a multi-threaded/-process application (e.g. with
"perf stat"), or to sample individual threads, or to profile the entire
system (which requires root or CAP_PERFMON, and may produce much more
data than is required). Theoretically a tool could poll for or otherwise
monitor thread/process creation and construct whatever events the user
is interested in using perf_event_open, for each new thread or process,
but this is racy, can lead to file-descriptor exhaustion, and ultimately
just replicates the behaviour of inherit, but in userspace.
This configuration differs from inherit without PERF_SAMPLE_READ in that
the accumulated event count, and consequently any sample (such as if
triggered by overflow of sample_period) will be on a per-thread rather
than on an aggregate basis.
The meaning of read_format::value field of both PERF_RECORD_READ and
PERF_RECORD_SAMPLE is changed such that if the sampled event uses this
new configuration then the values reported will be per-thread rather
than the global aggregate value. This is a change from the existing
semantics of read_format (where PERF_SAMPLE_READ is used without
inherit), but it is necessary to expose the per-thread counter values,
and it avoids reinventing a separate "read_format_thread" field that
otherwise replicates the same behaviour. This change should not break
existing tools, since this configuration was not previously valid and
was rejected by the kernel. Tools that opt into this new mode will need
to account for this when calculating the counter delta for a given
sample. Tools that wish to have both the per-thread and aggregate value
can perform the global aggregation themselves from the per-thread
values.
The change to read_format::value does not affect existing valid
perf_event_attr configurations, nor does it change the behaviour of
calls to "read" on an event descriptor. Both continue to report the
aggregate value for the entire thread/process hierarchy. The difference
between the results reported by "read" and PERF_RECORD_SAMPLE in this
new configuration is justified on the basis that it is not (easily)
possible for "read" to target a specific thread (the caller only has
the fd for the original parent event).
Signed-off-by: Ben Gainey <ben.gainey@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240730084417.7693-3-ben.gainey@arm.com
|
|
nr_pending counts the number of events in the context that
either pending_sigtrap or pending_work, but it is used
to prevent taking the fast path in perf_event_context_sched_out.
Renamed to reflect what it is used for, rather than what it
counts. This change allows using the field to track other
event properties that also require skipping the fast path
without possible confusion over the name.
Signed-off-by: Ben Gainey <ben.gainey@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240730084417.7693-2-ben.gainey@arm.com
|
|
Introduce support for virtio_transport_unsent_bytes
ioctl for virtio_transport, vhost_vsock and vsock_loopback.
For all transports the unsent bytes counter is incremented
in virtio_transport_get_credit.
In virtio_transport (G2H) and in vhost-vsock (H2G) the counter
is decremented when the skbuff is consumed. In vsock_loopback the
same skbuff is passed from the transmitter to the receiver, so
the counter is decremented before queuing the skbuff to the
receiver.
Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In ata_dev_print_quirks(), return early if ata_dev_print_info() returns
false or if we already printed quirk information. This is to avoid
printing a device quirks multiple times (that is, each time
ata_dev_revalidate() is called).
To remember if ata_dev_print_quirks() was already executed, define the
EH context flag ATA_EHI_DID_PRINT_QUIRKS and set this flag in
ata_dev_print_quirks().
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: 58157d607aec ("ata: libata: Print quirks applied to devices")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
|
|
The function ata_noop_qc_prep(), as its name implies, does nothing and
simply returns AC_ERR_OK. For drivers that do not need any special
preparations of queued commands, we can avoid having to define struct
ata_port qc_prep operation by simply testing if that operation is
defined or not in ata_qc_issue(). Make this change and remove
ata_noop_qc_prep().
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts or adjacent changes.
Link: https://patch.msgid.link/20240801131917.34494-1-pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from wireless, bleutooth, BPF and netfilter.
Current release - regressions:
- core: drop bad gso csum_start and offset in virtio_net_hdr
- wifi: mt76: fix null pointer access in mt792x_mac_link_bss_remove
- eth: tun: add missing bpf_net_ctx_clear() in do_xdp_generic()
- phy: aquantia: only poll GLOBAL_CFG regs on aqr113, aqr113c and
aqr115c
Current release - new code bugs:
- smc: prevent UAF in inet_create()
- bluetooth: btmtk: fix kernel crash when entering btmtk_usb_suspend
- eth: bnxt: reject unsupported hash functions
Previous releases - regressions:
- sched: act_ct: take care of padding in struct zones_ht_key
- netfilter: fix null-ptr-deref in iptable_nat_table_init().
- tcp: adjust clamping window for applications specifying SO_RCVBUF
Previous releases - always broken:
- ethtool: rss: small fixes to spec and GET
- mptcp:
- fix signal endpoint re-add
- pm: fix backup support in signal endpoints
- wifi: ath12k: fix soft lockup on suspend
- eth: bnxt_en: fix RSS logic in __bnxt_reserve_rings()
- eth: ice: fix AF_XDP ZC timeout and concurrency issues
- eth: mlx5:
- fix missing lock on sync reset reload
- fix error handling in irq_pool_request_irq"
* tag 'net-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
mptcp: fix duplicate data handling
mptcp: fix bad RCVPRUNED mib accounting
ipv6: fix ndisc_is_useropt() handling for PIO
igc: Fix double reset adapter triggered from a single taprio cmd
net: MAINTAINERS: Demote Qualcomm IPA to "maintained"
net: wan: fsl_qmc_hdlc: Discard received CRC
net: wan: fsl_qmc_hdlc: Convert carrier_lock spinlock to a mutex
net/mlx5e: Add a check for the return value from mlx5_port_set_eth_ptys
net/mlx5e: Fix CT entry update leaks of modify header context
net/mlx5e: Require mlx5 tc classifier action support for IPsec prio capability
net/mlx5: Fix missing lock on sync reset reload
net/mlx5: Lag, don't use the hardcoded value of the first port
net/mlx5: DR, Fix 'stack guard page was hit' error in dr_rule
net/mlx5: Fix error handling in irq_pool_request_irq
net/mlx5: Always drain health in shutdown callback
net: Add skbuff.h to MAINTAINERS
r8169: don't increment tx_dropped in case of NETDEV_TX_BUSY
netfilter: iptables: Fix potential null-ptr-deref in ip6table_nat_table_init().
netfilter: iptables: Fix null-ptr-deref in iptable_nat_table_init().
net: drop bad gso csum_start and offset in virtio_net_hdr
...
|
|
Sometimes the hotplug cpu stalls at the arch_cpu_idle() for a while after
workqueue_online_cpu(). When cpu stalls at the idle loop, the reschedule
IPI is pending. However the enable bit is not enabled yet so the cpu stalls
at WFI until watchdog timeout. Therefore enable the IPI before the
workqueue_online_cpu() to fix the issue.
Fixes: 63c5484e7495 ("workqueue: Add multiple affinity scopes and interface to select them")
Signed-off-by: Nick Hu <nick.hu@sifive.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20240717031714.1946036-1-nick.hu@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
Add support for vivobook fan profiles wmi call on the ASUS VIVOBOOK
to adjust power limits.
These fan profiles have a different device id than the ROG series
and different order. This reorders the existing modes.
As part of keeping the patch clean the throttle_thermal_policy_available
boolean stored in the driver struct is removed and
throttle_thermal_policy_dev is used in place (as on init it is zeroed).
Co-developed-by: Luke D. Jones <luke@ljones.dev>
Signed-off-by: Luke D. Jones <luke@ljones.dev>
Signed-off-by: Mohamed Ghanmi <mohamed.ghanmi@supcom.tn>
Reviewed-by: Luke D. Jones <luke@ljones.dev>
Link: https://lore.kernel.org/r/20240609144849.2532-2-mohamed.ghanmi@supcom.tn
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
The ADP5585 is a 10/11 input/output port expander with a built in keypad
matrix decoder, programmable logic, reset generator, and PWM generator.
This driver supports the chip by modelling it as an MFD device, with two
child devices for the GPIO and PWM functions.
The driver is derived from an initial implementation from NXP, available
in commit 8059835bee19 ("MLK-25917-1 mfd: adp5585: add ADI adp5585 core
support") in their BSP kernel tree. It has been extensively rewritten.
Signed-off-by: Haibo Chen <haibo.chen@nxp.com>
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://lore.kernel.org/r/20240722121100.2855-3-laurent.pinchart@ideasonboard.com
Signed-off-by: Lee Jones <lee@kernel.org>
|
|
Platform Runtime Mechanism (PRM) handlers can be invoked from either the AML
interpreter or directly by an OS driver. Implement the latter.
[ bp: Massage commit message. ]
Signed-off-by: John Allen <john.allen@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20240730151731.15363-2-john.allen@amd.com
|
|
This patch modifies the skb_unref function to skip the early return
optimization when CONFIG_DEBUG_NET is enabled. The change ensures that
the reference count decrement always occurs in debug builds, allowing
for more thorough checking of SKB reference counting.
Previously, when the SKB's reference count was 1 and CONFIG_DEBUG_NET
was not set, the function would return early after a memory barrier
(smp_rmb()) without decrementing the reference count. This optimization
assumes it's safe to proceed with freeing the SKB without the overhead
of an atomic decrement from 1 to 0.
With this change:
- In non-debug builds (CONFIG_DEBUG_NET not set), behavior remains
unchanged, preserving the performance optimization.
- In debug builds (CONFIG_DEBUG_NET set), the reference count is always
decremented, even when it's 1, allowing for consistent behavior and
potentially catching subtle SKB management bugs.
This modification enhances debugging capabilities for networking code
without impacting performance in production kernels. It helps kernel
developers identify and diagnose issues related to SKB management and
reference counting in the network stack.
Cc: Chris Mason <clm@fb.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240729104741.370327-1-leitao@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
To be consistent with most LSM hooks, convert the return value of
hook vm_enough_memory to 0 or a negative error code.
Before:
- Hook vm_enough_memory returns 1 if permission is granted, 0 if not.
- LSM_RET_DEFAULT(vm_enough_memory_mm) is 1.
After:
- Hook vm_enough_memory reutrns 0 if permission is granted, negative
error code if not.
- LSM_RET_DEFAULT(vm_enough_memory_mm) is 0.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
Platform data is not used anywhere in the upstram kernel.
Drop support for it to simplify code maintenance.
Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to
help manage different structures in various cgroup subsystems by being
an embedded element inside a larger structure like cpuset or mem_cgroup.
The /proc/cgroups file shows the number of cgroups for each of the
subsystems. With cgroup v1, the number of CSSes is the same as the
number of cgroups. That is not the case anymore with cgroup v2. The
/proc/cgroups file cannot show the actual number of CSSes for the
subsystems that are bound to cgroup v2.
So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup),
we can't tell by looking at /proc/cgroups which cgroup subsystems may
be responsible.
As cgroup v2 had deprecated the use of /proc/cgroups, the hierarchical
cgroup.stat file is now being extended to show the number of live and
dying CSSes associated with all the non-inhibited cgroup subsystems that
have been bound to cgroup v2. The number includes CSSes in the current
cgroup as well as in all the descendants underneath it. This will help
us pinpoint which subsystems are responsible for the increasing number
of dying (nr_dying_descendants) cgroups.
The CSSes dying counts are stored in the cgroup structure itself
instead of inside the CSS as suggested by Johannes. This will allow
us to accurately track dying counts of cgroup subsystems that have
recently been disabled in a cgroup. It is now possible that a zero
subsystem number is coupled with a non-zero dying subsystem number.
The cgroup-v2.rst file is updated to discuss this new behavior.
With this patch applied, a sample output from root cgroup.stat file
was shown below.
nr_descendants 56
nr_subsys_cpuset 1
nr_subsys_cpu 43
nr_subsys_io 43
nr_subsys_memory 56
nr_subsys_perf_event 57
nr_subsys_hugetlb 1
nr_subsys_pids 56
nr_subsys_rdma 1
nr_subsys_misc 1
nr_dying_descendants 30
nr_dying_subsys_cpuset 0
nr_dying_subsys_cpu 0
nr_dying_subsys_io 0
nr_dying_subsys_memory 30
nr_dying_subsys_perf_event 0
nr_dying_subsys_hugetlb 0
nr_dying_subsys_pids 0
nr_dying_subsys_rdma 0
nr_dying_subsys_misc 0
Another sample output from system.slice/cgroup.stat was:
nr_descendants 34
nr_subsys_cpuset 0
nr_subsys_cpu 32
nr_subsys_io 32
nr_subsys_memory 34
nr_subsys_perf_event 35
nr_subsys_hugetlb 0
nr_subsys_pids 34
nr_subsys_rdma 0
nr_subsys_misc 0
nr_dying_descendants 30
nr_dying_subsys_cpuset 0
nr_dying_subsys_cpu 0
nr_dying_subsys_io 0
nr_dying_subsys_memory 30
nr_dying_subsys_perf_event 0
nr_dying_subsys_hugetlb 0
nr_dying_subsys_pids 0
nr_dying_subsys_rdma 0
nr_dying_subsys_misc 0
Note that 'debug' controller wasn't used to provide this information because
the controller is not recommended in productions kernels, also many of them
won't enable CONFIG_CGROUP_DEBUG by default.
Similar information could be retrieved with debuggers like drgn but that's
also not always available (e.g. lockdown) and the additional cost of runtime
tracking here is deemed marginal.
tj: Added Michal's paragraphs on why this is not added the debug controller
to the commit message.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20240715150034.2583772-1-longman@redhat.com
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
David Laight pointed out that we should deal with the min3() and max3()
mess too, which still does excessive expansion.
And our current macros are actually rather broken.
In particular, the macros did this:
#define min3(x, y, z) min((typeof(x))min(x, y), z)
#define max3(x, y, z) max((typeof(x))max(x, y), z)
and that not only is a nested expansion of possibly very complex
arguments with all that involves, the typing with that "typeof()" cast
is completely wrong.
For example, imagine what happens in max3() if 'x' happens to be a
'unsigned char', but 'y' and 'z' are 'unsigned long'. The types are
compatible, and there's no warning - but the result is just random
garbage.
No, I don't think we've ever hit that issue in practice, but since we
now have sane infrastructure for doing this right, let's just use it.
It fixes any excessive expansion, and also avoids these kinds of broken
type issues.
Requested-by: David Laight <David.Laight@aculab.com>
Acked-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The kernel-doc headers for list_replace_rcu() and hlist_replace_rcu()
claim that the replacement is atomic, which it is, but only for readers.
Avoid confusion by making it clear that the atomic nature of these
functions applies only to readers, not to concurrent updaters.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
|
|
draft-ietf-6man-pio-pflag is adding a new flag to the Prefix Information
Option to signal that the network can allocate a unique IPv6 prefix per
client via DHCPv6-PD (see draft-ietf-v6ops-dhcp-pd-per-device).
When ra_honor_pio_pflag is enabled, the presence of a P-flag causes
SLAAC autoconfiguration to be disabled for that particular PIO.
An automated test has been added in Android (r.android.com/3195335) to
go along with this change.
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: David Lamparter <equinox@opensourcerouting.org>
Cc: Simon Horman <horms@kernel.org>
Signed-off-by: Patrick Rohr <prohr@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Tighten csum_start and csum_offset checks in virtio_net_hdr_to_skb
for GSO packets.
The function already checks that a checksum requested with
VIRTIO_NET_HDR_F_NEEDS_CSUM is in skb linear. But for GSO packets
this might not hold for segs after segmentation.
Syzkaller demonstrated to reach this warning in skb_checksum_help
offset = skb_checksum_start_offset(skb);
ret = -EINVAL;
if (WARN_ON_ONCE(offset >= skb_headlen(skb)))
By injecting a TSO packet:
WARNING: CPU: 1 PID: 3539 at net/core/dev.c:3284 skb_checksum_help+0x3d0/0x5b0
ip_do_fragment+0x209/0x1b20 net/ipv4/ip_output.c:774
ip_finish_output_gso net/ipv4/ip_output.c:279 [inline]
__ip_finish_output+0x2bd/0x4b0 net/ipv4/ip_output.c:301
iptunnel_xmit+0x50c/0x930 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x2296/0x2c70 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x759/0xa60 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4850 [inline]
netdev_start_xmit include/linux/netdevice.h:4864 [inline]
xmit_one net/core/dev.c:3595 [inline]
dev_hard_start_xmit+0x261/0x8c0 net/core/dev.c:3611
__dev_queue_xmit+0x1b97/0x3c90 net/core/dev.c:4261
packet_snd net/packet/af_packet.c:3073 [inline]
The geometry of the bad input packet at tcp_gso_segment:
[ 52.003050][ T8403] skb len=12202 headroom=244 headlen=12093 tailroom=0
[ 52.003050][ T8403] mac=(168,24) mac_len=24 net=(192,52) trans=244
[ 52.003050][ T8403] shinfo(txflags=0 nr_frags=1 gso(size=1552 type=3 segs=0))
[ 52.003050][ T8403] csum(0x60000c7 start=199 offset=1536
ip_summed=3 complete_sw=0 valid=0 level=0)
Mitigate with stricter input validation.
csum_offset: for GSO packets, deduce the correct value from gso_type.
This is already done for USO. Extend it to TSO. Let UFO be:
udp[46]_ufo_fragment ignores these fields and always computes the
checksum in software.
csum_start: finding the real offset requires parsing to the transport
header. Do not add a parser, use existing segmentation parsing. Thanks
to SKB_GSO_DODGY, that also catches bad packets that are hw offloaded.
Again test both TSO and USO. Do not test UFO for the above reason, and
do not test UDP tunnel offload.
GSO packet are almost always CHECKSUM_PARTIAL. USO packets may be
CHECKSUM_NONE since commit 10154dbded6d6 ("udp: Allow GSO transmit
from devices with no checksum offload"), but then still these fields
are initialized correctly in udp4_hwcsum/udp6_hwcsum_outgoing. So no
need to test for ip_summed == CHECKSUM_PARTIAL first.
This revises an existing fix mentioned in the Fixes tag, which broke
small packets with GSO offload, as detected by kselftests.
Link: https://syzkaller.appspot.com/bug?extid=e1db31216c789f552871
Link: https://lore.kernel.org/netdev/20240723223109.2196886-1-kuba@kernel.org
Fixes: e269d79c7d35 ("net: missing check virtio")
Cc: stable@vger.kernel.org
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240729201108.1615114-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch implements a union-find data structure in the kernel library,
which includes operations for allocating nodes, freeing nodes,
finding the root of a node, and merging two nodes.
Signed-off-by: Xavier <xavier_qy@163.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Since the SLAB implementation was removed in v6.8, so the
cpuset_slab_spread_rotor is no longer used and can be removed.
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
This clarifies the rules for min()/max()/clamp() type checking and makes
them a much more efficient macro expansion.
In particular, we now look at the type and range of the inputs to see
whether they work together, generating a mask of acceptable comparisons,
and then just verifying that the inputs have a shared case:
- an expression with a signed type can be used for
(1) signed comparisons
(2) unsigned comparisons if it is statically known to have a
non-negative value
- an expression with an unsigned type can be used for
(3) unsigned comparison
(4) signed comparisons if the type is smaller than 'int' and thus
the C integer promotion rules will make it signed anyway
Here rule (1) and (3) are obvious, and rule (2) is important in order to
allow obvious trivial constants to be used together with unsigned
values.
Rule (4) is not necessarily a good idea, but matches what we used to do,
and we have extant cases of this situation in the kernel. Notably with
bcachefs having an expression like
min(bch2_bucket_sectors_dirty(a), ca->mi.bucket_size)
where bch2_bucket_sectors_dirty() returns an 's64', and
'ca->mi.bucket_size' is of type 'u16'.
Technically that bcachefs comparison is clearly sensible on a C type
level, because the 'u16' will go through the normal C integer promotion,
and become 'int', and then we're comparing two signed values and
everything looks sane.
However, it's not entirely clear that a 'min(s64,u16)' operation makes a
lot of conceptual sense, and it's possible that we will remove rule (4).
After all, the _reason_ we have these complicated type checks is exactly
that the C type promotion rules are not very intuitive.
But at least for now the rule is in place for backwards compatibility.
Also note that rule (2) existed before, but is hugely relaxed by this
commit. It used to be true only for the simplest compile-time
non-negative integer constants. The new macro model will allow cases
where the compiler can trivially see that an expression is non-negative
even if it isn't necessarily a constant.
For example, the amdgpu driver does
min_t(size_t, sizeof(fru_info->serial), pia[addr] & 0x3F));
because our old 'min()' macro would see that 'pia[addr] & 0x3F' is of
type 'int' and clearly not a C constant expression, so doing a 'min()'
with a 'size_t' is a signedness violation.
Our new 'min()' macro still sees that 'pia[addr] & 0x3F' is of type
'int', but is smart enough to also see that it is clearly non-negative,
and thus would allow that case without any complaints.
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: David Laight <David.Laight@aculab.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Introduce the function ata_dev_print_quirks() to print the quirk flags
that will be applied to a scanned device. This new function is called
from ata_dev_quirks() when a match on a device model or device model
and revision is found for a device in the __ata_dev_quirks array.
To implement this function, the ATA_QUIRK_ flags are redefined using
the new enum ata_quirk which defines the bit shift for each quirk
flag. The array of strings ata_quirk_names is used to define the name
of each flag, which are printed by ata_dev_print_quirks().
Example output for a device listed in the __ata_dev_quirks array and
which has the ATA_QUIRK_DISABLE flag applied:
[10193.461270] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[10193.469190] ata1.00: Model 'ASMT109x- Config', rev '2143 5', applying quirks: disable
[10193.469195] ata1.00: unsupported device, disabling
[10193.481564] ata1.00: disable device
enum ata_quirk also defines the __ATA_QUIRK_MAX value as one plus the
last quirk flag defined. This value is used in ata_dev_quirks() to add a
build time check that all quirk flags fit within the unsigned int
(32-bits) quirks field of struct ata_device.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
|
|
According to Wiktionary, the verb "hork" is computing slang defined as
"To foul up; to be occupied with difficulty, tangle, or unpleasantness;
to be broken" (https://en.wiktionary.org/wiki/hork#Verb). libata uses
this with the term "horkage" to refer to broken device features. Given
that this term is not widely used and its meaning unknown to many,
rename it to the more commonly used term "quirk", similar to many other
places in the kernel.
The renaming done is:
1) Rename all ATA_HORKAGE_XXX flags to ATA_QUIRK_XXX
2) Rename struct ata_device horkage field to quirks
3) Rename struct ata_blacklist_entry to struct ata_dev_quirks_entry. The
array of these structures defining quirks for known devices is
renamed __ata_dev_quirks.
4) The functions ata_dev_blacklisted() and ata_force_horkage() are
renamed to ata_dev_quirks() and ata_force_quirks() respectively.
5) All the force_horkage_xxx() macros are renamed to force_quirk_xxx()
And while at it, make sure that the type "unsigned int" is used
consistantly for quirk flags variables and data structure fields.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
|
|
Move management of the perf_event->security blob out of the individual
security modules and into the security infrastructure. Instead of
allocating the blobs from within the modules the modules tell the
infrastructure how much space is required, and the space is allocated
there. There are no longer any modules that require the perf_event_free()
hook. The hook definition has been removed.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johansen@canonical.com>
[PM: subject tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
Move management of the infiniband security blob out of the individual
security modules and into the LSM infrastructure. The security modules
tell the infrastructure how much space they require at initialization.
There are no longer any modules that require the ib_free() hook.
The hook definition has been removed.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johansen@canonical.com>
[PM: subject tweak, selinux style fixes]
Signed-off-by: Paul Moore <paul@paul-moore.com>
|