summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2022-12-06bpf: decouple prune and jump pointsAndrii Nakryiko
BPF verifier marks some instructions as prune points. Currently these prune points serve two purposes. It's a point where verifier tries to find previously verified state and check current state's equivalence to short circuit verification for current code path. But also currently it's a point where jump history, used for precision backtracking, is updated. This is done so that non-linear flow of execution could be properly backtracked. Such coupling is coincidental and unnecessary. Some prune points are not part of some non-linear jump path, so don't need update of jump history. On the other hand, not all instructions which have to be recorded in jump history necessarily are good prune points. This patch splits prune and jump points into independent flags. Currently all prune points are marked as jump points to minimize amount of changes in this patch, but next patch will perform some optimization of prune vs jmp point placement. No functional changes are intended. Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221206233345.438540-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06bpf: Loosen alloc obj test in verifier's reg_btf_recordDave Marchevsky
btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c. There, a BTF record is created for any type containing a spin_lock or any next-gen datastructure node/head. Currently, for non-MAP_VALUE types, reg_btf_record will only search for a record using struct_meta_tab if the reg->type exactly matches (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an "allocated obj" type - returned from bpf_obj_new - might pick up other flags while working its way through the program. Loosen the check to be exact for base_type and just use MEM_ALLOC mask for type_flag. This patch is marked Fixes as the original intent of reg_btf_record was unlikely to have been to fail finding btf_record for valid alloc obj types with additional flags, some of which (e.g. PTR_UNTRUSTED) are valid register type states for alloc obj independent of this series. However, I didn't find a specific broken repro case outside of this series' added functionality, so it's possible that nothing was triggering this logic error before. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Fixes: 4e814da0d599 ("bpf: Allow locking bpf_spin_lock in allocated objects") Link: https://lore.kernel.org/r/20221206231000.3180914-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06bpf: Don't use rcu_users to refcount in task kfuncsDavid Vernet
A series of prior patches added some kfuncs that allow struct task_struct * objects to be used as kptrs. These kfuncs leveraged the 'refcount_t rcu_users' field of the task for performing refcounting. This field was used instead of 'refcount_t usage', as we wanted to leverage the safety provided by RCU for ensuring a task's lifetime. A struct task_struct is refcounted by two different refcount_t fields: 1. p->usage: The "true" refcount field which task lifetime. The task is freed as soon as this refcount drops to 0. 2. p->rcu_users: An "RCU users" refcount field which is statically initialized to 2, and is co-located in a union with a struct rcu_head field (p->rcu). p->rcu_users essentially encapsulates a single p->usage refcount, and when p->rcu_users goes to 0, an RCU callback is scheduled on the struct rcu_head which decrements the p->usage refcount. Our logic was that by using p->rcu_users, we would be able to use RCU to safely issue refcount_inc_not_zero() a task's rcu_users field to determine if a task could still be acquired, or was exiting. Unfortunately, this does not work due to p->rcu_users and p->rcu sharing a union. When p->rcu_users goes to 0, an RCU callback is scheduled to drop a single p->usage refcount, and because the fields share a union, the refcount immediately becomes nonzero again after the callback is scheduled. If we were to split the fields out of the union, this wouldn't be a problem. Doing so should also be rather non-controversial, as there are a number of places in struct task_struct that have padding which we could use to avoid growing the structure by splitting up the fields. For now, so as to fix the kfuncs to be correct, this patch instead updates bpf_task_acquire() and bpf_task_release() to use the p->usage field for refcounting via the get_task_struct() and put_task_struct() functions. Because we can no longer rely on RCU, the change also guts the bpf_task_acquire_not_zero() and bpf_task_kptr_get() functions pending a resolution on the above problem. In addition, the task fixes the kfunc and rcu_read_lock selftests to expect this new behavior. Fixes: 90660309b0c7 ("bpf: Add kfuncs for storing struct task_struct * as a kptr") Fixes: fca1aa75518c ("bpf: Handle MEM_RCU type properly") Reported-by: Matus Jokay <matus.jokay@stuba.sk> Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20221206210538.597606-1-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06Merge branch 'for-next/ftrace' into for-next/coreWill Deacon
* for-next/ftrace: ftrace: arm64: remove static ftrace ftrace: arm64: move from REGS to ARGS ftrace: abstract DYNAMIC_FTRACE_WITH_ARGS accesses ftrace: rename ftrace_instruction_pointer_set() -> ftrace_regs_set_instruction_pointer() ftrace: pass fregs to arch_ftrace_set_direct_caller()
2022-12-06PM: sleep: Refine error message in try_to_freeze_tasks()Rafael J. Wysocki
A previous change amended try_to_freeze_tasks() with the "what" variable pointing to a string describing the group of tasks subject to the freezing which may be used in the error message in there too, so make that happen. Accordingly, update sleepgraph.py to catch the modified error message as appropriate. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Petr Mladek <pmladek@suse.com>
2022-12-06PM: sleep: Avoid using pr_cont() in the tasks freezing codeRafael J. Wysocki
Using pr_cont() in the tasks freezing code related to system-wide suspend and hibernation is problematic, because the continuation messages printed there are susceptible to interspersing with other unrelated messages which results in output that is hard to understand. Address this issue by modifying try_to_freeze_tasks() to print messages that don't require continuations and adjusting its callers accordingly. Reported-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Petr Mladek <pmladek@suse.com>
2022-12-05genirq/msi: Provide msi_domain_alloc_irq_at()Thomas Gleixner
For supporting post MSI-X enable allocations and for the upcoming PCI/IMS support a separate interface is required which allows not only the allocation of a specific index, but also the allocation of any, i.e. the next free index. The latter is especially required for IMS because IMS completely does away with index to functionality mappings which are often found in MSI/MSI-X implementation. But even with MSI-X there are devices where only the first few indices have a fixed functionality and the rest is freely assignable by software, e.g. to queues. msi_domain_alloc_irq_at() is also different from the range based interfaces as it always enforces that the MSI descriptor is allocated by the core code and not preallocated by the caller like the PCI/MSI[-X] enable code path does. msi_domain_alloc_irq_at() can be invoked with the index argument set to MSI_ANY_INDEX which makes the core code pick the next free index. The irq domain can provide a prepare_desc() operation callback in it's msi_domain_ops to do domain specific post allocation initialization before the actual Linux interrupt and the associated interrupt descriptor and hierarchy alloccations are conducted. The function also takes an optional @icookie argument which is of type union msi_instance_cookie. This cookie is not used by the core code and is stored in the allocated msi_desc::data::icookie. The meaning of the cookie is completely implementation defined. In case of IMS this might be a PASID or a pointer to a device queue, but for the MSI core it's opaque and not used in any way. The function returns a struct msi_map which on success contains the allocated index number and the Linux interrupt number so the caller can spare the index to Linux interrupt number lookup. On failure map::index contains the error code and map::virq is 0. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124232326.501359457@linutronix.de
2022-12-05genirq/msi: Provide msi_domain_ops:: Prepare_desc()Thomas Gleixner
The existing MSI domain ops msi_prepare() and set_desc() turned out to be unsuitable for implementing IMS support. msi_prepare() does not operate on the MSI descriptors. set_desc() lacks an irq_domain pointer and has a completely different purpose. Introduce a prepare_desc() op which allows IMS implementations to amend an MSI descriptor which was allocated by the core code, e.g. by adjusting the iomem base or adding some data based on the allocated index. This is way better than requiring that all IMS domain implementations preallocate the MSI descriptor and then allocate the interrupt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124232326.444560717@linutronix.de
2022-12-05genirq/msi: Provide BUS_DEVICE_PCI_MSI[X]Thomas Gleixner
Provide new bus tokens for the upcoming per device PCI/MSI and PCI/MSIX interrupt domains. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124232325.917219885@linutronix.de
2022-12-05genirq/msi: Add range checking to msi_insert_desc()Thomas Gleixner
Per device domains provide the real domain size to the core code. This allows range checking on insertion of MSI descriptors and also paves the way for dynamic index allocations which are required e.g. for IMS. This avoids external mechanisms like bitmaps on the device side and just utilizes the core internal MSI descriptor storxe for it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124232325.798556374@linutronix.de
2022-12-05proc: proc_skip_spaces() shouldn't think it is working on C stringsLinus Torvalds
proc_skip_spaces() seems to think it is working on C strings, and ends up being just a wrapper around skip_spaces() with a really odd calling convention. Instead of basing it on skip_spaces(), it should have looked more like proc_skip_char(), which really is the exact same function (except it skips a particular character, rather than whitespace). So use that as inspiration, odd coding and all. Now the calling convention actually makes sense and works for the intended purpose. Reported-and-tested-by: Kyle Zeng <zengyhkyle@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-05proc: avoid integer type confusion in get_proc_longLinus Torvalds
proc_get_long() is passed a size_t, but then assigns it to an 'int' variable for the length. Let's not do that, even if our IO paths are limited to MAX_RW_COUNT (exactly because of these kinds of type errors). So do the proper test in the rigth type. Reported-by: Kyle Zeng <zengyhkyle@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-05genirq/msi: Provide msi_match_device_domain()Thomas Gleixner
Provide an interface to match a per device domain bus token. This allows to query which type of domain is installed for a particular domain id. Will be used for PCI to avoid frequent create/remove cycles for the MSI resp. MSI-X domains. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124232325.738047902@linutronix.de
2022-12-05genirq/msi: Provide msi_create/free_device_irq_domain()Thomas Gleixner
Now that all prerequsites are in place, provide the actual interfaces for creating and removing per device interrupt domains. MSI device interrupt domains are created from the provided msi_domain_template which is duplicated so that it can be modified for the particular device. The name of the domain and the name of the interrupt chip are composed by "$(PREFIX)$(CHIPNAME)-$(DEVNAME)" $PREFIX: The optional prefix provided by the underlying MSI parent domain via msi_parent_ops::prefix. $CHIPNAME: The name of the irq_chip in the template $DEVNAME: The name of the device The domain is further initialized through a MSI parent domain callback which fills in the required functionality for the parent domain or domains further down the hierarchy. This initialization can fail, e.g. when the requested feature or MSI domain type cannot be supported. The domain pointer is stored in the pointer array inside of msi_device_data which is attached to the domain. The domain can be removed via the API or left for disposal via devres when the device is torn down. The API removal is useful e.g. for PCI to have seperate domains for MSI and MSI-X, which are mutually exclusive and always occupy the default domain id slot. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124232325.678838546@linutronix.de
2022-12-05genirq/msi: Split msi_create_irq_domain()Thomas Gleixner
Split the functionality of msi_create_irq_domain() so it can be reused for creating per device irq domains. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124232325.559086358@linutronix.de
2022-12-05genirq/msi: Add size info to struct msi_domain_infoThomas Gleixner
To allow proper range checking especially for dynamic allocations add a size field to struct msi_domain_info. If the field is 0 then the size is unknown or unlimited (up to MSI_MAX_INDEX) to provide backwards compability. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124232325.501144862@linutronix.de
2022-12-05genirq/msi: Provide struct msi_parent_opsThomas Gleixner
MSI parent domains must have some control over the MSI domains which are built on top. On domain creation they need to fill in e.g. architecture specific chip callbacks or msi domain ops to make the outermost domain parent agnostic which is obviously required for architecture independence etc. The structure contains: 1) A bitfield which exposes the supported functional features. This allows to check for features and is also used in the initialization callback to mask out unsupported features when the actual domain implementation requests a broader range, e.g. on x86 PCI multi-MSI is only supported by remapping domains but not by the underlying vector domain. The PCI/MSI code can then always request multi-MSI support, but the resulting feature set after creation might not have it set. 2) An optional string prefix which is put in front of domain and chip names during creation of the MSI domain. That allows to keep the naming schemes e.g. on x86 where PCI-MSI domains have a IR- prefix when interrupt remapping is enabled. 3) An initialization callback to sanity check the domain info of the to be created MSI domain, to restrict features and to apply changes in MSI ops and interrupt chip callbacks to accomodate to the particular MSI parent implementation and/or the underlying hierarchy. Add a conveniance function to delegate the initialization from the MSI parent domain to an underlying domain in the hierarchy. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124232325.382485843@linutronix.de
2022-12-05genirq/msi: Remove unused alloc/free interfacesThomas Gleixner
Now that all users are converted remove the old interfaces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230314.694291814@linutronix.de
2022-12-05genirq/msi: Provide new domain id allocation functionsThomas Gleixner
Provide two sorts of interfaces to handle the different use cases: - msi_domain_alloc_irqs_range(): Handles a caller defined precise range - msi_domain_alloc_irqs_all(): Allocates all interrupts associated to a domain by scanning the allocated MSI descriptors The latter is useful for the existing PCI/MSI support which does not have range information available. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230314.396497163@linutronix.de
2022-12-05genirq/msi: Provide new domain id based interfaces for freeing interruptsThomas Gleixner
Provide two sorts of interfaces to handle the different use cases: - msi_domain_free_irqs_range(): Handles a caller defined precise range - msi_domain_free_irqs_all(): Frees all interrupts associated to a domain The latter is useful for device teardown and to handle the legacy MSI support which does not have any range information available. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230314.337844751@linutronix.de
2022-12-05genirq/msi: Make msi_add_simple_msi_descs() device domain awareThomas Gleixner
Allocating simple interrupt descriptors in the core code has to be multi device irqdomain aware for the upcoming PCI/IMS support. Change the interfaces to take a domain id into account. Use the internal control struct for transport of arguments. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230314.279112474@linutronix.de
2022-12-05genirq/msi: Make descriptor freeing domain awareThomas Gleixner
Change the descriptor free functions to take a domain id to prepare for the upcoming multi MSI domain per device support. To avoid changing and extending the interfaces over and over use an core internal control struct and hand the pointer through the various functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230314.220788011@linutronix.de
2022-12-05genirq/msi: Make descriptor allocation device domain awareThomas Gleixner
Change the descriptor allocation and insertion functions to take a domain id to prepare for the upcoming multi MSI domain per device support. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230314.163043028@linutronix.de
2022-12-05genirq/msi: Rename msi_add_msi_desc() to msi_insert_msi_desc()Thomas Gleixner
This reflects the functionality better. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230314.103554618@linutronix.de
2022-12-05genirq/msi: Make msi_get_virq() device domain awareAhmed S. Darwish
In preparation of the upcoming per device multi MSI domain support, change the interface to support lookups based on domain id and zero based index within the domain. Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230314.044613697@linutronix.de
2022-12-05genirq/msi: Make MSI descriptor iterators device domain awareThomas Gleixner
To support multiple MSI interrupt domains per device it is necessary to segment the xarray MSI descriptor storage. Each domain gets up to MSI_MAX_INDEX entries. Change the iterators so they operate with domain ids and take the domain offsets into account. The publicly available iterators which are mostly used in legacy implementations and the PCI/MSI core default to MSI_DEFAULT_DOMAIN (0) which is the id for the existing "global" domains. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230313.985498981@linutronix.de
2022-12-05genirq/msi: Add pointers for per device irq domainsThomas Gleixner
With the upcoming per device MSI interrupt domain support it is necessary to store the domain pointers per device. Instead of delegating that storage to device drivers or subsystems add a domain pointer to the msi_dev_domain array in struct msi_device_data. This pointer is also used to take care of tearing down the irq domains when msi_device_data is cleaned up via devres. The interfaces into the MSI core will be changed from irqdomain pointer based interfaces to domain id based interfaces to support multiple MSI domains on a single device (e.g. PCI/MSI[-X] and PCI/IMS. Once the per device domain support is complete the irq domain pointer in struct device::msi.domain will not longer contain a pointer to the "global" MSI domain. It will contain a pointer to the MSI parent domain instead. It would be a horrible maze of conditionals to evaluate all over the place which domain pointer should be used, i.e. the "global" one in device::msi::domain or one from the internal pointer array. To avoid this evaluate in msi_setup_device_data() whether the irq domain which is associated to a device is a "global" or a parent MSI domain. If it is global then copy the pointer into the first entry of the msi_dev_domain array. This allows to convert interfaces and implementation to domain ids while keeping everything existing working. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230313.923860399@linutronix.de
2022-12-05genirq/msi: Move xarray into a separate struct and create an arrayThomas Gleixner
The upcoming support for multiple MSI domains per device requires storage for the MSI descriptors and in a second step storage for the irqdomain pointers. Move the xarray into a separate data structure msi_dev_domain and create an array with size 1 in msi_device_data, which can be expanded later when the support for per device domains is implemented. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230313.864887773@linutronix.de
2022-12-05genirq/msi: Check for invalid MSI parent domain usageThomas Gleixner
In the upcoming per device MSI domain concept the MSI parent domains are not allowed to be used as regular MSI domains where the MSI allocation/free operations are applicable. Add appropriate checks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230313.806128070@linutronix.de
2022-12-05genirq/irqdomain: Rename irq_domain::dev to irq_domain:: Pm_devThomas Gleixner
irq_domain::dev is a misnomer as it's usually the rule that a device pointer points to something which is directly related to the instance. irq_domain::dev can point to some other device for power management to ensure that this underlying device is not powered down when an interrupt is allocated. The upcoming per device MSI domains really require a pointer to the device which instantiated the irq domain and not to some random other device which is required for power management down the chain. Rename irq_domain::dev to irq_domain::pm_dev and fixup the few sites which use that pointer. Conversion was done with the help of coccinelle. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230313.574541683@linutronix.de
2022-12-05genirq/msi: Move IRQ_DOMAIN_MSI_NOMASK_QUIRK to MSI flagsThomas Gleixner
It's truly a MSI only flag and for the upcoming per device MSI domains this must be in the MSI flags so it can be set during domain setup without exposing this quirk outside of x86. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221124230313.454246167@linutronix.de
2022-12-04bpf: Enable sleeptable support for cgrp local storageYonghong Song
Similar to sk/inode/task local storage, enable sleepable support for cgrp local storage. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221201050444.2785007-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04bpf: Do not mark certain LSM hook arguments as trustedYonghong Song
Martin mentioned that the verifier cannot assume arguments from LSM hook sk_alloc_security being trusted since after the hook is called, the sk ref_count is set to 1. This will overwrite the ref_count changed by the bpf program and may cause ref_count underflow later on. I then further checked some other hooks. For example, for bpf_lsm_file_alloc() hook in fs/file_table.c, f->f_cred = get_cred(cred); error = security_file_alloc(f); if (unlikely(error)) { file_free_rcu(&f->f_rcuhead); return ERR_PTR(error); } atomic_long_set(&f->f_count, 1); The input parameter 'f' to security_file_alloc() cannot be trusted as well. Specifically, I investiaged bpf_map/bpf_prog/file/sk/task alloc/free lsm hooks. Except bpf_map_alloc and task_alloc, arguments for all other hooks should not be considered as trusted. This may not be a complete list, but it covers common usage for sk and task. Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs") Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221203204954.2043348-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04bpf: Handle MEM_RCU type properlyYonghong Song
Commit 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()") introduced MEM_RCU and bpf_rcu_read_lock/unlock() support. In that commit, a rcu pointer is tagged with both MEM_RCU and PTR_TRUSTED so that it can be passed into kfuncs or helpers as an argument. Martin raised a good question in [1] such that the rcu pointer, although being able to accessing the object, might have reference count of 0. This might cause a problem if the rcu pointer is passed to a kfunc which expects trusted arguments where ref count should be greater than 0. This patch makes the following changes related to MEM_RCU pointer: - MEM_RCU pointer might be NULL (PTR_MAYBE_NULL). - Introduce KF_RCU so MEM_RCU ptr can be acquired with a KF_RCU tagged kfunc which assumes ref count of rcu ptr could be zero. - For mem access 'b = ptr->a', say 'ptr' is a MEM_RCU ptr, and 'a' is tagged with __rcu as well. Let us mark 'b' as MEM_RCU | PTR_MAYBE_NULL. [1] https://lore.kernel.org/bpf/ac70f574-4023-664e-b711-e0d3b18117fd@linux.dev/ Fixes: 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()") Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221203184602.477272-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04Merge tag 'perf_urgent_for_v6.1_rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - Fix a use-after-free case where the perf pending task callback would see an already freed event * tag 'perf_urgent_for_v6.1_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix perf_pending_task() UaF
2022-12-02signal: Initialize the info in ksignalhaifeng.xu
When handing the SIGNAL_GROUP_EXIT flag, the info in ksignal isn't cleared. However, the info acquired by dequeue_synchronous_signal/dequeue_signal is initialized and can be safely used. Fortunately, the fatal signal process just uses the si_signo and doesn't use any other member. Even so, the initialization before use is more safer. Signed-off-by: haifeng.xu <haifeng.xu@shopee.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221128065606.19570-1-haifeng.xu@shopee.com
2022-12-02panic: Expose "warn_count" to sysfsKees Cook
Since Warn count is now tracked and is a fairly interesting signal, add the entry /sys/kernel/warn_count to expose it to userspace. Cc: Petr Mladek <pmladek@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: tangmeng <tangmeng@uniontech.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-6-keescook@chromium.org
2022-12-02panic: Introduce warn_limitKees Cook
Like oops_limit, add warn_limit for limiting the number of warnings when panic_on_warn is not set. Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Petr Mladek <pmladek@suse.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-doc@vger.kernel.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-5-keescook@chromium.org
2022-12-02panic: Consolidate open-coded panic_on_warn checksKees Cook
Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll their own warnings, and each check "panic_on_warn". Consolidate this into a single function so that future instrumentation can be added in a single location. Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Gow <davidgow@google.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: Jann Horn <jannh@google.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: kasan-dev@googlegroups.com Cc: linux-mm@kvack.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org
2022-12-02exit: Allow oops_limit to be disabledKees Cook
In preparation for keeping oops_limit logic in sync with warn_limit, have oops_limit == 0 disable checking the Oops counter. Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-doc@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2022-12-02seccomp: Move copy_seccomp() to no failure path.Kuniyuki Iwashima
Our syzbot instance reported memory leaks in do_seccomp() [0], similar to the report [1]. It shows that we miss freeing struct seccomp_filter and some objects included in it. We can reproduce the issue with the program below [2] which calls one seccomp() and two clone() syscalls. The first clone()d child exits earlier than its parent and sends a signal to kill it during the second clone(), more precisely before the fatal_signal_pending() test in copy_process(). When the parent receives the signal, it has to destroy the embryonic process and return -EINTR to user space. In the failure path, we have to call seccomp_filter_release() to decrement the filter's refcount. Initially, we called it in free_task() called from the failure path, but the commit 3a15fb6ed92c ("seccomp: release filter after task is fully dead") moved it to release_task() to notify user space as early as possible that the filter is no longer used. To keep the change and current seccomp refcount semantics, let's move copy_seccomp() just after the signal check and add a WARN_ON_ONCE() in free_task() for future debugging. [0]: unreferenced object 0xffff8880063add00 (size 256): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.914s) hex dump (first 32 bytes): 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ backtrace: do_seccomp (./include/linux/slab.h:600 ./include/linux/slab.h:733 kernel/seccomp.c:666 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) unreferenced object 0xffffc90000035000 (size 4096): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 00 00 00 00 05 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: __vmalloc_node_range (mm/vmalloc.c:3226) __vmalloc_node (mm/vmalloc.c:3261 (discriminator 4)) bpf_prog_alloc_no_stats (kernel/bpf/core.c:91) bpf_prog_alloc (kernel/bpf/core.c:129) bpf_prog_create_from_user (net/core/filter.c:1414) do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) unreferenced object 0xffff888003fa1000 (size 1024): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: bpf_prog_alloc_no_stats (./include/linux/slab.h:600 ./include/linux/slab.h:733 kernel/bpf/core.c:95) bpf_prog_alloc (kernel/bpf/core.c:129) bpf_prog_create_from_user (net/core/filter.c:1414) do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) unreferenced object 0xffff888006360240 (size 16): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s) hex dump (first 16 bytes): 01 00 37 00 76 65 72 6c e0 83 01 06 80 88 ff ff ..7.verl........ backtrace: bpf_prog_store_orig_filter (net/core/filter.c:1137) bpf_prog_create_from_user (net/core/filter.c:1428) do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) unreferenced object 0xffff8880060183e0 (size 8): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s) hex dump (first 8 bytes): 06 00 00 00 00 00 ff 7f ........ backtrace: kmemdup (mm/util.c:129) bpf_prog_store_orig_filter (net/core/filter.c:1144) bpf_prog_create_from_user (net/core/filter.c:1428) do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) [1]: https://syzkaller.appspot.com/bug?id=2809bb0ac77ad9aa3f4afe42d6a610aba594a987 [2]: #define _GNU_SOURCE #include <sched.h> #include <signal.h> #include <unistd.h> #include <sys/syscall.h> #include <linux/filter.h> #include <linux/seccomp.h> void main(void) { struct sock_filter filter[] = { BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog fprog = { .len = sizeof(filter) / sizeof(filter[0]), .filter = filter, }; long i, pid; syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, 0, &fprog); for (i = 0; i < 2; i++) { pid = syscall(__NR_clone, CLONE_NEWNET | SIGKILL, NULL, NULL, 0); if (pid == 0) return; } } Fixes: 3a15fb6ed92c ("seccomp: release filter after task is fully dead") Reported-by: syzbot+ab17848fe269b573eb71@syzkaller.appspotmail.com Reported-by: Ayushman Dutta <ayudutta@amazon.com> Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220823154532.82913-1-kuniyu@amazon.com
2022-12-02Merge tag 'v6.1-rc7' into iommufd.git for-nextJason Gunthorpe
Resolve conflicts in drivers/vfio/vfio_main.c by using the iommfd version. The rc fix was done a different way when iommufd patches reworked this code. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-02cpu/hotplug: Do not bail-out in DYING/STARTING sectionsVincent Donnefort
The DYING/STARTING callbacks are not expected to fail. However, as reported by Derek, buggy drivers such as tboot are still free to return errors within those sections, which halts the hot(un)plug and leaves the CPU in an unrecoverable state. As there is no rollback possible, only log the failures and proceed with the following steps. This restores the hotplug behaviour prior to commit 453e41085183 ("cpu/hotplug: Add cpuhp_invoke_callback_range()") Fixes: 453e41085183 ("cpu/hotplug: Add cpuhp_invoke_callback_range()") Reported-by: Derek Dolney <z23@posteo.net> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Derek Dolney <z23@posteo.net> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215867 Link: https://lore.kernel.org/r/20220927101259.1149636-1-vdonnefort@google.com
2022-12-02cpu/hotplug: Set cpuhp target for boot cpuPhil Auld
Since the boot cpu does not go through the hotplug process it ends up with state == CPUHP_ONLINE but target == CPUHP_OFFLINE. So set the target to match in boot_cpu_hotplug_init(). Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20221117162329.3164999-3-pauld@redhat.com
2022-12-02cpu/hotplug: Make target_store() a nop when target == statePhil Auld
Writing the current state back in hotplug/target calls cpu_down() which will set cpu dying even when it isn't and then nothing will ever clear it. A stress test that reads values and writes them back for all cpu device files in sysfs will trigger the BUG() in select_fallback_rq once all cpus are marked as dying. kernel/cpu.c::target_store() ... if (st->state < target) ret = cpu_up(dev->id, target); else ret = cpu_down(dev->id, target); cpu_down() -> cpu_set_state() bool bringup = st->state < target; ... if (cpu_dying(cpu) != !bringup) set_cpu_dying(cpu, !bringup); Fix this by letting state==target fall through in the target_store() conditional. Also make sure st->target == target in that case. Fixes: 757c989b9994 ("cpu/hotplug: Make target state writeable") Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20221117162329.3164999-2-pauld@redhat.com
2022-12-02futex: Resend potentially swallowed owner death notificationAlexey Izbyshev
Commit ca16d5bee598 ("futex: Prevent robust futex exit race") addressed two cases when tasks waiting on a robust non-PI futex remained blocked despite the futex not being owned anymore: * if the owner died after writing zero to the futex word, but before waking up a waiter * if a task waiting on the futex was woken up, but died before updating the futex word (effectively swallowing the notification without acting on it) In the second case, the task could be woken up either by the previous owner (after the futex word was reset to zero) or by the kernel (after the OWNER_DIED bit was set and the TID part of the futex word was reset to zero) if the previous owner died without the resetting the futex. Because the referenced commit wakes up a potential waiter only if the whole futex word is zero, the latter subcase remains unaddressed. Fix this by looking only at the TID part of the futex when deciding whether a wake up is needed. Fixes: ca16d5bee598 ("futex: Prevent robust futex exit race") Signed-off-by: Alexey Izbyshev <izbyshev@ispras.ru> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221111215439.248185-1-izbyshev@ispras.ru
2022-12-02printk: htmldocs: add missing descriptionJohn Ogness
Variable and return descriptions were missing from the SRCU read lock functions. Add them. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/87zgcjpdvo.fsf@jogness.linutronix.de
2022-12-02printk: relieve console_lock of list synchronization dutiesJohn Ogness
The console_list_lock provides synchronization for console list and console->flags updates. All call sites that were using the console_lock for this synchronization have either switched to use the console_list_lock or the SRCU list iterator. Remove console_lock usage for console list updates and console->flags updates. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20221116162152.193147-40-john.ogness@linutronix.de
2022-12-02printk, xen: fbfront: create/use safe function for forcing preferredJohn Ogness
With commit 9e124fe16ff2("xen: Enable console tty by default in domU if it's not a dummy") a hack was implemented to make sure that the tty console remains the console behind the /dev/console device. The main problem with the hack is that, after getting the console pointer to the tty console, it is assumed the pointer is still valid after releasing the console_sem. This assumption is incorrect and unsafe. Make the hack safe by introducing a new function console_force_preferred_locked() and perform the full operation under the console_list_lock. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20221116162152.193147-33-john.ogness@linutronix.de
2022-12-02console: introduce console_is_registered()John Ogness
Currently it is not possible for drivers to detect if they have already successfully registered their console. Several drivers have multiple paths that lead to console registration. To avoid attempting a 2nd registration (which leads to a WARN), drivers are implementing their own solution. Introduce console_is_registered() so drivers can easily identify if their console is currently registered. A _locked() variant is also provided if the caller is already holding the console_list_lock. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20221116162152.193147-22-john.ogness@linutronix.de