summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2021-11-08Merge tag 'modules-5.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull module updates from Luis Chamberlain: "As requested by Jessica I'm stepping in to help with modules maintenance. This is my first pull request to you. I've collected only two patches for modules for the 5.16-rc1 merge window. These patches are from Shuah Khan as she debugged some corner case error with modules. The error messages are improved for elf_validity_check(). While doing this work a corner case fix was spotted on validate_section_offset() due to a possible overflow bug on 64-bit. The impact of this fix is low given this just limits module section headers placed within the 32-bit boundary, and we obviously don't have insane module sizes. Even if a specially crafted module is constructed later checks would invalidate the module right away. I've let this sit through 0-day testing since October 15th with no issues found" * tag 'modules-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: module: change to print useful messages from elf_validity_check() module: fix validate_section_offset() overflow bug on 64-bit
2021-11-06Merge tag 'fsnotify_for_v5.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "Support for reporting filesystem errors through fanotify so that system health monitoring daemons can watch for these and act instead of scraping system logs" * tag 'fsnotify_for_v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (34 commits) samples: remove duplicate include in fs-monitor.c samples: Fix warning in fsnotify sample docs: Fix formatting of literal sections in fanotify docs samples: Make fs-monitor depend on libc and headers docs: Document the FAN_FS_ERROR event samples: Add fs error monitoring example ext4: Send notifications on error fanotify: Allow users to request FAN_FS_ERROR events fanotify: Emit generic error info for error event fanotify: Report fid info for file related file system errors fanotify: WARN_ON against too large file handles fanotify: Add helpers to decide whether to report FID/DFID fanotify: Wrap object_fh inline space in a creator macro fanotify: Support merging of error events fanotify: Support enqueueing of error events fanotify: Pre-allocate pool of error events fanotify: Reserve UAPI bits for FAN_FS_ERROR fsnotify: Support FS_ERROR event type fanotify: Require fid_mode for any non-fd event fanotify: Encode empty file handle when no inode is provided ...
2021-11-06Merge tag 'pci-v5.16-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull pci updates from Bjorn Helgaas: "Enumeration: - Conserve IRQs by setting up portdrv IRQs only when there are users (Jan Kiszka) - Rework and simplify _OSC negotiation for control of PCIe features (Joerg Roedel) - Remove struct pci_dev.driver pointer since it's redundant with the struct device.driver pointer (Uwe Kleine-König) Resource management: - Coalesce contiguous host bridge apertures from _CRS to accommodate BARs that cover more than one aperture (Kai-Heng Feng) Sysfs: - Check CAP_SYS_ADMIN before parsing user input (Krzysztof Wilczyński) - Return -EINVAL consistently from "store" functions (Krzysztof Wilczyński) - Use sysfs_emit() in endpoint "show" functions to avoid buffer overruns (Kunihiko Hayashi) PCIe native device hotplug: - Ignore Link Down/Up caused by resets during error recovery so endpoint drivers can remain bound to the device (Lukas Wunner) Virtualization: - Avoid bus resets on Atheros QCA6174, where they hang the device (Ingmar Klein) - Work around Pericom PI7C9X2G switch packet drop erratum by using store and forward mode instead of cut-through (Nathan Rossi) - Avoid trying to enable AtomicOps on VFs; the PF setting applies to all VFs (Selvin Xavier) MSI: - Document that /sys/bus/pci/devices/.../irq contains the legacy INTx interrupt or the IRQ of the first MSI (not MSI-X) vector (Barry Song) VPD: - Add pci_read_vpd_any() and pci_write_vpd_any() to access anywhere in the possible VPD space; use these to simplify the cxgb3 driver (Heiner Kallweit) Peer-to-peer DMA: - Add (not subtract) the bus offset when calculating DMA address (Wang Lu) ASPM: - Re-enable LTR at Downstream Ports so they don't report Unsupported Requests when reset or hot-added devices send LTR messages (Mingchuang Qiao) Apple PCIe controller driver: - Add driver for Apple M1 PCIe controller (Alyssa Rosenzweig, Marc Zyngier) Cadence PCIe controller driver: - Return success when probe succeeds instead of falling into error path (Li Chen) HiSilicon Kirin PCIe controller driver: - Reorganize PHY logic and add support for external PHY drivers (Mauro Carvalho Chehab) - Support PERST# GPIOs for HiKey970 external PEX 8606 bridge (Mauro Carvalho Chehab) - Add Kirin 970 support (Mauro Carvalho Chehab) - Make driver removable (Mauro Carvalho Chehab) Intel VMD host bridge driver: - If IOMMU supports interrupt remapping, leave VMD MSI-X remapping enabled (Adrian Huang) - Number each controller so we can tell them apart in /proc/interrupts (Chunguang Xu) - Avoid building on UML because VMD depends on x86 bare metal APIs (Johannes Berg) Marvell Aardvark PCIe controller driver: - Define macros for PCI_EXP_DEVCTL_PAYLOAD_* (Pali Rohár) - Set Max Payload Size to 512 bytes per Marvell spec (Pali Rohár) - Downgrade PIO Response Status messages to debug level (Marek Behún) - Preserve CRS SV (Config Request Retry Software Visibility) bit in emulated Root Control register (Pali Rohár) - Fix issue in configuring reference clock (Pali Rohár) - Don't clear status bits for masked interrupts (Pali Rohár) - Don't mask unused interrupts (Pali Rohár) - Avoid code repetition in advk_pcie_rd_conf() (Marek Behún) - Retry config accesses on CRS response (Pali Rohár) - Simplify emulated Root Capabilities initialization (Pali Rohár) - Fix several link training issues (Pali Rohár) - Fix link-up checking via LTSSM (Pali Rohár) - Fix reporting of Data Link Layer Link Active (Pali Rohár) - Fix emulation of W1C bits (Marek Behún) - Fix MSI domain .alloc() method to return zero on success (Marek Behún) - Read entire 16-bit MSI vector in MSI handler, not just low 8 bits (Marek Behún) - Clear Root Port I/O Space, Memory Space, and Bus Master Enable bits at startup; PCI core will set those as necessary (Pali Rohár) - When operating as a Root Port, set class code to "PCI Bridge" instead of the default "Mass Storage Controller" (Pali Rohár) - Add emulation for PCI_BRIDGE_CTL_BUS_RESET since aardvark doesn't implement this per spec (Pali Rohár) - Add emulation of option ROM BAR since aardvark doesn't implement this per spec (Pali Rohár) MediaTek MT7621 PCIe controller driver: - Add MediaTek MT7621 PCIe host controller driver and DT binding (Sergio Paracuellos) Qualcomm PCIe controller driver: - Add SC8180x compatible string (Bjorn Andersson) - Add endpoint controller driver and DT binding (Manivannan Sadhasivam) - Restructure to use of_device_get_match_data() (Prasad Malisetty) - Add SC7280-specific pcie_1_pipe_clk_src handling (Prasad Malisetty) Renesas R-Car PCIe controller driver: - Remove unnecessary includes (Geert Uytterhoeven) Rockchip DesignWare PCIe controller driver: - Add DT binding (Simon Xue) Socionext UniPhier Pro5 controller driver: - Serialize INTx masking/unmasking (Kunihiko Hayashi) Synopsys DesignWare PCIe controller driver: - Run dwc .host_init() method before registering MSI interrupt handler so we can deal with pending interrupts left by bootloader (Bjorn Andersson) - Clean up Kconfig dependencies (Andy Shevchenko) - Export symbols to allow more modular drivers (Luca Ceresoli) TI DRA7xx PCIe controller driver: - Allow host and endpoint drivers to be modules (Luca Ceresoli) - Enable external clock if present (Luca Ceresoli) TI J721E PCIe driver: - Disable PHY when probe fails after initializing it (Christophe JAILLET) MicroSemi Switchtec management driver: - Return error to application when command execution fails because an out-of-band reset has cleared the device BARs, Memory Space Enable, etc (Kelvin Cao) - Fix MRPC error status handling issue (Kelvin Cao) - Mask out other bits when reading of management VEP instance ID (Kelvin Cao) - Return EOPNOTSUPP instead of ENOTSUPP from sysfs show functions (Kelvin Cao) - Add check of event support (Logan Gunthorpe) Miscellaneous: - Remove unused pci_pool wrappers, which have been replaced by dma_pool (Cai Huoqing) - Use 'unsigned int' instead of bare 'unsigned' (Krzysztof Wilczyński) - Use kstrtobool() directly, sans strtobool() wrapper (Krzysztof Wilczyński) - Fix some sscanf(), sprintf() format mismatches (Krzysztof Wilczyński) - Update PCI subsystem information in MAINTAINERS (Krzysztof Wilczyński) - Correct some misspellings (Krzysztof Wilczyński)" * tag 'pci-v5.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (137 commits) PCI: Add ACS quirk for Pericom PI7C9X2G switches PCI: apple: Configure RID to SID mapper on device addition iommu/dart: Exclude MSI doorbell from PCIe device IOVA range PCI: apple: Implement MSI support PCI: apple: Add INTx and per-port interrupt support PCI: kirin: Allow removing the driver PCI: kirin: De-init the dwc driver PCI: kirin: Disable clkreq during poweroff sequence PCI: kirin: Move the power-off code to a common routine PCI: kirin: Add power_off support for Kirin 960 PHY PCI: kirin: Allow building it as a module PCI: kirin: Add MODULE_* macros PCI: kirin: Add Kirin 970 compatible PCI: kirin: Support PERST# GPIOs for HiKey970 external PEX 8606 bridge PCI: apple: Set up reference clocks when probing PCI: apple: Add initial hardware bring-up PCI: of: Allow matching of an interrupt-map local to a PCI device of/irq: Allow matching of an interrupt-map local to an interrupt controller irqdomain: Make of_phandle_args_to_fwspec() generally available PCI: Do not enable AtomicOps on VFs ...
2021-11-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: "257 patches. Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache, gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools, memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm, vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram, cleanups, kfence, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits) mm/damon: remove return value from before_terminate callback mm/damon: fix a few spelling mistakes in comments and a pr_debug message mm/damon: simplify stop mechanism Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Docs/admin-guide/mm/damon/start: simplify the content Docs/admin-guide/mm/damon/start: fix a wrong link Docs/admin-guide/mm/damon/start: fix wrong example commands mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on mm/damon: remove unnecessary variable initialization Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) selftests/damon: support watermarks mm/damon/dbgfs: support watermarks mm/damon/schemes: activate schemes based on a watermarks mechanism tools/selftests/damon: update for regions prioritization of schemes mm/damon/dbgfs: support prioritization weights mm/damon/vaddr,paddr: support pageout prioritization mm/damon/schemes: prioritize regions within the quotas mm/damon/selftests: support schemes quotas mm/damon/dbgfs: support quotas of schemes ...
2021-11-06stacktrace: move filter_irq_stacks() to kernel/stacktrace.cMarco Elver
filter_irq_stacks() has little to do with the stackdepot implementation, except that it is usually used by users (such as KASAN) of stackdepot to reduce the stack trace. However, filter_irq_stacks() itself is not useful without a stack trace as obtained by stack_trace_save() and friends. Therefore, move filter_irq_stacks() to kernel/stacktrace.c, so that new users of filter_irq_stacks() do not have to start depending on STACKDEPOT only for filter_irq_stacks(). Link: https://lkml.kernel.org/r/20210923104803.2620285-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Jann Horn <jannh@google.com> Cc: Aleksandr Nogikh <nogikh@google.com> Cc: Taras Madan <tarasmadan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGEDDavid Hildenbrand
Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED, indicating that we're dealing with a memory region that is never indicated in the firmware-provided memory map, but always detected and added by a driver. Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory regions like ordinary MEMBLOCK_NONE memory regions -- for example, when selecting memory regions to add to the vmcore for dumping in the crashkernel via for_each_mem_range(). However, especially kexec_file is not supposed to select such memblocks via for_each_free_mem_range() / for_each_free_mem_range_reverse() to place kexec images, similar to how we handle IORESOURCE_SYSRAM_DRIVER_MANAGED without CONFIG_ARCH_KEEP_MEMBLOCK. We'll make sure that memory hotplug code sets the flag where applicable (IORESOURCE_SYSRAM_DRIVER_MANAGED) next. This prepares architectures that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem support. Note that kexec *must not* indicate this memory to the second kernel and *must not* place kexec-images on this memory. Let's add a comment to kexec_walk_memblock(), documenting how we handle MEMBLOCK_DRIVER_MANAGED now just like using IORESOURCE_SYSRAM_DRIVER_MANAGED in locate_mem_hole_callback() for kexec_walk_resources(). Also note that MEMBLOCK_HOTPLUG cannot be reused due to different semantics: MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the firmware-provided memory map and added to the system early during boot; kexec *has to* indicate this memory to the second kernel and can place kexec-images on this memory. After memory hotunplug, kexec has to be re-armed. We mostly ignore this flag when "movable_node" is not set on the kernel command line, because then we're told to not care about hotunpluggability of such memory regions. MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in the firmware-provided memory map; this memory is always detected and added to the system by a driver; memory might not actually be physically hotunpluggable. kexec *must not* indicate this memory to the second kernel and *must not* place kexec-images on this memory. Link: https://lkml.kernel.org/r/20211004093605.5830-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jianyong Wu <Jianyong.Wu@arm.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Shahab Vahedi <shahab@synopsys.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06memblock: use memblock_free for freeing virtual pointersMike Rapoport
Rename memblock_free_ptr() to memblock_free() and use memblock_free() when freeing a virtual pointer so that memblock_free() will be a counterpart of memblock_alloc() The callers are updated with the below semantic patch and manual addition of (void *) casting to pointers that are represented by unsigned long variables. @@ identifier vaddr; expression size; @@ ( - memblock_phys_free(__pa(vaddr), size); + memblock_free(vaddr, size); | - memblock_free_ptr(vaddr, size); + memblock_free(vaddr, size); ) [sfr@canb.auug.org.au: fixup] Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Juergen Gross <jgross@suse.com> Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06memblock: rename memblock_free to memblock_phys_freeMike Rapoport
Since memblock_free() operates on a physical range, make its name reflect it and rename it to memblock_phys_free(), so it will be a logical counterpart to memblock_phys_alloc(). The callers are updated with the below semantic patch: @@ expression addr; expression size; @@ - memblock_free(addr, size); + memblock_phys_free(addr, size); Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Juergen Gross <jgross@suse.com> Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06memblock: drop memblock_free_early_nid() and memblock_free_early()Mike Rapoport
memblock_free_early_nid() is unused and memblock_free_early() is an alias for memblock_free(). Replace calls to memblock_free_early() with calls to memblock_free() and remove memblock_free_early() and memblock_free_early_nid(). Link: https://lkml.kernel.org/r/20210930185031.18648-4-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Juergen Gross <jgross@suse.com> Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm: make generic arch_is_kernel_initmem_freed() do what it saysChristophe Leroy
Commit 7a5da02de8d6 ("locking/lockdep: check for freed initmem in static_obj()") added arch_is_kernel_initmem_freed() which is supposed to report whether an object is part of already freed init memory. For the time being, the generic version of arch_is_kernel_initmem_freed() always reports 'false', allthough free_initmem() is generically called on all architectures. Therefore, change the generic version of arch_is_kernel_initmem_freed() to check whether free_initmem() has been called. If so, then check if a given address falls into init memory. To ease the use of system_state, move it out of line into its only caller which is lockdep.c Link: https://lkml.kernel.org/r/1d40783e676e07858be97d881f449ee7ea8adfb1.1633001016.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm: create a new system state and fix core_kernel_text()Christophe Leroy
core_kernel_text() considers that until system_state in at least SYSTEM_RUNNING, init memory is valid. But init memory is freed a few lines before setting SYSTEM_RUNNING, so we have a small period of time when core_kernel_text() is wrong. Create an intermediate system state called SYSTEM_FREEING_INIT that is set before starting freeing init memory, and use it in core_kernel_text() to report init memory invalid earlier. Link: https://lkml.kernel.org/r/9ecfdee7dd4d741d172cb93ff1d87f1c58127c9a.1633001016.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm/page_alloc: detect allocation forbidden by cpuset and bail out earlyFeng Tang
There was a report that starting an Ubuntu in docker while using cpuset to bind it to movable nodes (a node only has movable zone, like a node for hotplug or a Persistent Memory node in normal usage) will fail due to memory allocation failure, and then OOM is involved and many other innocent processes got killed. It can be reproduced with command: $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status" (where node 4 is a movable node) runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0 CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased) Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020 Call Trace: dump_stack+0x6b/0x88 dump_header+0x4a/0x1e2 oom_kill_process.cold+0xb/0x10 out_of_memory.part.0+0xaf/0x230 out_of_memory+0x3d/0x80 __alloc_pages_slowpath.constprop.0+0x954/0xa20 __alloc_pages_nodemask+0x2d3/0x300 pipe_write+0x322/0x590 new_sync_write+0x196/0x1b0 vfs_write+0x1c3/0x1f0 ksys_write+0xa7/0xe0 do_syscall_64+0x52/0xd0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Mem-Info: active_anon:392832 inactive_anon:182 isolated_anon:0 active_file:68130 inactive_file:151527 isolated_file:0 unevictable:2701 dirty:0 writeback:7 slab_reclaimable:51418 slab_unreclaimable:116300 mapped:45825 shmem:735 pagetables:2540 bounce:0 free:159849484 free_pcp:73 free_cma:0 Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 0 Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0 Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0 oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB The reason is that in this case, the target cpuset nodes only have movable zone, while the creation of an OS in docker sometimes needs to allocate memory in non-movable zones (dma/dma32/normal) like GFP_HIGHUSER, and the cpuset limit forbids the allocation, then out-of-memory killing is involved even when normal nodes and movable nodes both have many free memory. The OOM killer cannot help to resolve the situation as there is no usable memory for the request in the cpuset scope. The only reasonable measure to take is to fail the allocation right away and have the caller to deal with it. So add a check for cases like this in the slowpath of allocation, and bail out early returning NULL for the allocation. As page allocation is one of the hottest path in kernel, this check will hurt all users with sane cpuset configuration, add a static branch check and detect the abnormal config in cpuset memory binding setup so that the extra check cost in page allocation is not paid by everyone. [thanks to Micho Hocko and David Rientjes for suggesting not handling it inside OOM code, adding cpuset check, refining comments] Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm: move node_reclaim_distance to fix NUMA without SMPGeert Uytterhoeven
Patch series "Fix NUMA without SMP". SuperH is the only architecture which still supports NUMA without SMP, for good reasons (various memories scattered around the address space, each with varying latencies). This series fixes two build errors due to variables and functions used by the NUMA code being provided by SMP-only source files or sections. This patch (of 2): If CONFIG_NUMA=y, but CONFIG_SMP=n (e.g. sh/migor_defconfig): sh4-linux-gnu-ld: mm/page_alloc.o: in function `get_page_from_freelist': page_alloc.c:(.text+0x2c24): undefined reference to `node_reclaim_distance' Fix this by moving the declaration of node_reclaim_distance from an SMP-only to a generic file. Link: https://lkml.kernel.org/r/cover.1631781495.git.geert+renesas@glider.be Link: https://lkml.kernel.org/r/6432666a648dde85635341e6c918cee97c97d264.1631781495.git.geert+renesas@glider.be Fixes: a55c7454a8c887b2 ("sched/topology: Improve load balancing on AMD EPYC systems") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Suggested-by: Matt Fleming <matt@codeblueprint.co.uk> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yoshinori Sato <ysato@users.osdn.me> Cc: Rich Felker <dalias@libc.org> Cc: Gon Solo <gonsolo@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm/mmap.c: fix a data race of mm->total_vmPeng Liu
The variable mm->total_vm could be accessed concurrently during mmaping and system accounting as noticed by KCSAN, BUG: KCSAN: data-race in __acct_update_integrals / mmap_region read-write to 0xffffa40267bd14c8 of 8 bytes by task 15609 on cpu 3: mmap_region+0x6dc/0x1400 do_mmap+0x794/0xca0 vm_mmap_pgoff+0xdf/0x150 ksys_mmap_pgoff+0xe1/0x380 do_syscall_64+0x37/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xa9 read to 0xffffa40267bd14c8 of 8 bytes by interrupt on cpu 2: __acct_update_integrals+0x187/0x1d0 acct_account_cputime+0x3c/0x40 update_process_times+0x5c/0x150 tick_sched_timer+0x184/0x210 __run_hrtimer+0x119/0x3b0 hrtimer_interrupt+0x350/0xaa0 __sysvec_apic_timer_interrupt+0x7b/0x220 asm_call_irq_on_stack+0x12/0x20 sysvec_apic_timer_interrupt+0x4d/0x80 asm_sysvec_apic_timer_interrupt+0x12/0x20 smp_call_function_single+0x192/0x2b0 perf_install_in_context+0x29b/0x4a0 __se_sys_perf_event_open+0x1a98/0x2550 __x64_sys_perf_event_open+0x63/0x70 do_syscall_64+0x37/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 2 PID: 15610 Comm: syz-executor.3 Not tainted 5.10.0+ #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 In vm_stat_account which called by mmap_region, increase total_vm, and __acct_update_integrals may read total_vm at the same time. This will cause a data race which lead to undefined behaviour. To avoid potential bad read/write, volatile property and barrier are both used to avoid undefined behaviour. Link: https://lkml.kernel.org/r/20210913105550.1569419-1-liupeng256@huawei.com Signed-off-by: Peng Liu <liupeng256@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06workqueue, kasan: avoid alloc_pages() when recording stackMarco Elver
Shuah Khan reported: | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled, | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when | it tries to allocate memory attempting to acquire spinlock in page | allocation code while holding workqueue pool raw_spinlock. | | There are several instances of this problem when block layer tries | to __queue_work(). Call trace from one of these instances is below: | | kblockd_mod_delayed_work_on() | mod_delayed_work_on() | __queue_delayed_work() | __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held) | insert_work() | kasan_record_aux_stack() | kasan_save_stack() | stack_depot_save() | alloc_pages() | __alloc_pages() | get_page_from_freelist() | rm_queue() | rm_queue_pcplist() | local_lock_irqsave(&pagesets.lock, flags); | [ BUG: Invalid wait context triggered ] The default kasan_record_aux_stack() calls stack_depot_save() with GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...). In general, however, it is not even possible to use either GFP_ATOMIC nor GFP_NOWAIT in certain non-preemptive contexts, including raw_spin_locks (see gfp.h and commmit ab00db216c9c7). Fix it by instructing stackdepot to not expand stack storage via alloc_pages() in case it runs out by using kasan_record_aux_stack_noalloc(). While there is an increased risk of failing to insert the stack trace, this is typically unlikely, especially if the same insertion had already succeeded previously (stack depot hit). For frequent calls from the same location, it therefore becomes extremely unlikely that kasan_record_aux_stack_noalloc() fails. Link: https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org Link: https://lkml.kernel.org/r/20210913112609.2651084-7-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reported-by: Shuah Khan <skhan@linuxfoundation.org> Tested-by: Shuah Khan <skhan@linuxfoundation.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Taras Madan <tarasmadan@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Walter Wu <walter-zh.wu@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-05module: change to print useful messages from elf_validity_check()Shuah Khan
elf_validity_check() checks ELF headers for errors and ELF Spec. compliance and if any of them fail it returns -ENOEXEC from all of these error paths. Almost all of them don't print any messages. When elf_validity_check() returns an error, load_module() prints an error message without error code. It is hard to determine why the module ELF structure is invalid, even if load_module() prints the error code which is -ENOEXEC in all of these cases. Change to print useful error messages from elf_validity_check() to clearly say what went wrong and why the ELF validity checks failed. Remove the load_module() error message which is no longer needed. This patch includes changes to fix build warns on 32-bit platforms: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 3 has type 'Elf32_Off' {aka 'unsigned int'} Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2021-11-05module: fix validate_section_offset() overflow bug on 64-bitShuah Khan
validate_section_offset() uses unsigned long local variable to add/store shdr->sh_offset and shdr->sh_size on all platforms. unsigned long is too short when sh_offset is Elf64_Off which would be the case on 64bit ELF headers. Without this fix applied we were shorting the design of modules to have section headers placed within the 32-bit boundary (4 GiB) instead of 64-bits when on 64-bit architectures (which allows for up to 16,777,216 TiB). In practice this just meant we were limiting modules sections to below 4 GiB even on 64-bit systems. This then should not really affect any real-world use case as modules these days obviously should likely never exceed 1 GiB in size overall. A specially crafted invalid module might succeed to skip validation in validate_section_offset() due to this mistake, but in such case no impact is observed through code inspection given the correct data types are used for the copy of the module when needed on move_module() when the section type is not SHT_NOBITS (which indicates no the section occupies no space on the file). Fix the overflow problem using the right size local variable when CONFIG_64BIT is defined. Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> [mcgrof: expand commit log with possible impact if not applied] Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2021-11-04PCI: apple: Add INTx and per-port interrupt supportMarc Zyngier
Add support for the per-port interrupt controller that deals with both INTx signalling and management interrupts. This allows the Link-up/Link-down interrupts to be wired, allowing the bring-up to be synchronised (and provide debug information). The framework can further be used to handle the rest of the per port events if and when necessary. Likewise, INTx signalling is implemented so that end-points can actually be used. Link: https://lore.kernel.org/r/20210929163847.2807812-7-maz@kernel.org Link: https://lore.kernel.org/r/20211004150552.3844830-1-maz@kernel.org Tested-by: Alyssa Rosenzweig <alyssa@rosenzweig.io> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-11-04irqdomain: Make of_phandle_args_to_fwspec() generally availableMarc Zyngier
of_phandle_args_to_fwspec() can be generally useful to code extracting a DT of_phandle and using an irq_fwspec to use the hierarchical irqdomain API. Make it visible to the rest of the kernel, including modules. Link: https://lore.kernel.org/r/20210929163847.2807812-2-maz@kernel.org Tested-by: Alyssa Rosenzweig <alyssa@rosenzweig.io> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-11-04Merge tag 'iommu-updates-v5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu updates from Joerg Roedel: - Intel IOMMU Updates fro Lu Baolu: - Dump DMAR translation structure when DMA fault occurs - An optimization in the page table manipulation code - Use second level for GPA->HPA translation - Various cleanups - Arm SMMU Updates from Will - Minor optimisations to SMMUv3 command creation and submission - Numerous new compatible string for Qualcomm SMMUv2 implementations - Fixes for the SWIOTLB based implemenation of dma-iommu code for untrusted devices - Add support for r8a779a0 to the Renesas IOMMU driver and DT matching code for r8a77980 - A couple of cleanups and fixes for the Apple DART IOMMU driver - Make use of generic report_iommu_fault() interface in the AMD IOMMU driver - Various smaller fixes and cleanups * tag 'iommu-updates-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (35 commits) iommu/dma: Fix incorrect error return on iommu deferred attach iommu/dart: Initialize DART_STREAMS_ENABLE iommu/dma: Use kvcalloc() instead of kvzalloc() iommu/tegra-smmu: Use devm_bitmap_zalloc when applicable iommu/dart: Use kmemdup instead of kzalloc and memcpy iommu/vt-d: Avoid duplicate removing in __domain_mapping() iommu/vt-d: Convert the return type of first_pte_in_page to bool iommu/vt-d: Clean up unused PASID updating functions iommu/vt-d: Delete dev_has_feat callback iommu/vt-d: Use second level for GPA->HPA translation iommu/vt-d: Check FL and SL capability sanity in scalable mode iommu/vt-d: Remove duplicate identity domain flag iommu/vt-d: Dump DMAR translation structure when DMA fault occurs iommu/vt-d: Do not falsely log intel_iommu is unsupported kernel option iommu/arm-smmu-qcom: Request direct mapping for modem device iommu: arm-smmu-qcom: Add compatible for QCM2290 dt-bindings: arm-smmu: Add compatible for QCM2290 SoC iommu/arm-smmu-qcom: Add SM6350 SMMU compatible dt-bindings: arm-smmu: Add compatible for SM6350 SoC iommu/arm-smmu-v3: Properly handle the return value of arm_smmu_cmdq_build_cmd() ...
2021-11-03Merge branch 'per_signal_struct_coredumps-for-v5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull per signal_struct coredumps from Eric Biederman: "Current coredumps are mixed up with the exit code, the signal handling code, and the ptrace code making coredumps much more complicated than necessary and difficult to follow. This series of changes starts with ptrace_stop and cleans it up, making it easier to follow what is happening in ptrace_stop. Then cleans up the exec interactions with coredumps. Then cleans up the coredump interactions with exit. Finally the coredump interactions with the signal handling code is cleaned up. The first and last changes are bug fixes for minor bugs. I believe the fact that vfork followed by execve can kill the process the called vfork if exec fails is sufficient justification to change the userspace visible behavior. In previous discussions some of these changes were organized differently and individually appeared to make the code base worse. As currently written I believe they all stand on their own as cleanups and bug fixes. Which means that even if the worst should happen and the last change needs to be reverted for some unimaginable reason, the code base will still be improved. If the worst does not happen there are a more cleanups that can be made. Signals that generate coredumps can easily become eligible for short circuit delivery in complete_signal. The entire rendezvous for generating a coredump can move into get_signal. The function force_sig_info_to_task be written in a way that does not modify the signal handling state of the target task (because coredumps are eligible for short circuit delivery). Many of these future cleanups can be done another way but nothing so cleanly as if coredumps become per signal_struct" * 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: coredump: Limit coredumps to a single thread group coredump: Don't perform any cleanups before dumping core exit: Factor coredump_exit_mm out of exit_mm exec: Check for a pending fatal signal instead of core_state ptrace: Remove the unnecessary arguments from arch_ptrace_stop signal: Remove the bogus sigkill_pending in ptrace_stop
2021-11-03Merge tag 'trace-v5.16-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull more tracing updates from Steven Rostedt: - osnoise and timerlat updates that will work with the RTLA tool (Real-Time Linux Analysis). Specifically it disconnects the work load (threads that look for latency) from the tracing instances attached to them, allowing for more than one instance to retrieve data from the work load. - Optimization on division in the trace histogram trigger code to use shift and multiply when possible. Also added documentation. - Fix prototype to my_direct_func in direct ftrace trampoline sample code. * tag 'trace-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace/samples: Add missing prototype for my_direct_func tracing/selftests: Add tests for hist trigger expression parsing tracing/histogram: Document hist trigger variables tracing/histogram: Update division by 0 documentation tracing/histogram: Optimize division by constants tracing/osnoise: Remove PREEMPT_RT ifdefs from inside functions tracing/osnoise: Remove STACKTRACE ifdefs from inside functions tracing/osnoise: Allow multiple instances of the same tracer tracing/osnoise: Remove TIMERLAT ifdefs from inside functions tracing/osnoise: Support a list of trace_array *tr tracing/osnoise: Use start/stop_per_cpu_kthreads() on osnoise_cpus_write() tracing/osnoise: Split workload start from the tracer start tracing/osnoise: Improve comments about barrier need for NMI callbacks tracing/osnoise: Do not follow tracing_cpumask
2021-11-02Merge tag 'pm-5.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These make the power management of PCI devices with ACPI companions more straightforwad, add support for inefficient operating performance points to the Energy model and make cpufreq handle them as appropriate, rearrange the handling of cpuidle during system PM transitions, update a few cpufreq drivers and intel_idle, fix assorded issues and clean up code in multiple places. Specifics: - Add support for inefficient operating performance points to the Energy Model and modify cpufreq to use them properly (Vincent Donnefort). - Rearrange the DTPM framework code to simplify it and make it easier to follow (Daniel Lezcano). - Fix power intialization in DTPM (Daniel Lezcano). - Add CPU load consideration when estimating the instaneous power consumption in DTPM (Daniel Lezcano). - Fix cpu->pstate.turbo_freq initialization in intel_pstate (Zhang Rui). - Make intel_pstate process HWP Guaranteed change notifications from the processor (Srinivas Pandruvada). - Fix typo in cpufreq.h (Rafael Wysocki). - Fix tegra driver to handle BPMP errors properly (Mikko Perttunen). - Fix the parameter usage of the newly added perf-domain API (Hector Yuan). - Minor cleanups to cppc, vexpress and s3c244x drivers (Han Wang, Guenter Roeck, and Arnd Bergmann). - Fix kobject memory leaks in cpuidle error paths (Anel Orazgaliyeva). - Make intel_idle enable interrupts before entering C1 on some Xeon processor models (Artem Bityutskiy). - Clean up hib_wait_io() (Falla Coulibaly). - Fix sparse warnings in hibernation-related code (Anders Roxell). - Use vzalloc() and kzalloc() instead of their open-coded equivalents in hibernation-related code (Cai Huoqing). - Prevent user space from crashing the kernel by attempting to restore the system state from a swap partition in use (Ye Bin). - Do not let "syscore" devices runtime-suspend during system PM transitions (Rafael Wysocki). - Do not pause cpuidle in the suspend-to-idle path (Rafael Wysocki). - Pause cpuidle later and resume it earlier during system PM transitions (Rafael Wysocki). - Make system suspend code use valid_state() consistently (Rafael Wysocki). - Add support for enabling wakeup IRQs after invoking the ->runtime_suspend() callback and make two drivers use it (Chunfeng Yun). - Make the association of ACPI device objects with PCI devices more straightforward and simplify the code doing that for all devices in general (Rafael Wysocki). - Eliminate struct pci_platform_pm_ops and handle the both of its users (PCI and Intel MID) directly in the PCI bus code (Rafael Wysocki). - Simplify and clarify ACPI PCI device PM helpers (Rafael Wysocki). - Fix ordering of operations in pci_back_from_sleep() (Rafael Wysocki). - Make exynos-ppmu use hyphens in DT properties (Krzysztof Kozlowski). - Simplify parsing event-type from DT in exynos-ppmu (Krzysztof Kozlowski). - Strengthen check for freq_table in devfreq (Samuel Holland)" * tag 'pm-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (49 commits) cpufreq: Fix parameter in parse_perf_domain() usb: mtu3: enable wake-up interrupt after runtime_suspend called usb: xhci-mtk: enable wake-up interrupt after runtime_suspend called PM / wakeirq: support enabling wake-up irq after runtime_suspend called PM / devfreq: Strengthen check for freq_table devfreq: exynos-ppmu: simplify parsing event-type from DT devfreq: exynos-ppmu: use node names with hyphens cpufreq: intel_pstate: Fix cpu->pstate.turbo_freq initialization PM: suspend: Use valid_state() consistently PM: sleep: Pause cpuidle later and resume it earlier during system transitions PM: suspend: Do not pause cpuidle in the suspend-to-idle path PM: sleep: Do not let "syscore" devices runtime-suspend during system transitions PM: hibernate: Get block device exclusively in swsusp_check() powercap/drivers/dtpm: Fix power limit initialization powercap/drivers/dtpm: Scale the power with the load powercap/drivers/dtpm: Use container_of instead of a private data field powercap/drivers/dtpm: Simplify the dtpm table powercap/drivers/dtpm: Encapsulate even more the code PM: hibernate: swap: Use vzalloc() and kzalloc() PM: hibernate: fix sparse warnings ...
2021-11-02Merge branch 'ucount-fixes-for-v5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ucount cleanups from Eric Biederman: "While working on the ucount fixes a for v5.15 a number of cleanups suggested themselves. Little things like not testing for NULL when a pointer can not be NULL and wrapping atomic_add_negative with a more descriptive name, so that people reading the code can more quickly understand what is going on" * 'ucount-fixes-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ucounts: Use atomic_long_sub_return for clarity ucounts: Add get_ucounts_or_wrap for clarity ucounts: Remove unnecessary test for NULL ucount in get_ucounts ucounts: In set_cred_ucounts assume new->ucounts is non-NULL
2021-11-02Merge branch 'for-5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - The misc controller now reports allocation rejections through misc.events instead of printking - cgroup_mutex usage is reduced to improve scalability of some operations - vhost helper threads are now assigned to the right cgroup on cgroup2 - Bug fixes * 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.c cgroup: Fix rootcg cpu.stat guest double counting cgroup: no need for cgroup_mutex for /proc/cgroups cgroup: remove cgroup_mutex from cgroupstats_build cgroup: reduce dependency on cgroup_mutex cgroup: cgroup-v1: do not exclude cgrp_dfl_root cgroup: Make rebind_subsystems() disable v2 controllers all at once docs/cgroup: add entry for misc.events misc_cgroup: remove error log to avoid log flood misc_cgroup: introduce misc.events to count failures
2021-11-02Merge branch 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: "Nothing too interesting. An optimization to short-circuit noop cpumask updates, debug dump code reorg, and doc update" * 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: doc: Call out the non-reentrance conditions workqueue: Introduce show_one_worker_pool and show_one_workqueue. workqueue: make sysfs of unbound kworker cpumask more clever
2021-11-02Merge branches 'pm-em' and 'powercap'Rafael J. Wysocki
Merge Energy Model and power capping updates for 5.16-rc1: - Add support for inefficient operating performance points to the Energy Model and modify cpufreq to use them properly (Vincent Donnefort). - Rearrange the DTPM framework code to simplify it and make it easier to follow (Daniel Lezcano). - Fix power intialization in DTPM (Daniel Lezcano). - Add CPU load consideration when estimating the instaneous power consumption in DTPM (Daniel Lezcano). * pm-em: cpufreq: mediatek-hw: Fix cpufreq_table_find_index_dl() call PM: EM: Mark inefficiencies in CPUFreq cpufreq: Use CPUFREQ_RELATION_E in DVFS governors cpufreq: Introducing CPUFREQ_RELATION_E cpufreq: Add an interface to mark inefficient frequencies cpufreq: Make policy min/max hard requirements PM: EM: Allow skipping inefficient states PM: EM: Extend em_perf_domain with a flag field PM: EM: Mark inefficient states PM: EM: Fix inefficient states detection * powercap: powercap/drivers/dtpm: Fix power limit initialization powercap/drivers/dtpm: Scale the power with the load powercap/drivers/dtpm: Use container_of instead of a private data field powercap/drivers/dtpm: Simplify the dtpm table powercap/drivers/dtpm: Encapsulate even more the code
2021-11-02Merge branch 'pm-sleep'Rafael J. Wysocki
Merge updates related to system sleep for 5.16-rc1: - Clean up hib_wait_io() (Falla Coulibaly). - Fix sparse warnings in hibernation-related code (Anders Roxell). - Use vzalloc() and kzalloc() instead of their open-coded equivalents in hibernation-related code (Cai Huoqing). - Prevent user space from crashing the kernel by attempting to restore the system state from a swap partition in use (Ye Bin). - Do not let "syscore" devices runtime-suspend during system PM transitions (Rafael Wysocki). - Do not pause cpuidle in the suspend-to-idle path (Rafael Wysocki). - Pause cpuidle later and resume it earlier during system PM transitions (Rafael Wysocki). - Make system suspend code use valid_state() consistently (Rafael Wysocki). - Add support for enabling wakeup IRQs after invoking the ->runtime_suspend() callback and make two drivers use it (Chunfeng Yun). * pm-sleep: usb: mtu3: enable wake-up interrupt after runtime_suspend called usb: xhci-mtk: enable wake-up interrupt after runtime_suspend called PM / wakeirq: support enabling wake-up irq after runtime_suspend called PM: suspend: Use valid_state() consistently PM: sleep: Pause cpuidle later and resume it earlier during system transitions PM: suspend: Do not pause cpuidle in the suspend-to-idle path PM: sleep: Do not let "syscore" devices runtime-suspend during system transitions PM: hibernate: Get block device exclusively in swsusp_check() PM: hibernate: swap: Use vzalloc() and kzalloc() PM: hibernate: fix sparse warnings Revert "PM: sleep: Do not assume that "mem" is always present" PM: hibernate: Remove blk_status_to_errno in hib_wait_io PM: sleep: Do not assume that "mem" is always present
2021-11-02Merge tag 'printk-for-5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Extend %pGp print format to print hex value of the page flags - Use kvmalloc instead of kmalloc to allocate devkmsg buffers - Misc cleanup and warning fixes * tag 'printk-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: vsprintf: Update %pGp documentation about that it prints hex value lib/vsprintf.c: Amend static asserts for format specifier flags vsprintf: Make %pGp print the hex value test_printf: Append strings more efficiently test_printf: Remove custom appending of '|' test_printf: Remove separate page_flags variable test_printf: Make pft array const ia64: don't do IA64_CMPXCHG_DEBUG without CONFIG_PRINTK printk: use gnu_printf format attribute for printk_sprint() printk: avoid -Wsometimes-uninitialized warning printk: use kvmalloc instead of kmalloc for devkmsg_user
2021-11-02Merge tag 'net-next-for-5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Remove socket skb caches - Add a SO_RESERVE_MEM socket op to forward allocate buffer space and avoid memory accounting overhead on each message sent - Introduce managed neighbor entries - added by control plane and resolved by the kernel for use in acceleration paths (BPF / XDP right now, HW offload users will benefit as well) - Make neighbor eviction on link down controllable by userspace to work around WiFi networks with bad roaming implementations - vrf: Rework interaction with netfilter/conntrack - fq_codel: implement L4S style ce_threshold_ect1 marking - sch: Eliminate unnecessary RCU waits in mini_qdisc_pair_swap() BPF: - Add support for new btf kind BTF_KIND_TAG, arbitrary type tagging as implemented in LLVM14 - Introduce bpf_get_branch_snapshot() to capture Last Branch Records - Implement variadic trace_printk helper - Add a new Bloomfilter map type - Track <8-byte scalar spill and refill - Access hw timestamp through BPF's __sk_buff - Disallow unprivileged BPF by default - Document BPF licensing Netfilter: - Introduce egress hook for looking at raw outgoing packets - Allow matching on and modifying inner headers / payload data - Add NFT_META_IFTYPE to match on the interface type either from ingress or egress Protocols: - Multi-Path TCP: - increase default max additional subflows to 2 - rework forward memory allocation - add getsockopts: MPTCP_INFO, MPTCP_TCPINFO, MPTCP_SUBFLOW_ADDRS - MCTP flow support allowing lower layer drivers to configure msg muxing as needed - Automatic Multicast Tunneling (AMT) driver based on RFC7450 - HSR support the redbox supervision frames (IEC-62439-3:2018) - Support for the ip6ip6 encapsulation of IOAM - Netlink interface for CAN-FD's Transmitter Delay Compensation - Support SMC-Rv2 eliminating the current same-subnet restriction, by exploiting the UDP encapsulation feature of RoCE adapters - TLS: add SM4 GCM/CCM crypto support - Bluetooth: initial support for link quality and audio/codec offload Driver APIs: - Add a batched interface for RX buffer allocation in AF_XDP buffer pool - ethtool: Add ability to control transceiver modules' power mode - phy: Introduce supported interfaces bitmap to express MAC capabilities and simplify PHY code - Drop rtnl_lock from DSA .port_fdb_{add,del} callbacks New drivers: - WiFi driver for Realtek 8852AE 802.11ax devices (rtw89) - Ethernet driver for ASIX AX88796C SPI device (x88796c) Drivers: - Broadcom PHYs - support 72165, 7712 16nm PHYs - support IDDQ-SR for additional power savings - PHY support for QCA8081, QCA9561 PHYs - NXP DPAA2: support for IRQ coalescing - NXP Ethernet (enetc): support for software TCP segmentation - Renesas Ethernet (ravb) - support DMAC and EMAC blocks of Gigabit-capable IP found on RZ/G2L SoC - Intel 100G Ethernet - support for eswitch offload of TC/OvS flow API, including offload of GRE, VxLAN, Geneve tunneling - support application device queues - ability to assign Rx and Tx queues to application threads - PTP and PPS (pulse-per-second) extensions - Broadcom Ethernet (bnxt) - devlink health reporting and device reload extensions - Mellanox Ethernet (mlx5) - offload macvlan interfaces - support HW offload of TC rules involving OVS internal ports - support HW-GRO and header/data split - support application device queues - Marvell OcteonTx2: - add XDP support for PF - add PTP support for VF - Qualcomm Ethernet switch (qca8k): support for QCA8328 - Realtek Ethernet DSA switch (rtl8366rb) - support bridge offload - support STP, fast aging, disabling address learning - support for Realtek RTL8365MB-VC, a 4+1 port 10M/100M/1GE switch - Mellanox Ethernet/IB switch (mlxsw) - multi-level qdisc hierarchy offload (e.g. RED, prio and shaping) - offload root TBF qdisc as port shaper - support multiple routing interface MAC address prefixes - support for IP-in-IP with IPv6 underlay - MediaTek WiFi (mt76) - mt7921 - ASPM, 6GHz, SDIO and testmode support - mt7915 - LED and TWT support - Qualcomm WiFi (ath11k) - include channel rx and tx time in survey dump statistics - support for 80P80 and 160 MHz bandwidths - support channel 2 in 6 GHz band - spectral scan support for QCN9074 - support for rx decapsulation offload (data frames in 802.3 format) - Qualcomm phone SoC WiFi (wcn36xx) - enable Idle Mode Power Save (IMPS) to reduce power consumption during idle - Bluetooth driver support for MediaTek MT7922 and MT7921 - Enable support for AOSP Bluetooth extension in Qualcomm WCN399x and Realtek 8822C/8852A - Microsoft vNIC driver (mana) - support hibernation and kexec - Google vNIC driver (gve) - support for jumbo frames - implement Rx page reuse Refactor: - Make all writes to netdev->dev_addr go thru helpers, so that we can add this address to the address rbtree and handle the updates - Various TCP cleanups and optimizations including improvements to CPU cache use - Simplify the gnet_stats, Qdisc stats' handling and remove qdisc->running sequence counter - Driver changes and API updates to address devlink locking deficiencies" * tag 'net-next-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2122 commits) Revert "net: avoid double accounting for pure zerocopy skbs" selftests: net: add arp_ndisc_evict_nocarrier net: ndisc: introduce ndisc_evict_nocarrier sysctl parameter net: arp: introduce arp_evict_nocarrier sysctl parameter libbpf: Deprecate AF_XDP support kbuild: Unify options for BTF generation for vmlinux and modules selftests/bpf: Add a testcase for 64-bit bounds propagation issue. bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit. bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off. net: vmxnet3: remove multiple false checks in vmxnet3_ethtool.c net: avoid double accounting for pure zerocopy skbs tcp: rename sk_wmem_free_skb netdevsim: fix uninit value in nsim_drv_configure_vfs() selftests/bpf: Fix also no-alu32 strobemeta selftest bpf: Add missing map_delete_elem method to bloom filter map selftests/bpf: Add bloom map success test for userspace calls bpf: Add alignment padding for "map_extra" + consolidate holes bpf: Bloom filter map naming fixups selftests/bpf: Add test cases for struct_ops prog bpf: Add dummy BPF STRUCT_OPS for test purpose ...
2021-11-02Merge branch 'for-5.16-vsprintf-pgp' into for-linusPetr Mladek
2021-11-01Merge tag 'audit-pr-20211101' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "Add some additional audit logging to capture the openat2() syscall open_how struct info. Previous variations of the open()/openat() syscalls allowed audit admins to inspect the syscall args to get the information contained in the new open_how struct used in openat2()" * tag 'audit-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: return early if the filter rule has a lower priority audit: add OPENAT2 record to list "how" info audit: add support for the openat2 syscall audit: replace magic audit syscall class numbers with macros lsm_audit: avoid overloading the "key" audit field audit: Convert to SPDX identifier audit: rename struct node to struct audit_node to prevent future name collisions
2021-11-01Merge tag 'selinux-pr-20211101' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux updates from Paul Moore: - Add LSM/SELinux/Smack controls and auditing for io-uring. As usual, the individual commit descriptions have more detail, but we were basically missing two things which we're adding here: + establishment of a proper audit context so that auditing of io-uring ops works similarly to how it does for syscalls (with some io-uring additions because io-uring ops are *not* syscalls) + additional LSM hooks to enable access control points for some of the more unusual io-uring features, e.g. credential overrides. The additional audit callouts and LSM hooks were done in conjunction with the io-uring folks, based on conversations and RFC patches earlier in the year. - Fixup the binder credential handling so that the proper credentials are used in the LSM hooks; the commit description and the code comment which is removed in these patches are helpful to understand the background and why this is the proper fix. - Enable SELinux genfscon policy support for securityfs, allowing improved SELinux filesystem labeling for other subsystems which make use of securityfs, e.g. IMA. * tag 'selinux-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: security: Return xattr name from security_dentry_init_security() selinux: fix a sock regression in selinux_ip_postroute_compat() binder: use cred instead of task for getsecid binder: use cred instead of task for selinux checks binder: use euid from cred instead of using task LSM: Avoid warnings about potentially unused hook variables selinux: fix all of the W=1 build warnings selinux: make better use of the nf_hook_state passed to the NF hooks selinux: fix race condition when computing ocontext SIDs selinux: remove unneeded ipv6 hook wrappers selinux: remove the SELinux lockdown implementation selinux: enable genfscon labeling for securityfs Smack: Brutalist io_uring support selinux: add support for the io_uring access controls lsm,io_uring: add LSM hooks to io_uring io_uring: convert io_uring to the secure anon inode interface fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure() audit: add filtering for io_uring records audit,io_uring,io-wq: add some basic audit support to io_uring audit: prepare audit_context for use in calling contexts beyond syscalls
2021-11-01Merge tag 'rcu.2021.11.01a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Miscellaneous fixes - Torture-test updates for smp_call_function(), most notably improved checking of module parameters. - Tasks-trace RCU updates that fix a number of rare but important race-condition bugs. - Other torture-test updates, most notably better checking of module parameters. In addition, rcutorture may once again be run on CONFIG_PREEMPT_RT kernels. - Torture-test scripting updates, most notably specifying the new CONFIG_KCSAN_STRICT kconfig option rather than maintaining an ever-changing list of individual KCSAN kconfig options. * tag 'rcu.2021.11.01a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (46 commits) rcu: Fix rcu_dynticks_curr_cpu_in_eqs() vs noinstr rcu: Always inline rcu_dynticks_task*_{enter,exit}() torture: Make kvm-remote.sh print size of downloaded tarball torture: Allot 1G of memory for scftorture runs tools/rcu: Add an extract-stall script scftorture: Warn on individual scf_torture_init() error conditions scftorture: Count reschedule IPIs scftorture: Account for weight_resched when checking for all zeroes scftorture: Shut down if nonsensical arguments given scftorture: Allow zero weight to exclude an smp_call_function*() category rcu: Avoid unneeded function call in rcu_read_unlock() rcu-tasks: Update comments to cond_resched_tasks_rcu_qs() rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader rcu-tasks: Fix read-side primitives comment for call_rcu_tasks_trace rcu-tasks: Clarify read side section info for rcu_tasks_rude GP primitives rcu-tasks: Correct comparisons for CPU numbers in show_stalled_task_trace rcu-tasks: Correct firstreport usage in check_all_holdout_tasks_trace rcu-tasks: Fix s/rcu_add_holdout/trc_add_holdout/ typo in comment rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop rcu-tasks: Fix s/instruction/instructions/ typo in comment ...
2021-11-01Merge tag 'trace-v5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: - kprobes: Restructured stack unwinder to show properly on x86 when a stack dump happens from a kretprobe callback. - Fix to bootconfig parsing - Have tracefs allow owner and group permissions by default (only denying others). There's been pressure to allow non root to tracefs in a controlled fashion, and using groups is probably the safest. - Bootconfig memory managament updates. - Bootconfig clean up to have the tools directory be less dependent on changes in the kernel tree. - Allow perf to be traced by function tracer. - Rewrite of function graph tracer to be a callback from the function tracer instead of having its own trampoline (this change will happen on an arch by arch basis, and currently only x86_64 implements it). - Allow multiple direct trampolines (bpf hooks to functions) be batched together in one synchronization. - Allow histogram triggers to add variables that can perform calculations against the event's fields. - Use the linker to determine architecture callbacks from the ftrace trampoline to allow for proper parameter prototypes and prevent warnings from the compiler. - Extend histogram triggers to key off of variables. - Have trace recursion use bit magic to determine preempt context over if branches. - Have trace recursion disable preemption as all use cases do anyway. - Added testing for verification of tracing utilities. - Various small clean ups and fixes. * tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (101 commits) tracing/histogram: Fix semicolon.cocci warnings tracing/histogram: Fix documentation inline emphasis warning tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker together tracing: Show size of requested perf buffer bootconfig: Initialize ret in xbc_parse_tree() ftrace: do CPU checking after preemption disabled ftrace: disable preemption when recursion locked tracing/histogram: Document expression arithmetic and constants tracing/histogram: Optimize division by a power of 2 tracing/histogram: Covert expr to const if both operands are constants tracing/histogram: Simplify handling of .sym-offset in expressions tracing: Fix operator precedence for hist triggers expression tracing: Add division and multiplication support for hist triggers tracing: Add support for creating hist trigger variables from literal selftests/ftrace: Stop tracing while reading the trace file by default MAINTAINERS: Update KPROBES and TRACING entries test_kprobes: Move it from kernel/ to lib/ docs, kprobes: Remove invalid URL and add new reference samples/kretprobes: Fix return value if register_kretprobe() failed lib/bootconfig: Fix the xbc_get_info kerneldoc ...
2021-11-01Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-11-01 We've added 181 non-merge commits during the last 28 day(s) which contain a total of 280 files changed, 11791 insertions(+), 5879 deletions(-). The main changes are: 1) Fix bpf verifier propagation of 64-bit bounds, from Alexei. 2) Parallelize bpf test_progs, from Yucong and Andrii. 3) Deprecate various libbpf apis including af_xdp, from Andrii, Hengqi, Magnus. 4) Improve bpf selftests on s390, from Ilya. 5) bloomfilter bpf map type, from Joanne. 6) Big improvements to JIT tests especially on Mips, from Johan. 7) Support kernel module function calls from bpf, from Kumar. 8) Support typeless and weak ksym in light skeleton, from Kumar. 9) Disallow unprivileged bpf by default, from Pawan. 10) BTF_KIND_DECL_TAG support, from Yonghong. 11) Various bpftool cleanups, from Quentin. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (181 commits) libbpf: Deprecate AF_XDP support kbuild: Unify options for BTF generation for vmlinux and modules selftests/bpf: Add a testcase for 64-bit bounds propagation issue. bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit. bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off. selftests/bpf: Fix also no-alu32 strobemeta selftest bpf: Add missing map_delete_elem method to bloom filter map selftests/bpf: Add bloom map success test for userspace calls bpf: Add alignment padding for "map_extra" + consolidate holes bpf: Bloom filter map naming fixups selftests/bpf: Add test cases for struct_ops prog bpf: Add dummy BPF STRUCT_OPS for test purpose bpf: Factor out helpers for ctx access checking bpf: Factor out a helper to prepare trampoline for struct_ops prog selftests, bpf: Fix broken riscv build riscv, libbpf: Add RISC-V (RV64) support to bpf_tracing.h tools, build: Add RISC-V to HOSTARCH parsing riscv, bpf: Increase the maximum number of iterations selftests, bpf: Add one test for sockmap with strparser selftests, bpf: Fix test_txmsg_ingress_parser error ... ==================== Link: https://lore.kernel.org/r/20211102013123.9005-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-01bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit.Alexei Starovoitov
Similar to unsigned bounds propagation fix signed bounds. The 'Fixes' tag is a hint. There is no security bug here. The verifier was too conservative. Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20211101222153.78759-2-alexei.starovoitov@gmail.com
2021-11-01bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off.Alexei Starovoitov
Before this fix: 166: (b5) if r2 <= 0x1 goto pc+22 from 166 to 189: R2=invP(id=1,umax_value=1,var_off=(0x0; 0xffffffff)) After this fix: 166: (b5) if r2 <= 0x1 goto pc+22 from 166 to 189: R2=invP(id=1,umax_value=1,var_off=(0x0; 0x1)) While processing BPF_JLE the reg_set_min_max() would set true_reg->umax_value = 1 and call __reg_combine_64_into_32(true_reg). Without the fix it would not pass the condition: if (__reg64_bound_u32(reg->umin_value) && __reg64_bound_u32(reg->umax_value)) since umin_value == 0 at this point. Before commit 10bf4e83167c the umin was incorrectly ingored. The commit 10bf4e83167c fixed the correctness issue, but pessimized propagation of 64-bit min max into 32-bit min max and corresponding var_off. Fixes: 10bf4e83167c ("bpf: Fix propagation of 32 bit unsigned bounds from 64 bit bounds") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20211101222153.78759-1-alexei.starovoitov@gmail.com
2021-11-01tracing/histogram: Document hist trigger variablesKalesh Singh
Update the tracefs README to describe how hist trigger variables can be created. Link: https://lkml.kernel.org/r/20211029183339.3216491-4-kaleshsingh@google.com Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/histogram: Optimize division by constantsKalesh Singh
If the divisor is a constant use specific division functions to avoid extra branches when the trigger is hit. If the divisor constant but not a power of 2, the division can be replaced with a multiplication and shift in the following case: Let X = dividend and Y = divisor. Choose Z = some power of 2. If Y <= Z, then: X / Y = (X * (Z / Y)) / Z (Z / Y) is a constant (mult) which is calculated at parse time, so: X / Y = (X * mult) / Z The division by Z can be replaced by a shift since Z is a power of 2: X / Y = (X * mult) >> shift As long, as X < Z the results will not be off by more than 1. Link: https://lkml.kernel.org/r/20211029232410.3494196-1-kaleshsingh@google.com Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01Merge tag 'hardening-v5.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull compiler hardening updates from Kees Cook: "These are various compiler-related hardening feature updates. Notable is the addition of an explicit limited rationale for, and deprecation schedule of, gcc-plugins. gcc-plugins: - remove support for GCC 4.9 and older (Ard Biesheuvel) - remove duplicate include in gcc-common.h (Ye Guojin) - Explicitly document purpose and deprecation schedule (Kees Cook) - Remove cyc_complexity (Kees Cook) instrumentation: - Avoid harmless Clang option under CONFIG_INIT_STACK_ALL_ZERO (Kees Cook) Clang LTO: - kallsyms: strip LTO suffixes from static functions (Nick Desaulniers)" * tag 'hardening-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: gcc-plugins: remove duplicate include in gcc-common.h gcc-plugins: Remove cyc_complexity gcc-plugins: Explicitly document purpose and deprecation schedule kallsyms: strip LTO suffixes from static functions gcc-plugins: remove support for GCC 4.9 and older hardening: Avoid harmless Clang option under CONFIG_INIT_STACK_ALL_ZERO
2021-11-01Merge tag 'cpu-to-thread_info-v5.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull thread_info update to move 'cpu' back from task_struct from Kees Cook: "Cross-architecture update to move task_struct::cpu back into thread_info on arm64, x86, s390, powerpc, and riscv. All Acked by arch maintainers. Quoting Ard Biesheuvel: 'Move task_struct::cpu back into thread_info Keeping CPU in task_struct is problematic for architectures that define raw_smp_processor_id() in terms of this field, as it requires linux/sched.h to be included, which causes a lot of pain in terms of circular dependencies (aka 'header soup') This series moves it back into thread_info (where it came from) for all architectures that enable THREAD_INFO_IN_TASK, addressing the header soup issue as well as some pointless differences in the implementations of task_cpu() and set_task_cpu()'" * tag 'cpu-to-thread_info-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: riscv: rely on core code to keep thread_info::cpu updated powerpc: smp: remove hack to obtain offset of task_struct::cpu sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y powerpc: add CPU field to struct thread_info s390: add CPU field to struct thread_info x86: add CPU field to struct thread_info arm64: add CPU field to struct thread_info
2021-11-01Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "There's the usual summary below, but the highlights are support for the Armv8.6 timer extensions, KASAN support for asymmetric MTE, the ability to kexec() with the MMU enabled and a second attempt at switching to the generic pfn_valid() implementation. Summary: - Support for the Arm8.6 timer extensions, including a self-synchronising view of the system registers to elide some expensive ISB instructions. - Exception table cleanup and rework so that the fixup handlers appear correctly in backtraces. - A handful of miscellaneous changes, the main one being selection of CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK. - More mm and pgtable cleanups. - KASAN support for "asymmetric" MTE, where tag faults are reported synchronously for loads (via an exception) and asynchronously for stores (via a register). - Support for leaving the MMU enabled during kexec relocation, which significantly speeds up the operation. - Minor improvements to our perf PMU drivers. - Improvements to the compat vDSO build system, particularly when building with LLVM=1. - Preparatory work for handling some Coresight TRBE tracing errata. - Cleanup and refactoring of the SVE code to pave the way for SME support in future. - Ensure SCS pages are unpoisoned immediately prior to freeing them when KASAN is enabled for the vmalloc area. - Try moving to the generic pfn_valid() implementation again now that the DMA mapping issue from last time has been resolved. - Numerous improvements and additions to our FPSIMD and SVE selftests" [ armv8.6 timer updates were in a shared branch and already came in through -tip in the timer pull - Linus ] * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (85 commits) arm64: Select POSIX_CPU_TIMERS_TASK_WORK arm64: Document boot requirements for FEAT_SME_FA64 arm64/sve: Fix warnings when SVE is disabled arm64/sve: Add stub for sve_max_virtualisable_vl() arm64: errata: Add detection for TRBE write to out-of-range arm64: errata: Add workaround for TSB flush failures arm64: errata: Add detection for TRBE overwrite in FILL mode arm64: Add Neoverse-N2, Cortex-A710 CPU part definition selftests: arm64: Factor out utility functions for assembly FP tests arm64: vmlinux.lds.S: remove `.fixup` section arm64: extable: add load_unaligned_zeropad() handler arm64: extable: add a dedicated uaccess handler arm64: extable: add `type` and `data` fields arm64: extable: use `ex` for `exception_table_entry` arm64: extable: make fixup_exception() return bool arm64: extable: consolidate definitions arm64: gpr-num: support W registers arm64: factor out GPR numbering helpers arm64: kvm: use kvm_exception_table_entry arm64: lib: __arch_copy_to_user(): fold fixups into body ...
2021-11-01Merge tag 'x86_cc_for_v5.16_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull generic confidential computing updates from Borislav Petkov: "Add an interface called cc_platform_has() which is supposed to be used by confidential computing solutions to query different aspects of the system. The intent behind it is to unify testing of such aspects instead of having each confidential computing solution add its own set of tests to code paths in the kernel, leading to an unwieldy mess" * tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: treewide: Replace the use of mem_encrypt_active() with cc_platform_has() x86/sev: Replace occurrences of sev_es_active() with cc_platform_has() x86/sev: Replace occurrences of sev_active() with cc_platform_has() x86/sme: Replace occurrences of sme_active() with cc_platform_has() powerpc/pseries/svm: Add a powerpc version of cc_platform_has() x86/sev: Add an x86 version of cc_platform_has() arch/cc: Introduce a function to check for confidential computing features x86/ioremap: Selectively build arch override encryption functions
2021-11-01tracing/osnoise: Remove PREEMPT_RT ifdefs from inside functionsDaniel Bristot de Oliveira
Remove CONFIG_PREEMPT_RT from inside functions, avoiding compilation problems in the future. Link: https://lkml.kernel.org/r/37ee0881b033cdc513efc84ebea26cf77880c8c2.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/osnoise: Remove STACKTRACE ifdefs from inside functionsDaniel Bristot de Oliveira
Remove CONFIG_STACKTRACE from inside functions, avoiding compilation problems in the future. Link: https://lkml.kernel.org/r/3465cca2f28e1ba602a1fc8bdb28d12950b5226e.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/osnoise: Allow multiple instances of the same tracerDaniel Bristot de Oliveira
Currently, the user can start only one instance of timerlat/osnoise tracers and the tracers cannot run in parallel. As starting point to add more flexibility, let's allow the same tracer to run on different trace instances. The workload will start when the first trace_array (instance) is registered and stop when the last instance is unregistered. So, while this patch allows the same tracer to run in multiple instances (e.g., two instances running osnoise), it still does not allow instances of timerlat and osnoise in parallel (e.g., one timerlat and osnoise). That is because the osnoise: events have different behavior depending on which tracer is enabled (osnoise or timerlat). Enabling the parallel usage of these two tracers is in my TODO list. Link: https://lkml.kernel.org/r/38c8f14b613492a4f3f938d9d3bf0b063b72f0f0.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/osnoise: Remove TIMERLAT ifdefs from inside functionsDaniel Bristot de Oliveira
Remove CONFIG_TIMERLAT_TRACER from inside functions, avoiding compilation problems in the future. Link: https://lkml.kernel.org/r/8245abb5a112d249f5da6c1df499244ad9e647bc.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/osnoise: Support a list of trace_array *trDaniel Bristot de Oliveira
osnoise/timerlat were built to run a single instance, and for this, a single variable is enough to store the current struct trace_array *tr with information about the tracing instance. This is done via the *osnoise_trace variable. A trace_array represents a trace instance. In preparation to support multiple instances, replace the *osnoise_trace variable with an RCU protected list of instances. The operations that refer to an instance now propagate to all elements of the list (all instances). Also, replace the osnoise_busy variable with a check if the list has elements (busy). No functional change is expected with this patch, i.e., only one instance is allowed yet. Link: https://lkml.kernel.org/r/91d006e889b9a5d1ff258fe6077f021ae3f26372.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/osnoise: Use start/stop_per_cpu_kthreads() on osnoise_cpus_write()Daniel Bristot de Oliveira
When writing a new CPU mask via osnoise/cpus, if the tracer is running, the workload is restarted to follow the new cpumask. The restart is currently done using osnoise_workload_start/stop(), which disables the workload *and* the instrumentation. However, disabling the instrumentation is not necessary. Calling start/stop_per_cpu_kthreads() is enough to apply the new osnoise/cpus config. Link: https://lkml.kernel.org/r/ee633e82867c5b88851aa6040522a799c0034486.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>