path: root/arch/x86
Age | Commit message | Author
2022-09-22 | x86/resctrl: Calculate bandwidth from the previous __mon_event_count() chunks | James Morse
mbm_bw_count() is only called by the mbm_handle_overflow() worker once a second. It reads the hardware register, calculates the bandwidth and updates m->prev_bw_msr which is used to hold the previous hardware register value. Operating directly on hardware register values makes it difficult to make this code architecture independent, so that it can be moved to /fs/, making the mba_sc feature something resctrl supports with no additional support from the architecture. Prior to calling mbm_bw_count(), mbm_update() reads from the same hardware register using __mon_event_count(). Change mbm_bw_count() to use the current chunks value most recently saved by __mon_event_count(). This removes an extra call to __rmid_read(). Instead of using m->prev_msr to calculate the number of chunks seen, use the rr->val that was updated by __mon_event_count(). This removes an extra call to mbm_overflow_count() and get_corrected_mbm_count(). Calculating bandwidth like this means mbm_bw_count() no longer operates on hardware register values directly. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-13-james.morse@arm.com
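The idea above, deriving bandwidth from the difference between two saved chunk totals rather than from raw register values, might be sketched as follows. This is a simplified userspace model, not the kernel's code: `struct mbm_state_sketch`, its fields, `bw_from_chunks()` and the `bytes_per_chunk` parameter are all hypothetical names chosen for illustration, and the sample interval is assumed to be the one second mentioned in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-domain MBM state. */
struct mbm_state_sketch {
	uint64_t prev_chunks;	/* chunk total seen at the previous sample */
	uint64_t prev_bw;	/* bandwidth computed at the previous sample */
};

/*
 * cur_chunks plays the role of the value already saved by the earlier
 * counter read, so no hardware register is touched here.
 */
static uint64_t bw_from_chunks(struct mbm_state_sketch *m, uint64_t cur_chunks,
			       uint64_t bytes_per_chunk)
{
	/* Assumption: overflow was already corrected by the earlier read. */
	assert(cur_chunks >= m->prev_chunks);

	uint64_t delta = cur_chunks - m->prev_chunks;

	m->prev_chunks = cur_chunks;
	m->prev_bw = delta * bytes_per_chunk;	/* bytes per sample interval */
	return m->prev_bw;
}
```

Because the function only sees an architecture-neutral chunk count, nothing in it depends on how the counter was read.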
2022-09-22 | x86/resctrl: Allow update_mba_bw() to update controls directly | James Morse
update_mba_bw() calculates a new control value for the MBA resource based on the user provided mbps_val and the current measured bandwidth. Some control values need remapping by delay_bw_map(). It does this by calling wrmsrl() directly. This needs splitting up to be done by an architecture specific helper, so that the remainder can eventually be moved to /fs/. Add resctrl_arch_update_one() to apply one configuration value to the provided resource and domain. This avoids the staging and cross-calling that is only needed with changes made by user-space. delay_bw_map() moves to be part of the arch code, to maintain the 'percentage control' view of MBA resources in resctrl. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-12-james.morse@arm.com
2022-09-22 | x86/resctrl: Remove architecture copy of mbps_val | James Morse
The resctrl arch code provides a second configuration array mbps_val[] for the MBA software controller. Since resctrl switched over to allocating and freeing its own array when needed, nothing uses the arch code version. Remove it. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-11-james.morse@arm.com
2022-09-22 | x86/resctrl: Switch over to the resctrl mbps_val list | James Morse
Updates to resctrl's software controller follow the same path as other configuration updates, but they don't modify the hardware state. rdtgroup_schemata_write() uses parse_line() and the resource's parse_ctrlval() function to stage the configuration. resctrl_arch_update_domains() then updates the mbps_val[] array instead, and resctrl_arch_update_domains() skips the rdt_ctrl_update() call that would update hardware. This complicates the interface between resctrl's filesystem parts and architecture specific code. It should be possible for mba_sc to be completely implemented by the filesystem parts of resctrl. This would allow it to work on a second architecture with no additional code. resctrl_arch_update_domains() using the mbps_val[] array prevents this. Change parse_bw() to write the configuration value directly to the mbps_val[] array in the domain structure. Change rdtgroup_schemata_write() to skip the call to resctrl_arch_update_domains(), meaning all the mba_sc specific code in resctrl_arch_update_domains() can be removed. On the read-side, show_doms() and update_mba_bw() are changed to read the mbps_val[] array from the domain structure. With this, resctrl_arch_get_config() no longer needs to consider mba_sc resources. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-10-james.morse@arm.com
2022-09-22 | x86/resctrl: Create mba_sc configuration in the rdt_domain | James Morse
To support resctrl's MBA software controller, the architecture must provide a second configuration array to hold the mbps_val[] from user-space. This complicates the interface between the architecture specific code and the filesystem portions of resctrl that will move to /fs/, to allow multiple architectures to support resctrl. Make the filesystem parts of resctrl create an array for the mba_sc values. The software controller can be changed to use this, allowing the architecture code to only consider the values configured in hardware. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-9-james.morse@arm.com
2022-09-22 | x86/resctrl: Abstract and use supports_mba_mbps() | James Morse
To determine whether the mba_MBps option to resctrl should be supported, resctrl tests the boot CPU's x86_vendor. This isn't portable, and needs abstracting behind a helper so this check can be part of the filesystem code that moves to /fs/. Re-use the tests set_mba_sc() does to determine if the mba_sc is supported on this system. An 'alloc_capable' test is added so that support for the controls isn't implied by the 'delay_linear' property, which is always true for MPAM. Because mbm_update() only updates mba_sc if the mbm_local counters are enabled, supports_mba_mbps() checks is_mbm_local_enabled() instead of is_mbm_enabled(), which checks both. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-8-james.morse@arm.com
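The three conditions named in this message can be pictured as one predicate. The sketch below is an assumption-laden model rather than the kernel helper: `struct mba_caps_sketch` and `supports_mba_mbps_sketch()` are hypothetical, and the real helper reads these properties from resctrl's own structures.

```c
#include <stdbool.h>

/* Hypothetical flattened view of the properties the message mentions. */
struct mba_caps_sketch {
	bool mbm_local_enabled;	/* local bandwidth counters are usable */
	bool alloc_capable;	/* the MBA resource supports allocation */
	bool delay_linear;	/* control values scale linearly */
};

/* All three must hold; delay_linear alone must not imply support. */
static bool supports_mba_mbps_sketch(const struct mba_caps_sketch *c)
{
	return c->mbm_local_enabled && c->alloc_capable && c->delay_linear;
}
```

A platform that reports `delay_linear` but cannot allocate is rejected, which is the case the added 'alloc_capable' test is meant to cover.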
2022-09-22 | x86/resctrl: Remove set_mba_sc()s control array re-initialisation | James Morse
set_mba_sc() enables the 'software controller' to regulate the bandwidth based on the byte counters. This can be managed entirely in the parts of resctrl that move to /fs/, without any extra support from the architecture specific code. set_mba_sc() is called by rdt_enable_ctx() during mount and unmount. It currently resets the arch code's ctrl_val[] and mbps_val[] arrays. The ctrl_val[] was already reset when the domain was created, and by reset_all_ctrls() when the filesystem was last unmounted. Doing the work in set_mba_sc() is not necessary as the values are already at their defaults due to the creation of the domain, or were previously reset during umount(), or are about to be reset during umount(). Add a reset of the mbps_val[] in reset_all_ctrls(), allowing the code in set_mba_sc() that reaches in to the architecture specific structures to be removed. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-7-james.morse@arm.com
2022-09-22 | x86/resctrl: Add domain offline callback for resctrl work | James Morse
Because domains are exposed to user-space via resctrl, the filesystem must update its state when CPU hotplug callbacks are triggered. Some of this work is common to any architecture that would support resctrl, but the work is tied up with the architecture code to free the memory. Move the monitor subdir removal and the cancelling of the mbm/limbo works into a new resctrl_offline_domain() call. These bits are not specific to the architecture. Grouping them in one function allows that code to be moved to /fs/ and re-used by another architecture. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-6-james.morse@arm.com
2022-09-22 | x86/resctrl: Group struct rdt_hw_domain cleanup | James Morse
domain_add_cpu() and domain_remove_cpu() need to kfree() the child arrays that were allocated by domain_setup_ctrlval(). As this memory is moved around, and new arrays are created, adjusting the error handling cleanup code becomes noisier. To simplify this, move all the kfree() calls into a domain_free() helper. This depends on struct rdt_hw_domain being kzalloc()d, allowing it to unconditionally kfree() all the child arrays. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-5-james.morse@arm.com
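The pattern described here, a single cleanup helper that relies on zeroed allocation, can be illustrated with a userspace analogue. Everything in this sketch is hypothetical (`struct hw_domain_sketch`, its fields and both functions); it uses calloc()/free() in place of kzalloc()/kfree(), relying on the fact that freeing a NULL pointer is a no-op for both.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of a domain with child arrays. */
struct hw_domain_sketch {
	unsigned int *ctrl_val;
	unsigned long *mbm_counts;
};

/* Safe on a partially set-up domain: unallocated members are still NULL. */
static void domain_free_sketch(struct hw_domain_sketch *d)
{
	if (!d)
		return;
	free(d->ctrl_val);
	free(d->mbm_counts);
	free(d);
}

static struct hw_domain_sketch *domain_alloc_sketch(size_t n, int fail_second)
{
	struct hw_domain_sketch *d;

	assert(n > 0);
	d = calloc(1, sizeof(*d));	/* zeroed, like kzalloc() */
	if (!d)
		return NULL;

	d->ctrl_val = calloc(n, sizeof(*d->ctrl_val));
	if (!d->ctrl_val)
		goto err;

	if (fail_second)		/* simulate the second allocation failing */
		goto err;
	d->mbm_counts = calloc(n, sizeof(*d->mbm_counts));
	if (!d->mbm_counts)
		goto err;

	return d;
err:
	domain_free_sketch(d);		/* one exit path, however far setup got */
	return NULL;
}
```

Adding another child array then means one more free() call in the helper instead of another error label at every call site.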
2022-09-22 | x86/resctrl: Add domain online callback for resctrl work | James Morse
Because domains are exposed to user-space via resctrl, the filesystem must update its state when CPU hotplug callbacks are triggered. Some of this work is common to any architecture that would support resctrl, but the work is tied up with the architecture code to allocate the memory. Move domain_setup_mon_state(), the monitor subdir creation call and the mbm/limbo workers into a new resctrl_online_domain() call. These bits are not specific to the architecture. Grouping them in one function allows that code to be moved to /fs/ and re-used by another architecture. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-4-james.morse@arm.com
2022-09-22 | x86/resctrl: Merge mon_capable and mon_enabled | James Morse
mon_enabled and mon_capable are always set as a pair by rdt_get_mon_l3_config(). There is no point having two values. Merge them together. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-3-james.morse@arm.com
2022-09-22 | x86/resctrl: Kill off alloc_enabled | James Morse
rdt_resources_all[] used to have extra entries for L2CODE/L2DATA. These were hidden from resctrl by the alloc_enabled value. Now that the L2/L2CODE/L2DATA resources have been merged together, alloc_enabled doesn't mean anything, it always has the same value as alloc_capable which indicates allocation is supported by this resource. Remove alloc_enabled and its helpers. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-2-james.morse@arm.com
2022-09-21 | x86/paravirt: Ensure proper alignment | Thomas Gleixner
The entries in the .parainstructions sections are 8 byte aligned and the corresponding C struct paravirt_patch_site makes the array offset 16 bytes. Though the pushed entries are only using 12 bytes, __parainstructions_end is therefore 4 bytes short. That works by chance because it's only used in a loop: for (p = start; p < end; p++) But this falls flat when calculating the number of elements: n = end - start That's obviously off by one. Ensure that the gap is filled and the last entry is occupying 16 bytes. [ bp: Add the proper struct and section names. ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20220915111142.992398801@infradead.org
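The off-by-one can be reproduced with plain byte arithmetic. The sketch below models the section as byte offsets, using the 16-byte stride and 12-byte payload from the message; the function names and the enum constants are hypothetical, and the element count of three is only an example.

```c
#include <assert.h>
#include <stddef.h>

enum {
	ENTRY_STRIDE = 16,	/* array offset of the C struct */
	ENTRY_USED   = 12,	/* bytes actually emitted per entry */
};

/* Mirrors: for (p = start; p < end; p++) */
static size_t count_by_loop(size_t start, size_t end)
{
	size_t n = 0;

	assert(end >= start);
	for (size_t p = start; p < end; p += ENTRY_STRIDE)
		n++;
	return n;
}

/* Mirrors: n = end - start (pointer subtraction divides by the stride) */
static size_t count_by_subtraction(size_t start, size_t end)
{
	assert(end >= start);
	return (end - start) / ENTRY_STRIDE;
}
```

With three entries and an unpadded last one, the end offset is 2 * 16 + 12 = 44: the loop still visits three entries, but the subtraction yields two. Padding the last entry to 16 bytes moves the end to 48 and both agree.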
2022-09-21 | x86/mm/32: Fix W^X detection when page tables do not support NX | Dave Hansen
The x86 MM code now actively refuses to create writable+executable mappings, and warns when there is an attempt to create one. The 0day test robot ran across a warning triggered by module unloading on 32-bit kernels. This was only seen on CPUs with NX support, but where a 32-bit kernel was built without PAE support. On those systems, there is no room for the NX bit in the page tables and _PAGE_NX is #defined to 0, breaking some of the W^X detection logic in verify_rwx(). The X86_FEATURE_NX check in there does not do any good here because the CPU itself supports NX. Fix it by checking for _PAGE_NX support directly instead of checking CPU support for NX. Note that since _PAGE_NX is actually defined to be 0 at compile-time this fix should also end up letting the compiler optimize away most of verify_rwx() on non-PAE kernels. Fixes: 652c5bf380ad ("x86/mm: Refuse W^X violations") Reported-by: kernel test robot <yujie.liu@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/fcf89147-440b-e478-40c9-228c9fe56691@intel.com/
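Why a zero-valued NX mask breaks the check can be shown with a small model. This is not the kernel's verify_rwx(); the two functions, the `SK_PAGE_RW` constant and the `nx_mask` parameter are hypothetical, with `nx_mask == 0` standing in for a non-PAE build where _PAGE_NX is defined to 0.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_RW	(1ULL << 1)	/* assumed writable bit for this model */

/* Old logic: gate on CPU support only. */
static bool wx_violation_cpu_check(uint64_t prot, uint64_t nx_mask, bool cpu_has_nx)
{
	if (!cpu_has_nx)
		return false;
	/* With nx_mask == 0, "NX not set" is true for every mapping. */
	return (prot & SK_PAGE_RW) && !(prot & nx_mask);
}

/* Fixed logic: gate on whether the page tables can express NX at all. */
static bool wx_violation_pte_check(uint64_t prot, uint64_t nx_mask)
{
	if (!nx_mask)
		return false;
	return (prot & SK_PAGE_RW) && !(prot & nx_mask);
}
```

In the model, a plain writable mapping on an NX-capable CPU with no NX bit in the page tables is flagged by the first function and accepted by the second. Since the mask is a compile-time constant in the kernel, the early return is also what lets the compiler discard the rest of the check, as the message notes.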
2022-09-21 | Merge tag 'v6.0-rc6' into locking/core, to refresh the branch | Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-09-21 | arch: um: Mark the stack non-executable to fix a binutils warning | David Gow
Since binutils 2.39, ld will print a warning if any stack section is executable, which is the default for stack sections on files without a .note.GNU-stack section. This was fixed for x86 in commit ffcf9c5700e4 ("x86: link vdso and boot with -z noexecstack --no-warn-rwx-segments"), but remained broken for UML, resulting in several warnings: /usr/bin/ld: warning: arch/x86/um/vdso/vdso.o: missing .note.GNU-stack section implies executable stack /usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker /usr/bin/ld: warning: .tmp_vmlinux.kallsyms1 has a LOAD segment with RWX permissions /usr/bin/ld: warning: .tmp_vmlinux.kallsyms1.o: missing .note.GNU-stack section implies executable stack /usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker /usr/bin/ld: warning: .tmp_vmlinux.kallsyms2 has a LOAD segment with RWX permissions /usr/bin/ld: warning: .tmp_vmlinux.kallsyms2.o: missing .note.GNU-stack section implies executable stack /usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker /usr/bin/ld: warning: vmlinux has a LOAD segment with RWX permissions Link both the VDSO and vmlinux with -z noexecstack, fixing the warnings about .note.GNU-stack sections. In addition, pass --no-warn-rwx-segments to dodge the remaining warnings about LOAD segments with RWX permissions in the kallsyms objects. (Note that this flag is apparently not available on lld, so hide it behind a test for BFD, which is what the x86 patch does.) 
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ffcf9c5700e49c0aee42dcba9a12ba21338e8136 Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=ba951afb99912da01a6e8434126b8fac7aa75107 Signed-off-by: David Gow <davidgow@google.com> Reviewed-by: Lukas Straub <lukasstraub2@web.de> Tested-by: Lukas Straub <lukasstraub2@web.de> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Signed-off-by: Richard Weinberger <richard@nod.at>
2022-09-20 | x86/dumpstack: Don't mention RIP in "Code: " | Jiri Slaby
Commit 238c91115cd0 ("x86/dumpstack: Fix misleading instruction pointer error message") changed the "Code:" line in bug reports when RIP is an invalid pointer. In particular, the report currently says (for example): BUG: kernel NULL pointer dereference, address: 0000000000000000 ... RIP: 0010:0x0 Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6. The message "Unable to access opcode bytes at RIP 0xffffffffffffffd6." is quite confusing, as the RIP value is 0, not -42. That -42 comes from "regs->ip - PROLOGUE_SIZE", because Code is dumped with some prologue (and epilogue). So do not mention "RIP" on this line in this context. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/b772c39f-c5ae-8f17-fe6e-6a2bc4d1f83b@kernel.org
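The odd-looking address in the report is just unsigned wraparound. A minimal sketch, assuming a 64-bit instruction pointer and the prologue size of 42 implied by the message (`PROLOGUE_SIZE_SKETCH` and `code_dump_start()` are hypothetical names):

```c
#include <stdint.h>

#define PROLOGUE_SIZE_SKETCH	42	/* bytes dumped before the faulting ip */

/* Start address of the "Code:" dump for a given instruction pointer. */
static uint64_t code_dump_start(uint64_t ip)
{
	return ip - PROLOGUE_SIZE_SKETCH;	/* wraps modulo 2^64 when ip < 42 */
}
```

For ip = 0 this yields 0xffffffffffffffd6, the value printed in the report, which is the dump's start address and not RIP itself.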
2022-09-20 | x86/asm/bitops: Use __builtin_ctzl() to evaluate constant expressions | Vincent Mailhol
If x is not 0, __ffs(x) is equivalent to:

  (unsigned long)__builtin_ctzl(x)

And if x is not ~0UL, ffz(x) is equivalent to:

  (unsigned long)__builtin_ctzl(~x)

Because __builtin_ctzl() returns an int, a cast to (unsigned long) is necessary to avoid potential warnings on implicit casts.

Concerning the edge cases, __builtin_ctzl(0) is always undefined, whereas __ffs(0) and ffz(~0UL) may or may not be defined, depending on the processor. Regardless, for both functions, developers are asked to check against 0 or ~0UL, so replacing __ffs() or ffz() by __builtin_ctzl() is safe.

For x86_64, the current __ffs() and ffz() implementations do not produce optimized code when called with a constant expression. In contrast, __builtin_ctzl() folds into a single instruction.

However, for non-constant expressions, the kernel's asm versions of __ffs() and ffz() remain slightly better than the code produced by GCC (which emits a useless instruction to clear eax).

Use __builtin_constant_p() to select between the kernel's __ffs()/ffz() and __builtin_ctzl() depending on whether the argument is constant or not.

** Statistics **

On an allyesconfig, before...:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  3607

...and after:

  $ objdump -d vmlinux.o | grep tzcnt | wc -l
  2600

So, roughly 27.9% of the calls to either __ffs() or ffz() were using constant expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

Note: on x86_64, the BSF instruction produces TZCNT when used with the REP prefix (which explains the use of `grep tzcnt' instead of `grep bsf' in the above benchmark). c.f. [1]

[1] e26a44a2d618 ("x86: Use REP BSF unconditionally")

[ bp: Massage commit message. ]

Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Link: https://lore.kernel.org/r/20220511160319.1045812-1-mailhol.vincent@wanadoo.fr
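The selection technique can be sketched in userspace with GCC or clang. The macro and function names below are hypothetical, and the portable loop merely stands in for the kernel's inline asm, which this sketch does not reproduce.

```c
#include <assert.h>

/* Stand-in for the asm version; like __ffs(), undefined for word == 0. */
static inline unsigned long variable_first_set(unsigned long word)
{
	unsigned long bit = 0;

	assert(word != 0);	/* callers are expected to check beforehand */
	while (!(word & 1UL)) {
		word >>= 1;
		bit++;
	}
	return bit;
}

/* Constant arguments fold at compile time; others take the fallback. */
#define first_set_sketch(w)					\
	(__builtin_constant_p(w) ?				\
	 (unsigned long)__builtin_ctzl(w) :			\
	 variable_first_set(w))

/* First zero bit is the first set bit of the complement. */
#define first_zero_sketch(w) first_set_sketch(~(unsigned long)(w))
```

Only one branch of the conditional is ever evaluated, so a non-constant argument is read once. As with the functions in the message, passing 0 to `first_set_sketch()` or ~0UL to `first_zero_sketch()` is the caller's error.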
2022-09-20 | x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions | Vincent Mailhol
For x86_64, the current ffs() implementation does not produce optimized code when called with a constant expression. In contrast, the __builtin_ffs() functions of both GCC and clang are able to fold the expression into a single instruction.

** Example **

Consider two dummy functions foo() and bar() as below:

  #include <linux/bitops.h>
  #define CONST 0x01000000

  unsigned int foo(void)
  {
          return ffs(CONST);
  }

  unsigned int bar(void)
  {
          return __builtin_ffs(CONST);
  }

GCC would produce the assembly code below:

  0000000000000000 <foo>:
     0: ba 00 00 00 01          mov    $0x1000000,%edx
     5: b8 ff ff ff ff          mov    $0xffffffff,%eax
     a: 0f bc c2                bsf    %edx,%eax
     d: 83 c0 01                add    $0x1,%eax
    10: c3                      ret
  <Instructions after ret and before next function were redacted>

  0000000000000020 <bar>:
    20: b8 19 00 00 00          mov    $0x19,%eax
    25: c3                      ret

And clang would produce:

  0000000000000000 <foo>:
     0: b8 ff ff ff ff          mov    $0xffffffff,%eax
     5: 0f bc 05 00 00 00 00    bsf    0x0(%rip),%eax        # c <foo+0xc>
     c: 83 c0 01                add    $0x1,%eax
     f: c3                      ret

  0000000000000010 <bar>:
    10: b8 19 00 00 00          mov    $0x19,%eax
    15: c3                      ret

Both examples clearly demonstrate the benefit of using __builtin_ffs() instead of the kernel's asm implementation for constant expressions.

However, for non-constant expressions, the kernel's ffs() asm version remains better for x86_64 because, contrary to GCC, it doesn't emit the CMOV assembly instruction, c.f. [1] (noticeably, clang is able to optimize out the CMOV call).

Use __builtin_constant_p() to select between the kernel's ffs() and the __builtin_ffs() depending on whether the argument is constant or not.

As a side benefit, replacing the ffs() function declaration by a macro also removes the -Wshadow warning below:

  ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
    283 | static __always_inline int ffs(int x)

** Statistics **

On an allyesconfig, before...:

  $ objdump -d vmlinux.o | grep bsf | wc -l
  1081

...and after:

  $ objdump -d vmlinux.o | grep bsf | wc -l
  792

So, roughly 26.7% of the calls to ffs() were using constant expressions and could be optimized out.

(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)

[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")

[ bp: Massage commit message. ]

Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Link: https://lore.kernel.org/r/20220511160319.1045812-1-mailhol.vincent@wanadoo.fr
2022-09-19 | smp: add set_nr_cpu_ids() | Yury Norov
In preparation to support compile-time nr_cpu_ids, add a setter for the variable. This is a no-op for all arches. Signed-off-by: Yury Norov <yury.norov@gmail.com>
2022-09-19 | um: Cleanup compiler warning in arch/x86/um/tls_32.c | Lukas Straub
arch.tls_array is statically allocated so checking for NULL doesn't make sense. This causes the compiler warning below. Remove the checks to silence these warnings. ../arch/x86/um/tls_32.c: In function 'get_free_idx': ../arch/x86/um/tls_32.c:68:13: warning: the comparison will always evaluate as 'true' for the address of 'tls_array' will never be NULL [-Waddress] 68 | if (!t->arch.tls_array) | ^ In file included from ../arch/x86/um/asm/processor.h:10, from ../include/linux/rcupdate.h:30, from ../include/linux/rculist.h:11, from ../include/linux/pid.h:5, from ../include/linux/sched.h:14, from ../arch/x86/um/tls_32.c:7: ../arch/x86/um/asm/processor_32.h:22:31: note: 'tls_array' declared here 22 | struct uml_tls_struct tls_array[GDT_ENTRY_TLS_ENTRIES]; | ^~~~~~~~~ ../arch/x86/um/tls_32.c: In function 'get_tls_entry': ../arch/x86/um/tls_32.c:243:13: warning: the comparison will always evaluate as 'true' for the address of 'tls_array' will never be NULL [-Waddress] 243 | if (!t->arch.tls_array) | ^ ../arch/x86/um/asm/processor_32.h:22:31: note: 'tls_array' declared here 22 | struct uml_tls_struct tls_array[GDT_ENTRY_TLS_ENTRIES]; | ^~~~~~~~~ Signed-off-by: Lukas Straub <lukasstraub2@web.de> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Signed-off-by: Richard Weinberger <richard@nod.at>
2022-09-19 | um: Cleanup syscall_handler_t cast in syscalls_32.h | Lukas Straub
Like in f4f03f299a56ce4d73c5431e0327b3b6cb55ebb9 "um: Cleanup syscall_handler_t definition/cast, fix warning", remove the cast to fix the compiler warning. Signed-off-by: Lukas Straub <lukasstraub2@web.de> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Signed-off-by: Richard Weinberger <richard@nod.at>
2022-09-16 | bpf: Move bpf_dispatcher function out of ftrace locations | Jiri Olsa
The dispatcher function is attached to and detached from the trampoline by the dispatcher update function. At the same time it's available as an ftrace attachable function. After discussion [1] the proposed solution is to use compiler attributes to alter the bpf_dispatcher_##name##_func function:
- remove it from being instrumented with the __no_instrument_function__ attribute, so ftrace does not track it
- but still generate 5 nop instructions with the patchable_function_entry(5) attribute, which are expected by bpf_arch_text_poke used by the dispatcher update function
Enable the HAVE_DYNAMIC_FTRACE_NO_PATCHABLE option for x86, so __patchable_function_entries functions are not part of ftrace/mcount locations. Add attributes to the bpf_dispatcher_XXX function on x86_64 so it's kept out of ftrace locations and has a 5 byte nop generated at entry. These attributes need to be arch specific, as pointed out by Ilya Leoshkevich in [2]. The dispatcher image is generated only for the x86_64 arch, so the code can stay as is for other archs. [1] https://lore.kernel.org/bpf/20220722110811.124515-1-jolsa@kernel.org/ [2] https://lore.kernel.org/bpf/969a14281a7791c334d476825863ee449964dd0c.camel@linux.ibm.com/ Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/bpf/20220903131154.420467-3-jolsa@kernel.org
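The two attributes named above are ordinary GCC/clang function attributes and can be tried on any function. The sketch below only shows their spelling: `dispatcher_sketch()` is a hypothetical function, and how the kernel wraps the attributes in its own macros is not shown here. It assumes a compiler recent enough to know `patchable_function_entry`.

```c
/*
 * no_instrument_function: omit the profiling/instrumentation entry hook.
 * patchable_function_entry(5): still reserve 5 NOPs at the function entry,
 * which can later be rewritten in place.
 */
__attribute__((no_instrument_function, patchable_function_entry(5)))
int dispatcher_sketch(int x)
{
	return x + 1;
}
```

On x86_64, disassembling the object file should show the NOP padding at the start of the function, while its behaviour when called is unchanged.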
2022-09-15 | x86,retpoline: Be sure to emit INT3 after JMP *%\reg | Peter Zijlstra
Both AMD and Intel recommend using INT3 after an indirect JMP. Make sure to emit one when rewriting the retpoline JMP irrespective of compiler SLS options or even CONFIG_SLS. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Link: https://lkml.kernel.org/r/Yxm+QkFPOhrVSH6q@hirez.programming.kicks-ass.net
2022-09-13 | perf: Kill __PERF_SAMPLE_CALLCHAIN_EARLY | Namhyung Kim
There's no in-tree user anymore. Let's get rid of it. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220908214104.3851807-3-namhyung@kernel.org
2022-09-13 | perf: Use sample_flags for callchain | Namhyung Kim
So that it can call perf_callchain() only if needed. Historically it used __PERF_SAMPLE_CALLCHAIN_EARLY but we can do that with sample_flags in the struct perf_sample_data. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220908214104.3851807-1-namhyung@kernel.org
2022-09-11 | kernel: exit: cleanup release_thread() | Kefeng Wang
Only x86 has its own release_thread(); introduce a new weak release_thread() function to clean up the empty definitions in other architectures. Link: https://lkml.kernel.org/r/20220819014406.32266-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Guo Ren <guoren@kernel.org> [csky] Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Brian Cain <bcain@quicinc.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Stafford Horne <shorne@gmail.com> [openrisc] Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Huacai Chen <chenhuacai@kernel.org> [LoongArch] Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> [csky] Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Xuerui Wang <kernel@xen0n.name> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 | x86/mm: disable instrumentations of mm/pgprot.c | Naohiro Aota
Commit 4867fbbdd6b3 ("x86/mm: move protection_map[] inside the platform") moved accesses to protection_map[] from mem_encrypt_amd.c to pgprot.c. As a result, the accesses are now targets of KASAN (and other instrumentations), leading to a crash during the boot process. Disable the instrumentations for pgprot.c, as was done in commit 67bb8e999e0a ("x86/mm: Disable various instrumentations of mm/mem_encrypt.c and mm/tlb.c"). Before this patch, my AMD machine could not boot since v6.0-rc1 with KASAN enabled, and nothing was printed. After the change, it boots successfully. Fixes: 4867fbbdd6b3 ("x86/mm: move protection_map[] inside the platform") Link: https://lkml.kernel.org/r/20220824084726.2174758-1-naohiro.aota@wdc.com Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-09Merge tag 'asm-generic-fixes-6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-genericLinus Torvalds
Pull SOFTIRQ_ON_OWN_STACK rework from Arnd Bergmann: "Just one fixup patch, reworking the softirq_on_own_stack logic for preempt-rt kernels as discussed in https://lore.kernel.org/all/CAHk-=wgZSD3W2y6yczad2Am=EfHYyiPzTn3CfXxrriJf9i5W5w@mail.gmail.com/" * tag 'asm-generic-fixes-6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: asm-generic: Conditionally enable do_softirq_own_stack() via Kconfig.
2022-09-08x86/sgx: Handle VA page allocation failure for EAUG on PF.Haitao Huang
VM_FAULT_NOPAGE is expected behaviour for -EBUSY failure path, when augmenting a page, as this means that the reclaimer thread has been triggered, and the intention is just to round-trip in ring-3, and retry with a new page fault. Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave") Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Vijay Dhanraj <vijay.dhanraj@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220906000221.34286-3-jarkko@kernel.org
2022-09-08x86/sgx: Do not fail on incomplete sanitization on premature stop of ksgxdJarkko Sakkinen
Unsanitized pages trigger WARN_ON() unconditionally, which can panic the whole computer, if /proc/sys/kernel/panic_on_warn is set. In sgx_init(), if misc_register() fails or misc_register() succeeds but neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be prematurely stopped. This may leave unsanitized pages, which will result in a false warning. Refine __sgx_sanitize_pages() to return: 1. Zero when the sanitization process is complete or ksgxd has been requested to stop. 2. The number of unsanitized pages otherwise. Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list") Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-sgx/20220825051827.246698-1-jarkko@kernel.org/T/#u Link: https://lkml.kernel.org/r/20220906000221.34286-2-jarkko@kernel.org
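The return contract described above can be illustrated with a small user-space sketch. This is not the kernel's __sgx_sanitize_pages(): the struct, the function name and the `stop_requested` parameter are hypothetical stand-ins, and the real code works on EPC pages and checks the kthread stop state itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one dirty page; not the kernel's data structure. */
struct demo_page {
    bool can_sanitize;  /* whether this pass is able to clean the page */
    bool sanitized;     /* set once the page has been cleaned */
};

/*
 * Sketch of the contract: return 0 when every page is clean or when a stop
 * was requested, otherwise return the number of pages still unsanitized.
 */
static size_t demo_sanitize_pages(struct demo_page *pages, size_t n,
                                  bool stop_requested)
{
    size_t left = 0;

    assert(pages != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (stop_requested)
            return 0;   /* premature stop: nothing to warn about */
        if (pages[i].sanitized)
            continue;
        if (pages[i].can_sanitize)
            pages[i].sanitized = true;
        else
            left++;     /* still dirty after this pass */
    }
    return left;
}
```

With this shape, a caller would only warn when the final pass returns a non-zero count, so a deliberate early stop no longer looks like a sanitization failure.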
2022-09-08EDAC/i10nm: Add driver decoder for Ice Lake and Tremont CPUsYouquan Song
Current i10nm_edac only supports firmware decoder (ACPI DSM methods). MCA bank registers of Ice Lake or Tremont CPUs contain the information to decode DDR memory errors. To get better decoding performance, add the driver decoder (decoding DDR memory errors via extracting error information from MCA bank registers) for Ice Lake and Tremont CPUs. Co-developed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20220901194310.115427-1-tony.luck@intel.com/
2022-09-07perf/x86/intel: Optimize FIXED_CTR_CTRL accessKan Liang
All the fixed counters share a fixed control register. The current perf reads and re-writes the fixed control register for each fixed counter disable/enable, which is unnecessary. When changing the fixed control register, the entire PMU must be disabled via the global control register. The change cannot take effect until the entire PMU is re-enabled. Updating the fixed control register once, right before the entire PMU is re-enabled, is enough. The read of the fixed control register is not necessary either. The value can be cached in the per CPU cpu_hw_events. Test results: Counting all the fixed counters with the perf bench sched pipe as below on a SPR machine. $ perf stat -e cycles,instructions,ref-cycles,slots --no-inherit -- taskset -c 1 perf bench sched pipe The total elapsed time reduces from 5.36s (without the patch) to 4.99s (with the patch), which is a ~6.9% improvement. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220804140729.2951259-1-kan.liang@linux.intel.com
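The caching idea can be sketched in plain C as follows. This is a user-space model, not the perf code: the struct, the function names and the 4-bit field width per counter are assumptions made for illustration, and a plain variable stands in for the hardware register.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FIXED_BITS 4u  /* assumed width of each counter's control field */

/* Hypothetical model of the shared fixed-counter control register. */
struct demo_pmu {
    uint64_t hw_fixed_ctrl;      /* stands in for the hardware register */
    uint64_t cached_fixed_ctrl;  /* software copy, updated per counter */
    unsigned int hw_writes;      /* counts simulated register writes */
};

/* Per-counter enable: touches only the cached value, never the hardware. */
static void demo_enable_fixed(struct demo_pmu *pmu, unsigned int idx,
                              uint64_t bits)
{
    assert(idx < 64 / DEMO_FIXED_BITS);
    uint64_t mask = 0xfULL << (idx * DEMO_FIXED_BITS);

    pmu->cached_fixed_ctrl &= ~mask;
    pmu->cached_fixed_ctrl |= (bits << (idx * DEMO_FIXED_BITS)) & mask;
}

/* Per-counter disable: clears the counter's field in the cached value. */
static void demo_disable_fixed(struct demo_pmu *pmu, unsigned int idx)
{
    assert(idx < 64 / DEMO_FIXED_BITS);
    pmu->cached_fixed_ctrl &= ~(0xfULL << (idx * DEMO_FIXED_BITS));
}

/* Whole-PMU re-enable: one register write, and only if the value changed. */
static void demo_enable_all(struct demo_pmu *pmu)
{
    if (pmu->hw_fixed_ctrl != pmu->cached_fixed_ctrl) {
        pmu->hw_fixed_ctrl = pmu->cached_fixed_ctrl;
        pmu->hw_writes++;
    }
}
```

In this model, enabling three counters and then re-enabling the PMU costs one simulated register write and no register reads, where a per-counter read-modify-write would cost three of each.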
2022-09-07perf/x86/p4: Remove perfctr_second_write quirkPeter Zijlstra
Now that we have a x86_pmu::set_period() method, use it to remove the perfctr_second_write quirk from the generic code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220829101321.839502514@infradead.org
2022-09-07perf/x86/intel: Remove x86_pmu::update_topdown_eventPeter Zijlstra
Now that it is all internal to the intel driver, remove x86_pmu::update_topdown_event. Assumes that is_topdown_count(event) can only be true when the hardware has topdown stuff and the function is set. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220829101321.771635301@infradead.org
2022-09-07perf/x86/intel: Remove x86_pmu::set_topdown_event_periodPeter Zijlstra
Now that it is all internal to the intel driver, remove x86_pmu::set_topdown_event_period. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220829101321.706354189@infradead.org
2022-09-07perf/x86: Add a x86_pmu::limit_period static_callPeter Zijlstra
Avoid a branch and indirect call. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220829101321.640658334@infradead.org
2022-09-07perf/x86: Change x86_pmu::limit_period signaturePeter Zijlstra
In preparation for making it a static_call, change the signature. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220829101321.573713839@infradead.org
2022-09-07perf/x86/intel: Move the topdown stuff into the intel driverPeter Zijlstra
Use the new x86_pmu::{set_period,update}() methods to push the topdown stuff into the Intel driver, where it belongs. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220829101321.505933457@infradead.org
2022-09-07perf/x86: Add two more x86_pmu methodsPeter Zijlstra
In order to clean up x86_perf_event_{set_period,update}() start by adding them as x86_pmu methods. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220829101321.440196408@infradead.org
2022-09-07x86/perf: Assert all platform event flags are within PERF_EVENT_FLAG_ARCHAnshuman Khandual
Ensure all platform specific event flags are within PERF_EVENT_FLAG_ARCH. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: James Clark <james.clark@arm.com> Link: https://lkml.kernel.org/r/20220907091924.439193-5-anshuman.khandual@arm.com
2022-09-06bpf: x86: Support in-register struct arguments in trampoline programsYonghong Song
In C, a struct value can be passed as a function argument. For small structs, the struct value may be passed in one or more registers. For trampoline based bpf programs, this causes complications since the one-to-one mapping between function arguments and arch argument registers is no longer valid. The latest llvm16 added bpf support to pass structs of up to 16 bytes by value ([1]). This is also true for the x86_64 architecture, where two registers will hold the struct value if the struct size is >8 and <= 16. This may not be true if one of the struct members is of 'double' type, but the current linux source code has no such instance yet, so we assume every >8 && <= 16 struct occupies two general purpose argument registers. Also change the on-stack nr_args value to the number of registers holding the arguments. This will permit the bpf_get_func_arg() helper to get all argument values. [1] https://reviews.llvm.org/D132144 Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220831152652.2078600-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
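The register-counting rule stated above can be written down as a short sketch. The function names are hypothetical and this is not the trampoline code itself. It only encodes the assumption from the message: arguments of up to 8 bytes take one general purpose register, and arguments larger than 8 and up to 16 bytes take two.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Registers used by one argument under the stated assumption.
 * Returns 0 for sizes the rule does not cover (0 bytes or more than 16).
 */
static int demo_arg_regs(size_t size)
{
    if (size == 0 || size > 16)
        return 0;
    return size > 8 ? 2 : 1;
}

/*
 * Total registers for an argument list, i.e. the value that would be stored
 * in place of the plain argument count. Returns -1 if any argument falls
 * outside the rule.
 */
static int demo_nr_regs(const size_t *sizes, size_t nargs)
{
    int total = 0;

    assert(sizes != NULL || nargs == 0);

    for (size_t i = 0; i < nargs; i++) {
        int regs = demo_arg_regs(sizes[i]);

        if (regs == 0)
            return -1;
        total += regs;
    }
    return total;
}
```

For example, a function taking an 8-byte pointer, a 16-byte struct and a 4-byte int has three arguments but, under this rule, occupies four argument registers.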
2022-09-06perf: Use sample_flags for txnKan Liang
Use the new sample_flags to indicate whether the txn field is filled by the PMU driver. Remove the txn field from the perf_sample_data_init() to minimize the number of cache lines touched. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-7-kan.liang@linux.intel.com
2022-09-06perf: Use sample_flags for data_srcKan Liang
Use the new sample_flags to indicate whether the data_src field is filled by the PMU driver. Remove the data_src field from the perf_sample_data_init() to minimize the number of cache lines touched. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-6-kan.liang@linux.intel.com
2022-09-06perf: Use sample_flags for weightKan Liang
Use the new sample_flags to indicate whether the weight field is filled by the PMU driver. Remove the weight field from the perf_sample_data_init() to minimize the number of cache lines touched. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-5-kan.liang@linux.intel.com
2022-09-06perf: Use sample_flags for branch stackKan Liang
Use the new sample_flags to indicate whether the branch stack is filled by the PMU driver. Remove the br_stack from the perf_sample_data_init() to minimize the number of cache lines touched. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-4-kan.liang@linux.intel.com
2022-09-06perf/x86/intel/pebs: Fix PEBS timestamps overwrittenKan Liang
The PEBS TSC-based timestamps do not appear correctly in the final perf.data output file from perf record. The data->time field setup by PEBS in the setup_pebs_fixed_sample_data() is later overwritten by perf_events generic code in perf_prepare_sample(). There is an ordering problem. Set the sample flags when the data->time is updated by PEBS. The data->time field will not be overwritten anymore. Reported-by: Andreas Kogler <andreas.kogler.0x@gmail.com> Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-3-kan.liang@linux.intel.com
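The ordering fix, like the sample_flags conversions in the entries above, follows one pattern: the driver marks a field as filled, and the generic code fills it only if the mark is absent. A minimal user-space sketch follows. The struct, the flag bit and the function names are hypothetical, not the perf_sample_data API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SAMPLE_TIME (1u << 0)  /* hypothetical "time is filled" bit */

/* Hypothetical sample record; the real one carries many more fields. */
struct demo_sample {
    uint32_t sample_flags;  /* which fields have already been filled */
    uint64_t time;
};

/* Driver side: store the hardware timestamp and mark the field as filled. */
static void demo_driver_set_time(struct demo_sample *s, uint64_t hw_time)
{
    assert(s != NULL);
    s->time = hw_time;
    s->sample_flags |= DEMO_SAMPLE_TIME;
}

/* Generic side: fill the timestamp only if the driver has not done so. */
static void demo_prepare_sample(struct demo_sample *s, uint64_t now)
{
    assert(s != NULL);
    if (!(s->sample_flags & DEMO_SAMPLE_TIME)) {
        s->time = now;
        s->sample_flags |= DEMO_SAMPLE_TIME;
    }
}
```

Because the flag travels with the sample, the generic step can run after the driver without overwriting the driver's value, and fields the driver never touches need no initialization up front.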
2022-09-05asm-generic: Conditionally enable do_softirq_own_stack() via Kconfig.Sebastian Andrzej Siewior
Remove the CONFIG_PREEMPT_RT symbol from the ifdef around do_softirq_own_stack() and move it to Kconfig instead. Enable softirq stacks based on SOFTIRQ_ON_OWN_STACK which depends on HAVE_SOFTIRQ_ON_OWN_STACK and its default value is set to !PREEMPT_RT. This ensures that softirq stacks are not used on PREEMPT_RT and avoids a 'select' statement on an option which has a 'depends' statement. Link: https://lore.kernel.org/YvN5E%2FPrHfUhggr7@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-09-02x86/defconfig: Enable CONFIG_DEBUG_WX=yIngo Molnar
7 years after it got introduced it's time to make this the default, at least in the x86 defconfigs. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-09-02x86/defconfig: Refresh the defconfigsIngo Molnar
Just go through a 'make savedefconfig' cycle to pick up fresh Kconfig details, no change in settings, just reordering of some entries. ( This makes followup changes generated via 'make savedefconfig' contain less noise. ) Signed-off-by: Ingo Molnar <mingo@kernel.org>