git.armlinux.org.uk/linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2021-10-16	arm64: dts: imx8mm-kontron: Fix connection type for VSC8531 RGMII PHY	Frieder Schrempf
	Previously we falsely relied on the PHY driver to unconditionally enable the internal RX delay. Since the following fix for the PHY driver this is not the case anymore: commit 7b005a1742be ("net: phy: mscc: configure both RX and TX internal delays for RGMII") In order to enable the delay we need to set the connection type to "rgmii-rxid". Without the RX delay the ethernet is not functional at all. Fixes: 8668d8b2e67f ("arm64: dts: Add the Kontron i.MX8M Mini SoMs and baseboards") Cc: stable@vger.kernel.org Signed-off-by: Frieder Schrempf <frieder.schrempf@kontron.de> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-16	arm64: dts: imx8mm-kontron: Fix CAN SPI clock frequency	Frieder Schrempf
	The MCP2515 can be used with an SPI clock of up to 10 MHz. Set the limit accordingly to prevent any performance issues caused by the really low clock speed of 100 kHz. This removes the arbitrarily low limit on the SPI frequency, that was caused by a typo in the original dts. Without this change, receiving CAN messages on the board beyond a certain bitrate will cause overrun errors (see 'ip -det -stat link show can0'). With this fix, receiving messages on the bus works without any overrun errors for bitrates up to 1 MBit. Fixes: 8668d8b2e67f ("arm64: dts: Add the Kontron i.MX8M Mini SoMs and baseboards") Cc: stable@vger.kernel.org Signed-off-by: Frieder Schrempf <frieder.schrempf@kontron.de> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-16	arm64: dts: imx8mm-kontron: Fix polarity of reg_rst_eth2	Frieder Schrempf
	The regulator reg_rst_eth2 should keep the reset signal of the USB ethernet adapter deasserted anytime. Fix the polarity and mark it as always-on. Anyway, using the regulator is only a workaround for the missing support of specifying a reset GPIO for USB devices in a generic way. As we don't have a solution for this at the moment, at least fix the current workaround. Fixes: 8668d8b2e67f ("arm64: dts: Add the Kontron i.MX8M Mini SoMs and baseboards") Cc: stable@vger.kernel.org Signed-off-by: Frieder Schrempf <frieder.schrempf@kontron.de> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-16	arm64: dts: imx8mm-kontron: Set lower limit of VDD_SNVS to 800 mV	Frieder Schrempf
	According to the datasheet the typical value for VDD_SNVS should be 800 mV, so let's make sure that this is within the range of the regulator. Fixes: 8668d8b2e67f ("arm64: dts: Add the Kontron i.MX8M Mini SoMs and baseboards") Cc: stable@vger.kernel.org Signed-off-by: Frieder Schrempf <frieder.schrempf@kontron.de> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-16	arm64: dts: imx8mm-kontron: Make sure SOC and DRAM supply voltages are correct	Frieder Schrempf
	It looks like the voltages for the SOC and DRAM supply weren't properly validated before. The datasheet and uboot-imx code tells us that VDD_SOC should be 800 mV in suspend and 850 mV in run mode. VDD_DRAM should be 950 mV for DDR clock frequencies of up to 1.5 GHz. Let's fix these values to make sure the voltages are within the required range. Fixes: 8668d8b2e67f ("arm64: dts: Add the Kontron i.MX8M Mini SoMs and baseboards") Cc: stable@vger.kernel.org Signed-off-by: Frieder Schrempf <frieder.schrempf@kontron.de> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-16	arm64: dts: imx8mm-kontron: Add support for ultra high speed modes on SD card	Frieder Schrempf
	In order to use ultra high speed modes (UHS) on the SD card slot, we add matching pinctrls and fix the voltage switching for LDO5 of the PMIC, by providing the SD_VSEL pin as GPIO to the PMIC driver. Signed-off-by: Frieder Schrempf <frieder.schrempf@kontron.de> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-15	ARC: fix potential build snafu	Vineet Gupta
	In the big pgtable header split, I inadvertently introduced a couple of duplicate symbols. Fixes: fe6cb7b043b69cd9 ("ARC: mm: disintegrate pgtable.h into levels and flags") Signed-off-by: Vineet Gupta <vgupta@kernel.org>
2021-10-16	csky: Make HAVE_TCM depend on !COMPILE_TEST	Guenter Roeck
	Building csky:allmodconfig results in the following build errors. arch/csky/mm/tcm.c:9:2: error: #error "You should define ITCM_RAM_BASE" 9 \| #error "You should define ITCM_RAM_BASE" \| ^~~~~ arch/csky/mm/tcm.c:14:2: error: #error "You should define DTCM_RAM_BASE" 14 \| #error "You should define DTCM_RAM_BASE" \| ^~~~~ arch/csky/mm/tcm.c:18:2: error: #error "You should define correct DTCM_RAM_BASE" 18 \| #error "You should define correct DTCM_RAM_BASE" This is seen with compile tests since those enable HAVE_TCM, but do not provide useful default values for ITCM_RAM_BASE or DTCM_RAM_BASE. Disable HAVE_TCM for commpile tests to avoid the error. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Guo Ren <guoren@kernel.org>
2021-10-16	csky: bitops: Remove duplicate __clear_bit define	Guenter Roeck
	Building csky:allmodconfig results in the following build error. In file included from ./include/linux/bitops.h:33, from ./include/linux/log2.h:12, from kernel/bounds.c:13: ./arch/csky/include/asm/bitops.h:77: error: "__clear_bit" redefined Since commit 9248e52fec95 ("locking/atomic: simplify non-atomic wrappers"), __clear_bit is defined in include/asm-generic/bitops/non-atomic.h, and the define in the csky include file is no longer necessary or useful. Remove it. Fixes: 9248e52fec95 ("locking/atomic: simplify non-atomic wrappers") Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Guo Ren <guoren@kernel.org>
2021-10-16	csky: Select ARCH_WANT_FRAME_POINTERS only if compiler supports it	Guenter Roeck
	Compiling csky:allmodconfig with an upstream C compiler results in the following error. csky-linux-gcc: error: unrecognized command-line option '-mbacktrace'; did you mean '-fbacktrace'? Select ARCH_WANT_FRAME_POINTERS only if gcc supports it to avoid the error. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Guo Ren <guoren@kernel.org>
2021-10-16	csky: Fixup regs.sr broken in ptrace	Guo Ren
	gpr_get() return the entire pt_regs (include sr) to userspace, if we don't restore the C bit in gpr_set, it may break the ALU result in that context. So the C flag bit is part of gpr context, that's why riscv totally remove the C bit in the ISA. That makes sr reg clear from userspace to supervisor privilege. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org
2021-10-16	csky: don't let sigreturn play with priveleged bits of status register	Al Viro
	csky restore_sigcontext() blindly overwrites regs->sr with the value it finds in sigcontext. Attacker can store whatever they want in there, which includes things like S-bit. Userland shouldn't be able to set that, or anything other than C flag (bit 0). Do the same thing other architectures with protected bits in flags register do - preserve everything that shouldn't be settable in user mode, picking the rest from the value saved is sigcontext. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Guo Ren <guoren@kernel.org> Cc: stable@vger.kernel.org
2021-10-16	arm64: defconfig: Visconti: Enable PCIe host controller	Nobuhiro Iwamatsu
	Enable Visconti's PCIe host controller in the ARM64 defconfig. Signed-off-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp> Link: https://lore.kernel.org/r/20210907042340.1525711-1-nobuhiro1.iwamatsu@toshiba.co.jp Signed-off-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
2021-10-16	arm64: dts: visconti: Add DTS for the VisROBO board	Yuji Ishikawa
	This board consists of two boards: the SoM board (VRC SoM) with the SoC on board, and the board for I/O (VRB). The functions of each board supported by this update are as follows: - VRC SoM - WDT - GPIO - SDCard (SPI-MMC mode) - I2C x1 - VRB board - VRC SoM - UART x2 - Ethernet phy Signed-off-by: Yuji Ishikawa <yuji2.ishikawa@toshiba.co.jp> Link: https://lore.kernel.org/r/20211014092703.15251-4-yuji2.ishikawa@toshiba.co.jp Signed-off-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
2021-10-16	arm64: dts: visconti: Add 150MHz fixed clock to TMPV7708 SoC	Yuji Ishikawa
	This clock source is referred by baudrate generators of SPI and I2C devices. Signed-off-by: Yuji Ishikawa <yuji2.ishikawa@toshiba.co.jp> Link: https://lore.kernel.org/r/20211014092703.15251-2-yuji2.ishikawa@toshiba.co.jp Signed-off-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
2021-10-16	arm64: dts: visconti: Add PCIe host controller support for TMPV7708 SoC	Nobuhiro Iwamatsu
	Add PCIe node and fixed clock for PCIe in TMPV7708's dtsi, and tmpv7708-rm-mbrc boards's dts. Signed-off-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp> Link: https://lore.kernel.org/r/20210907042500.1525771-1-nobuhiro1.iwamatsu@toshiba.co.jp
2021-10-15	Merge tag 'imx-fixes-5.15-3' of ↵	Arnd Bergmann
	git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into arm/fixes i.MX fixes for 5.15, round 3: - Add platform device for i.MX System Reset Controller (SRC) to fix a regression caused by fw_devlink change. * tag 'imx-fixes-5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux: ARM: imx: register reset controller from a platform driver Link: https://lore.kernel.org/r/20211015070017.GI22881@dragon Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-10-15	ARM: dts: stm32: use usbphyc ck_usbo_48m as USBH OHCI clock on stm32mp151	Amelie Delaunay
	Referring to the note under USBH reset and clocks chapter of RM0436, "In order to access USBH_OHCI registers it is necessary to activate the USB clocks by enabling the PLL controlled by USBPHYC" (ck_usbo_48m). The point is, when USBPHYC PLL is not enabled, OHCI register access freezes the resume from STANDBY. It is the case when dual USBH is enabled, instead of OTG + single USBH. When OTG is probed, as ck_usbo_48m is USBO clock parent, then USBPHYC PLL is enabled and OHCI register access is OK. This patch adds ck_usbo_48m (provided by USBPHYC PLL) as clock of USBH OHCI, thus USBPHYC PLL will be enabled and OHCI register access will be OK. Signed-off-by: Amelie Delaunay <amelie.delaunay@foss.st.com> Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com>
2021-10-15	ARM: dts: stm32: fix AV96 board SAI2 pin muxing on stm32mp15	Olivier Moysan
	Fix SAI2A and SAI2B pin muxings for AV96 board on STM32MP15. Change sai2a-4 & sai2a-5 to sai2a-2 & sai2a-2. Change sai2a-4 & sai2a-sleep-5 to sai2b-2 & sai2b-sleep-2 Fixes: dcf185ca8175 ("ARM: dts: stm32: Add alternate pinmux for SAI2 pins on stm32mp15") Signed-off-by: Olivier Moysan <olivier.moysan@foss.st.com> Reviewed-by: Marek Vasut <marex@denx.de> Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com>
2021-10-15	ARM: dts: stm32: fix SAI sub nodes register range	Olivier Moysan
	The STM32 SAI subblocks registers offsets are in the range 0x0004 (SAIx_CR1) to 0x0020 (SAIx_DR). The corresponding range length is 0x20 instead of 0x1c. Change reg property accordingly. Fixes: 5afd65c3a060 ("ARM: dts: stm32: add sai support on stm32mp157c") Signed-off-by: Olivier Moysan <olivier.moysan@foss.st.com> Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com>
2021-10-15	ARM: dts: stm32: fix STUSB1600 Type-C irq level on stm32mp15xx-dkx	Fabrice Gasnier
	STUSB1600 IRQ (Alert pin) is active low (open drain). Interrupts may get lost currently, so fix the IRQ type. Fixes: 83686162c0eb ("ARM: dts: stm32: add STUSB1600 Type-C using I2C4 on stm32mp15xx-dkx") Signed-off-by: Fabrice Gasnier <fabrice.gasnier@foss.st.com> Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com>
2021-10-15	ARM: dts: stm32: set the DCMI pins on stm32mp157c-odyssey	Grzegorz Szymaszek
	The Seeed Odyssey-STM32MP157C board has a 20-pin DVP camera output. The DCMI pins used on this output are defined in the pin state definition &pinctrl/dcmi-1, AKA &dcmi_pins_b (added in mainline commit 02814a41529a55dbfb9fbb2a3728e78e70646ea6). Set these pins as the default pinctrl of the DCMI peripheral in the board device tree. The pins are not used for any other purpose, so it seems safe to assume most users will not need to override (delete) what this patch provides. status defaults to "disabled", so the peripheral will not be unnecessarily started. And the users who actually intend to make use of a camera on the DVP port will have this little part of the configuration ready. Signed-off-by: Grzegorz Szymaszek <gszymaszek@short.pl> Reviewed-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com>
2021-10-15	ARM: dts: stm32: Reduce DHCOR SPI NOR frequency to 50 MHz	Marek Vasut
	The SPI NOR is a bit further away from the SoC on DHCOR than on DHCOM, which causes additional signal delay. At 108 MHz, this delay triggers a sporadic issue where the first bit of RX data is not received by the QSPI controller. There are two options of addressing this problem, either by using the DLYB block to compensate the extra delay, or by reducing the QSPI bus clock frequency. The former requires calibration and that is overly complex, so opt for the second option. Fixes: 76045bc457104 ("ARM: dts: stm32: Add QSPI NOR on AV96") Signed-off-by: Marek Vasut <marex@denx.de> Cc: Alexandre Torgue <alexandre.torgue@foss.st.com> Cc: Patrice Chotard <patrice.chotard@foss.st.com> Cc: Patrick Delaunay <patrick.delaunay@foss.st.com> Cc: linux-stm32@st-md-mailman.stormreply.com To: linux-arm-kernel@lists.infradead.org Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com>
2021-10-15	ARM: dts: stm32: add initial support of stm32mp135f-dk board	Alexandre Torgue
	Add support of stm32mp135f discovery board (part number: STM32MP135F-DK). It embeds a STM32MP135F SOC with 512 MB of DDR3. Several connections are available on this board: 4USB2.0, 1USB2.0 typeC DRD, SDcard, 2*RJ45, HDMI, Combo Wifi/BT, ... Only SD card, uart4 (console) and watchdog IPs are enabled in this commit. Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
2021-10-15	ARM: dts: stm32: add STM32MP13 SoCs support	Alexandre Torgue
	Add initial support of STM32MP13 family. The STM32MP13 SoC diversity is composed by: -STM32MP131: -core: 1CA7, 17TIMERS, 5LPTIMERS, DMA/MDMA/DMAMUX -storage: 3SDMCC, 1QSPI, FMC -com: USB (OHCI/EHCI, OTG), 5I2C, 5SPI/I2S, 8U(S)ART -audio: 2SAI -network: 1ETH(GMAC) -STM32MP133: STM32MP131 + 2*CAN, ETH2(GMAC), ADC1 -STM32MP135: STM32MP133 + DCMIPP, LTDC A second diversity layer exists for security features: -STM32MP13xY, "Y" gives information: -Y = A/D means no cryp IP and no secure boot. -Y = C/F means cryp IP + secure boot. This commit adds basic peripheral. Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
2021-10-16	KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest	Michael Ellerman
	We call idle_kvm_start_guest() from power7_offline() if the thread has been requested to enter KVM. We pass it the SRR1 value that was returned from power7_idle_insn() which tells us what sort of wakeup we're processing. Depending on the SRR1 value we pass in, the KVM code might enter the guest, or it might return to us to do some host action if the wakeup requires it. If idle_kvm_start_guest() is able to handle the wakeup, and enter the guest it is supposed to indicate that by returning a zero SRR1 value to us. That was the behaviour prior to commit 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C"), however in that commit the handling of SRR1 was reworked, and the zeroing behaviour was lost. Returning from idle_kvm_start_guest() without zeroing the SRR1 value can confuse the host offline code, causing the guest to crash and other weirdness. Fixes: 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211015133929.832061-2-mpe@ellerman.id.au
2021-10-16	KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()	Michael Ellerman
	In commit 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C") kvm_start_guest() became idle_kvm_start_guest(). The old code allocated a stack frame on the emergency stack, but didn't use the frame to store anything, and also didn't store anything in its caller's frame. idle_kvm_start_guest() on the other hand is written more like a normal C function, it creates a frame on entry, and also stores CR/LR into its callers frame (per the ABI). The problem is that there is no caller frame on the emergency stack. The emergency stack for a given CPU is allocated with: paca_ptrs[i]->emergency_sp = alloc_stack(limit, i) + THREAD_SIZE; So emergency_sp actually points to the first address above the emergency stack allocation for a given CPU, we must not store above it without first decrementing it to create a frame. This is different to the regular kernel stack, paca->kstack, which is initialised to point at an initial frame that is ready to use. idle_kvm_start_guest() stores the backchain, CR and LR all of which write outside the allocation for the emergency stack. It then creates a stack frame and saves the non-volatile registers. Unfortunately the frame it creates is not large enough to fit the non-volatiles, and so the saving of the non-volatile registers also writes outside the emergency stack allocation. The end result is that we corrupt whatever is at 0-24 bytes, and 112-248 bytes above the emergency stack allocation. In practice this has gone unnoticed because the memory immediately above the emergency stack happens to be used for other stack allocations, either another CPUs mc_emergency_sp or an IRQ stack. See the order of calls to irqstack_early_init() and emergency_stack_init(). The low addresses of another stack are the top of that stack, and so are only used if that stack is under extreme pressue, which essentially never happens in practice - and if it did there's a high likelyhood we'd crash due to that stack overflowing. Still, we shouldn't be corrupting someone else's stack, and it is purely luck that we aren't corrupting something else. To fix it we save CR/LR into the caller's frame using the existing r1 on entry, we then create a SWITCH_FRAME_SIZE frame (which has space for pt_regs) on the emergency stack with the backchain pointing to the existing stack, and then finally we switch to the new frame on the emergency stack. Fixes: 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211015133929.832061-1-mpe@ellerman.id.au
2021-10-15	perf/x86: Add new event for AUX output counter index	Adrian Hunter
	PEBS-via-PT records contain a mask of applicable counters. To identify which event belongs to which counter, a side-band event is needed. Until now, there has been no side-band event, and consequently users were limited to using a single event. Add such a side-band event. Note the event is optimised to output only when the counter index changes for an event. That works only so long as all PEBS-via-PT events are scheduled together, which they are for a recording session because they are in a single group. Also no attribute bit is used to select the new event, so a new kernel is not compatible with older perf tools. The assumption being that PEBS-via-PT is sufficiently esoteric that users will not be troubled by this. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210907163903.11820-2-adrian.hunter@intel.com
2021-10-15	perf/x86/msr: Add Sapphire Rapids CPU support	Kan Liang
	SMI_COUNT MSR is supported on Sapphire Rapids CPU. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1633551137-192083-1-git-send-email-kan.liang@linux.intel.com
2021-10-15	sched: Add cluster scheduler level for x86	Tim Chen
	There are x86 CPU architectures (e.g. Jacobsville) where L2 cahce is shared among a cluster of cores instead of being exclusive to one single core. To prevent oversubscription of L2 cache, load should be balanced between such L2 clusters, especially for tasks with no shared data. On benchmark such as SPECrate mcf test, this change provides a boost to performance especially on medium load system on Jacobsville. on a Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4 cores each, the benchmark number is as follow: Improvement over baseline kernel for mcf_r copies run time base rate 1 -0.1% -0.2% 6 25.1% 25.1% 12 18.8% 19.0% 24 0.3% 0.3% So this looks pretty good. In terms of the system's task distribution, some pretty bad clumping can be seen for the vanilla kernel without the L2 cluster domain for the 6 and 12 copies case. With the extra domain for cluster, the load does get evened out between the clusters. Note this patch isn't an universal win as spreading isn't necessarily a win, particually for those workload who can benefit from packing. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210924085104.44806-4-21cnbao@gmail.com
2021-10-15	sched: Add cluster scheduler level in core and related Kconfig for ARM64	Barry Song
	This patch adds scheduler level for clusters and automatically enables the load balance among clusters. It will directly benefit a lot of workload which loves more resources such as memory bandwidth, caches. Testing has widely been done in two different hardware configurations of Kunpeng920: 24 cores in one NUMA(6 clusters in each NUMA node); 32 cores in one NUMA(8 clusters in each NUMA node) Workload is running on either one NUMA node or four NUMA nodes, thus, this can estimate the effect of cluster spreading w/ and w/o NUMA load balance. * Stream benchmark: 4threads stream (on 1NUMA * 24cores = 24cores) stream stream w/o patch w/ patch MB/sec copy 29929.64 ( 0.00%) 32932.68 ( 10.03%) MB/sec scale 29861.10 ( 0.00%) 32710.58 ( 9.54%) MB/sec add 27034.42 ( 0.00%) 32400.68 ( 19.85%) MB/sec triad 27225.26 ( 0.00%) 31965.36 ( 17.41%) 6threads stream (on 1NUMA * 24cores = 24cores) stream stream w/o patch w/ patch MB/sec copy 40330.24 ( 0.00%) 42377.68 ( 5.08%) MB/sec scale 40196.42 ( 0.00%) 42197.90 ( 4.98%) MB/sec add 37427.00 ( 0.00%) 41960.78 ( 12.11%) MB/sec triad 37841.36 ( 0.00%) 42513.64 ( 12.35%) 12threads stream (on 1NUMA * 24cores = 24cores) stream stream w/o patch w/ patch MB/sec copy 52639.82 ( 0.00%) 53818.04 ( 2.24%) MB/sec scale 52350.30 ( 0.00%) 53253.38 ( 1.73%) MB/sec add 53607.68 ( 0.00%) 55198.82 ( 2.97%) MB/sec triad 54776.66 ( 0.00%) 56360.40 ( 2.89%) Thus, it could help memory-bound workload especially under medium load. Similar improvement is also seen in lkp-pbzip2: * lkp-pbzip2 benchmark 2-96 threads (on 4NUMA * 24cores = 96cores) lkp-pbzip2 lkp-pbzip2 w/o patch w/ patch Hmean tput-2 11062841.57 ( 0.00%) 11341817.51 * 2.52%* Hmean tput-5 26815503.70 ( 0.00%) 27412872.65 * 2.23%* Hmean tput-8 41873782.21 ( 0.00%) 43326212.92 * 3.47%* Hmean tput-12 61875980.48 ( 0.00%) 64578337.51 * 4.37%* Hmean tput-21 105814963.07 ( 0.00%) 111381851.01 * 5.26%* Hmean tput-30 150349470.98 ( 0.00%) 156507070.73 * 4.10%* Hmean tput-48 237195937.69 ( 0.00%) 242353597.17 * 2.17%* Hmean tput-79 360252509.37 ( 0.00%) 362635169.23 * 0.66%* Hmean tput-96 394571737.90 ( 0.00%) 400952978.48 * 1.62%* 2-24 threads (on 1NUMA * 24cores = 24cores) lkp-pbzip2 lkp-pbzip2 w/o patch w/ patch Hmean tput-2 11071705.49 ( 0.00%) 11296869.10 * 2.03%* Hmean tput-4 20782165.19 ( 0.00%) 21949232.15 * 5.62%* Hmean tput-6 30489565.14 ( 0.00%) 33023026.96 * 8.31%* Hmean tput-8 40376495.80 ( 0.00%) 42779286.27 * 5.95%* Hmean tput-12 61264033.85 ( 0.00%) 62995632.78 * 2.83%* Hmean tput-18 86697139.39 ( 0.00%) 86461545.74 ( -0.27%) Hmean tput-24 104854637.04 ( 0.00%) 104522649.46 * -0.32%* In the case of 6 threads and 8 threads, we see the greatest performance improvement. Similar improvement can be seen on lkp-pixz though the improvement is smaller: * lkp-pixz benchmark 2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores) lkp-pixz lkp-pixz w/o patch w/ patch Hmean tput-2 6486981.16 ( 0.00%) 6561515.98 * 1.15%* Hmean tput-4 11645766.38 ( 0.00%) 11614628.43 ( -0.27%) Hmean tput-6 15429943.96 ( 0.00%) 15957350.76 * 3.42%* Hmean tput-8 19974087.63 ( 0.00%) 20413746.98 * 2.20%* Hmean tput-12 28172068.18 ( 0.00%) 28751997.06 * 2.06%* Hmean tput-18 39413409.54 ( 0.00%) 39896830.55 * 1.23%* Hmean tput-24 49101815.85 ( 0.00%) 49418141.47 * 0.64%* * SPECrate benchmark 4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores) Base Base Run Time Rate ------- --------- 4 Copies w/o 580 (w/ 570) w/o 11.1 (w/ 11.3) 8 Copies w/o 647 (w/ 605) w/o 20.0 (w/ 21.4, +7%) 16 Copies w/o 844 (w/ 844) w/o 30.6 (w/ 30.6) 32 Copies(on 4NUMA * 32 cores = 128cores) [w/o patch] Base Base Base Benchmarks Copies Run Time Rate --------------- ------- --------- --------- 500.perlbench_r 32 584 87.2 * 502.gcc_r 32 503 90.2 * 505.mcf_r 32 745 69.4 * 520.omnetpp_r 32 1031 40.7 * 523.xalancbmk_r 32 597 56.6 * 525.x264_r 1 -- CE 531.deepsjeng_r 32 336 109 * 541.leela_r 32 556 95.4 * 548.exchange2_r 32 513 163 * 557.xz_r 32 530 65.2 * Est. SPECrate2017_int_base 80.3 [w/ patch] Base Base Base Benchmarks Copies Run Time Rate --------------- ------- --------- --------- 500.perlbench_r 32 580 87.8 (+0.688%) * 502.gcc_r 32 477 95.1 (+5.432%) * 505.mcf_r 32 644 80.3 (+13.574%) * 520.omnetpp_r 32 942 44.6 (+9.58%) * 523.xalancbmk_r 32 560 60.4 (+6.714%%) * 525.x264_r 1 -- CE 531.deepsjeng_r 32 337 109 (+0.000%) * 541.leela_r 32 554 95.6 (+0.210%) * 548.exchange2_r 32 515 163 (+0.000%) * 557.xz_r 32 524 66.0 (+1.227%) * Est. SPECrate2017_int_base 83.7 (+4.062%) On the other hand, it is slightly helpful to CPU-bound tasks like kernbench: * 24-96 threads kernbench (on 4NUMA * 24cores = 96cores) kernbench kernbench w/o cluster w/ cluster Min user-24 12054.67 ( 0.00%) 12024.19 ( 0.25%) Min syst-24 1751.51 ( 0.00%) 1731.68 ( 1.13%) Min elsp-24 600.46 ( 0.00%) 598.64 ( 0.30%) Min user-48 12361.93 ( 0.00%) 12315.32 ( 0.38%) Min syst-48 1917.66 ( 0.00%) 1892.73 ( 1.30%) Min elsp-48 333.96 ( 0.00%) 332.57 ( 0.42%) Min user-96 12922.40 ( 0.00%) 12921.17 ( 0.01%) Min syst-96 2143.94 ( 0.00%) 2110.39 ( 1.56%) Min elsp-96 211.22 ( 0.00%) 210.47 ( 0.36%) Amean user-24 12063.99 ( 0.00%) 12030.78 * 0.28%* Amean syst-24 1755.20 ( 0.00%) 1735.53 * 1.12%* Amean elsp-24 601.60 ( 0.00%) 600.19 ( 0.23%) Amean user-48 12362.62 ( 0.00%) 12315.56 * 0.38%* Amean syst-48 1921.59 ( 0.00%) 1894.95 * 1.39%* Amean elsp-48 334.10 ( 0.00%) 332.82 * 0.38%* Amean user-96 12925.27 ( 0.00%) 12922.63 ( 0.02%) Amean syst-96 2146.66 ( 0.00%) 2122.20 * 1.14%* Amean elsp-96 211.96 ( 0.00%) 211.79 ( 0.08%) Note this patch isn't an universal win, it might hurt those workload which can benefit from packing. Though tasks which want to take advantages of lower communication latency of one cluster won't necessarily been packed in one cluster while kernel is not aware of clusters, they have some chance to be randomly packed. But this patch will make them more likely spread. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-10-15	topology: Represent clusters of CPUs within a die	Jonathan Cameron
	Both ACPI and DT provide the ability to describe additional layers of topology between that of individual cores and higher level constructs such as the level at which the last level cache is shared. In ACPI this can be represented in PPTT as a Processor Hierarchy Node Structure [1] that is the parent of the CPU cores and in turn has a parent Processor Hierarchy Nodes Structure representing a higher level of topology. For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cpus. All clusters share L3 cache data, but each cluster has local L3 tag. On the other hand, each clusters will share some internal system bus. +-----------------------------------+ +---------+ \| +------+ +------+ +--------------------------+ \| \| \| CPU0 \| \| cpu1 \| \| +-----------+ \| \| \| +------+ +------+ \| \| \| \| \| \| +----+ L3 \| \| \| \| +------+ +------+ cluster \| \| tag \| \| \| \| \| CPU2 \| \| CPU3 \| \| \| \| \| \| \| +------+ +------+ \| +-----------+ \| \| \| \| \| \| +-----------------------------------+ \| \| +-----------------------------------+ \| \| \| +------+ +------+ +--------------------------+ \| \| \| \| \| \| \| +-----------+ \| \| \| +------+ +------+ \| \| \| \| \| \| \| \| L3 \| \| \| \| +------+ +------+ +----+ tag \| \| \| \| \| \| \| \| \| \| \| \| \| \| +------+ +------+ \| +-----------+ \| \| \| \| \| \| +-----------------------------------+ \| L3 \| \| data \| +-----------------------------------+ \| \| \| +------+ +------+ \| +-----------+ \| \| \| \| \| \| \| \| \| \| \| \| \| +------+ +------+ +----+ L3 \| \| \| \| \| \| tag \| \| \| \| +------+ +------+ \| \| \| \| \| \| \| \| \| \| \| +-----------+ \| \| \| +------+ +------+ +--------------------------+ \| +-----------------------------------\| \| \| +-----------------------------------\| \| \| \| +------+ +------+ +--------------------------+ \| \| \| \| \| \| \| +-----------+ \| \| \| +------+ +------+ \| \| \| \| \| \| +----+ L3 \| \| \| \| +------+ +------+ \| \| tag \| \| \| \| \| \| \| \| \| \| \| \| \| \| +------+ +------+ \| +-----------+ \| \| \| \| \| \| +-----------------------------------+ \| \| +-----------------------------------+ \| \| \| +------+ +------+ +--------------------------+ \| \| \| \| \| \| \| +-----------+ \| \| \| +------+ +------+ \| \| \| \| \| \| \| \| L3 \| \| \| \| +------+ +------+ +---+ tag \| \| \| \| \| \| \| \| \| \| \| \| \| \| +------+ +------+ \| +-----------+ \| \| \| \| \| \| +-----------------------------------+ \| \| +-----------------------------------+ \| \| \| +------+ +------+ +--------------------------+ \| \| \| \| \| \| \| +-----------+ \| \| \| +------+ +------+ \| \| \| \| \| \| \| \| L3 \| \| \| \| +------+ +------+ +--+ tag \| \| \| \| \| \| \| \| \| \| \| \| \| \| +------+ +------+ \| +-----------+ \| \| \| \| +---------+ +-----------------------------------+ That means spreading tasks among clusters will bring more bandwidth while packing tasks within one cluster will lead to smaller cache synchronization latency. So both kernel and userspace will have a chance to leverage this topology to deploy tasks accordingly to achieve either smaller cache latency within one cluster or an even distribution of load among clusters for higher throughput. This patch exposes cluster topology to both kernel and userspace. Libraried like hwloc will know cluster by cluster_cpus and related sysfs attributes. PoC of HWLOC support at [2]. Note this patch only handle the ACPI case. Special consideration is needed for SMT processors, where it is necessary to move 2 levels up the hierarchy from the leaf nodes (thus skipping the processor core level). Note that arm64 / ACPI does not provide any means of identifying a die level in the topology but that may be unrelate to the cluster level. [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node structure (Type 0) [2] https://github.com/hisilicon/hwloc/tree/linux-cluster Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com
2021-10-15	sched: Add wrapper for get_wchan() to keep task blocked	Kees Cook
	Having a stable wchan means the process must be blocked and for it to stay that way while performing stack unwinding. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm] Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64] Link: https://lkml.kernel.org/r/20211008111626.332092234@infradead.org
2021-10-15	x86: Fix get_wchan() to support the ORC unwinder	Qi Zheng
	Currently, the kernel CONFIG_UNWINDER_ORC option is enabled by default on x86, but the implementation of get_wchan() is still based on the frame pointer unwinder, so the /proc/<pid>/wchan usually returned 0 regardless of whether the task <pid> is running. Reimplement get_wchan() by calling stack_trace_save_tsk(), which is adapted to the ORC and frame pointer unwinders. Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211008111626.271115116@infradead.org
2021-10-15	Merge tag 'kvmarm-fixes-5.15-2' of ↵	Paolo Bonzini
	git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 5.15, take #2 - Properly refcount pages used as a concatenated stage-2 PGD - Fix missing unlock when detecting the use of MTE+VM_SHARED
2021-10-15	KVM: SEV-ES: fix length of string I/O	Paolo Bonzini
	The size of the data in the scratch buffer is not divided by the size of each port I/O operation, so vcpu->arch.pio.count ends up being larger than it should be by a factor of size. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-15	ARM: configs: aspeed: Remove unused USB gadget devices	Joel Stanley
	The HID and mass storage gadget devices are used for the OpenBMC Web UI's remote keyboard/mouse feature. The others are not required, so disable them. Signed-off-by: Joel Stanley <joel@jms.id.au>
2021-10-15	ARM: config: aspeed: Enable Network Block Device	Joel Stanley
	NBD is used for remote media support in the OpenBMC Web UI. Signed-off-by: Joel Stanley <joel@jms.id.au>
2021-10-15	ARM: configs: aspeed: Enable pstore and lockup detectors	Joel Stanley
	These are enabled by OpenBMC platforms. Signed-off-by: Joel Stanley <joel@jms.id.au>
2021-10-15	ARM: configs: aspeed: Enable commonly used drivers	Joel Stanley
	These devices are used by BMC platforms. Enable them so mainline defconfigs can be used to test. Signed-off-by: Joel Stanley <joel@jms.id.au>
2021-10-15	ARM: configs: aspeed: Disable IPV6 SIT device	Joel Stanley
	No one is using this device on OpenBMC systems, and there is no code to manage it in phosphor-networkd (the default OpenBMC userspace) as of March 2021: > [...] if you don't add IPv6 addresses to the sit interface > it doesn't do anything. The defacto way to do that on an interface in > OpenBMC is to have it managed by phosphor-networkd. On top of this, to > support sit you would need a way to configure the local / remote IPv4 > addresses used to back it. Signed-off-by: Joel Stanley <joel@jms.id.au>
2021-10-15	ARM: dts: ls1021a-tsn: use generic "jedec,spi-nor" compatible for flash	Li Yang
	We cannot list all the possible chips used in different board revisions, just use the generic "jedec,spi-nor" compatible instead. This also fixes dtbs_check error: ['jedec,spi-nor', 's25fl256s1', 's25fl512s'] is too long Signed-off-by: Li Yang <leoyang.li@nxp.com> Reviewed-by: Kuldeep Singh <kuldeep.singh@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-15	ARM: dts: ls1021a: move thermal-zones node out of soc/	Li Yang
	This fixes dtbs-check error from simple-bus schema: soc: thermal-zones: {'type': 'object'} is not allowed for {'cpu-thermal': ..... } From schema: /home/leo/.local/lib/python3.8/site-packages/dtschema/schemas/simple-bus.yaml Signed-off-by: Li Yang <leoyang.li@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-15	ARM: dts: ls1021a-tsn: remove undocumented property "position" from mma8452 node	Li Yang
	Property "postion" is not documented in the mma8452 binding. Remove it to resolve the error in "make dtbs_check" Signed-off-by: Li Yang <leoyang.li@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-15	ARM: dts: ls1021a-qds: change fpga to simple-mfd device	Li Yang
	The FPGA is not really a bus but more like an MFD device. Change the compatible string from "simple-bus" to "simple-mfd". This also fix a node name issue with simple-bus schema. Signed-off-by: Li Yang <leoyang.li@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-15	ARM: dts: ls1021a: add #power-domain-cells for power-controller node	Li Yang
	Add the #power-domain-cells for power-controller node as required by the schema. Signed-off-by: Li Yang <leoyang.li@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-15	ARM: dts: ls1021a: add #dma-cells to qdma node	Li Yang
	Add the #dma-cells to align with the dma schema. Signed-off-by: Li Yang <leoyang.li@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-15	ARM: dts: ls1021a: fix memory node for schema check	Li Yang
	Fix the following error from "make dtbs_check" memory: False schema does not allow ... Signed-off-by: Li Yang <leoyang.li@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-15	ARM: dts: ls1021a: remove regulators simple-bus	Li Yang
	There is no regulator bus in hardware. So move the regulator nodes out and remove the regulators simple-bus. This also make the dts align with the simple-bus schema. Signed-off-by: Li Yang <leoyang.li@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2021-10-15	ARM: dts: ls1021a: disable ifc node by default	Li Yang
	Disable the bus in the SoC dtsi file to be enabled only in board dts files. Also breakup long values in the ifc node to fix dtbs_check. Signed-off-by: Li Yang <leoyang.li@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org>