summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-12-13Revert "tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set"Jakub Kicinski
This reverts commit f3f32a356c0d2379d4431364e74f101f8f075ce3. Paolo reports that the change disables autocorking even after the userspace sets TCP_CORK. Fixes: f3f32a356c0d ("tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set") Link: https://lore.kernel.org/r/0d30d5a41d3ac990573016308aaeacb40a9dc79f.camel@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-13Merge tag 'efi-urgent-for-v6.7-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: - Deal with a regression in the recently refactored x86 EFI stub code on older Dell systems by disabling randomization of the physical load address - Use the correct load address for relocatable Loongarch kernels * tag 'efi-urgent-for-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi/x86: Avoid physical KASLR on older Dell systems efi/loongarch: Use load address to calculate kernel entry address
2023-12-13i40e: Fix ST code value for Clause 45Ivan Vecera
ST code value for clause 45 that has been changed by commit 8196b5fd6c73 ("i40e: Refactor I40E_MDIO_CLAUSE* macros") is currently wrong. The mentioned commit refactored ..MDIO_CLAUSE??_STCODE_MASK so their value is the same for both clauses. The value is correct for clause 22 but not for clause 45. Fix the issue by adding a parameter to I40E_GLGEN_MSCA_STCODE_MASK macro that specifies required value. Fixes: 8196b5fd6c73 ("i40e: Refactor I40E_MDIO_CLAUSE* macros") Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-12-13ice: fix theoretical out-of-bounds access in ethtool link modesMichal Schmidt
To map phy types reported by the hardware to ethtool link mode bits, ice uses two lookup tables (phy_type_low_lkup, phy_type_high_lkup). The "low" table has 64 elements to cover every possible bit the hardware may report, but the "high" table has only 13. If the hardware reports a higher bit in phy_types_high, the driver would access memory beyond the lookup table's end. Instead of iterating through all 64 bits of phy_types_{low,high}, use the sizes of the respective lookup tables. Fixes: 9136e1f1e5c3 ("ice: refactor PHY type to ethtool link mode") Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-12-13bcachefs: Fix determining required file handle lengthJan Kara
The ->encode_fh method is responsible for setting amount of space required for storing the file handle if not enough space was provided. bch2_encode_fh() was not setting required length in that case which breaks e.g. fanotify. Fix it. Reported-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-13KVM: SEV: Do not intercept accesses to MSR_IA32_XSS for SEV-ES guestsMichael Roth
When intercepts are enabled for MSR_IA32_XSS, the host will swap in/out the guest-defined values while context-switching to/from guest mode. However, in the case of SEV-ES, vcpu->arch.guest_state_protected is set, so the guest-defined value is effectively ignored when switching to guest mode with the understanding that the VMSA will handle swapping in/out this register state. However, SVM is still configured to intercept these accesses for SEV-ES guests, so the values in the initial MSR_IA32_XSS are effectively read-only, and a guest will experience undefined behavior if it actually tries to write to this MSR. Fortunately, only CET/shadowstack makes use of this register on SEV-ES-capable systems currently, which isn't yet widely used, but this may become more of an issue in the future. Additionally, enabling intercepts of MSR_IA32_XSS results in #VC exceptions in the guest in certain paths that can lead to unexpected #VC nesting levels. One example is SEV-SNP guests when handling #VC exceptions for CPUID instructions involving leaf 0xD, subleaf 0x1, since they will access MSR_IA32_XSS as part of servicing the CPUID #VC, then generate another #VC when accessing MSR_IA32_XSS, which can lead to guest crashes if an NMI occurs at that point in time. Running perf on a guest while it is issuing such a sequence is one example where these can be problematic. Address this by disabling intercepts of MSR_IA32_XSS for SEV-ES guests if the host/guest configuration allows it. If the host/guest configuration doesn't allow for MSR_IA32_XSS, leave it intercepted so that it can be caught by the existing checks in kvm_{set,get}_msr_common() if the guest still attempts to access it. Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading") Cc: Alexey Kardashevskiy <aik@amd.com> Suggested-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20231016132819.1002933-4-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-13drm/panel: ltk050h3146w: Set burst mode for ltk050h3148wFarouk Bouabid
The ltk050h3148w variant expects the horizontal component lane byte clock cycle(lbcc) to be calculated using lane_mbps (burst mode) instead of the pixel clock. Using the pixel clock rate by default for this calculation was introduced in commit ac87d23694f4 ("drm/bridge: synopsys: dw-mipi-dsi: Use pixel clock rate to calculate lbcc") and starting from commit 93e82bb4de01 ("drm/bridge: synopsys: dw-mipi-dsi: Fix hcomponent lbcc for burst mode") only panels that support burst mode can keep using the lane_mbps. So add MIPI_DSI_MODE_VIDEO_BURST as part of the mode_flags for the dsi host. Fixes: 93e82bb4de01 ("drm/bridge: synopsys: dw-mipi-dsi: Fix hcomponent lbcc for burst mode") Signed-off-by: Farouk Bouabid <farouk.bouabid@theobroma-systems.com> Reviewed-by: Jessica Zhang <quic_jesszhan@quicinc.com> Signed-off-by: Heiko Stuebner <heiko@sntech.de> Link: https://patchwork.freedesktop.org/patch/msgid/20231213145045.41020-1-farouk.bouabid@theobroma-systems.com
2023-12-13fix ufs_get_locked_folio() breakageAl Viro
filemap_lock_folio() returns ERR_PTR(-ENOENT) if the thing is not in cache - not NULL like find_lock_page() used to. Fixes: 5fb7bd50b351 "ufs: add ufs_get_locked_folio and ufs_put_locked_folio" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-12-13io_uring/poll: don't enable lazy wake for POLLEXCLUSIVEJens Axboe
There are a few quirks around using lazy wake for poll unconditionally, and one of them is related the EPOLLEXCLUSIVE. Those may trigger exclusive wakeups, which wake a limited number of entries in the wait queue. If that wake number is less than the number of entries someone is waiting for (and that someone is also using DEFER_TASKRUN), then we can get stuck waiting for more entries while we should be processing the ones we already got. If we're doing exclusive poll waits, flag the request as not being compatible with lazy wakeups. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: 6ce4a93dbb5b ("io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-13ACPI: arm64: export acpi_arch_thermal_cpufreq_pctg()Arnd Bergmann
The cpufreq code can be in a loadable module, so the architecture support for it has to be exported: ERROR: modpost: "acpi_arch_thermal_cpufreq_pctg" [drivers/acpi/processor.ko] undefined! Fixes: 310293a2b941 ("ACPI: processor: reduce CPUFREQ thermal reduction pctg for Tegra241") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-13ACPI: extlog: Clear Extended Error Log status when RAS_CEC handled the errorTony Luck
When both CONFIG_RAS_CEC and CONFIG_ACPI_EXTLOG are enabled, Linux does not clear the status word of the BIOS supplied error record for corrected errors. This may prevent logging of subsequent uncorrected errors. Fix by clearing the status. Fixes: 23ba710a0864 ("x86/mce: Fix all mce notifiers to update the mce->kflags bitmask") Reported-by: Erwin Tsaur <erwin.tsaur@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-13KVM: selftests: Fix dynamic generation of configuration namesPaolo Bonzini
When we dynamically generate a name for a configuration in get-reg-list we use strcat() to append to a buffer allocated using malloc() but we never initialise that buffer. Since malloc() offers no guarantees regarding the contents of the memory it returns this can lead to us corrupting, and likely overflowing, the buffer: vregs: PASS vregs+pmu: PASS sve: PASS sve+pmu: PASS vregs+pauth_address+pauth_generic: PASS X?vr+gspauth_addre+spauth_generi+pmu: PASS The bug is that strcat() should have been strcpy(), and that replacement would be enough to fix it, but there are other things in the function that leave something to be desired. In particular, an (incorrectly) empty config would cause an out of bounds access to c->name[-1]. Since the strcpy() call relies on c->name[0..len-1] being initialized, enforce that invariant throughout the function. Fixes: 2f9ace5d4557 ("KVM: arm64: selftests: get-reg-list: Introduce vcpu configs") Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Co-developed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Mark Brown <broonie@kernel.org> Message-Id: <20231211-kvm-get-reg-list-str-init-v3-1-6554c71c77b1@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-13MAINTAINERS: powerpc: Add Aneesh & NaveenMichael Ellerman
Aneesh and Naveen are helping out with some aspects of upstream maintenance, add them as reviewers. Acked-by: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Acked-by: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231205051105.736470-1-mpe@ellerman.id.au
2023-12-13powerpc/pseries/vas: Migration suspend waits for no in-progress open windowsHaren Myneni
The hypervisor returns migration failure if all VAS windows are not closed. During pre-migration stage, vas_migration_handler() sets migration_in_progress flag and closes all windows from the list. The allocate VAS window routine checks the migration flag, setup the window and then add it to the list. So there is possibility of the migration handler missing the window that is still in the process of setup. t1: Allocate and open VAS t2: Migration event window lock vas_pseries_mutex If migration_in_progress set unlock vas_pseries_mutex return open window HCALL unlock vas_pseries_mutex Modify window HCALL lock vas_pseries_mutex setup window migration_in_progress=true Closes all windows from the list // May miss windows that are // not in the list unlock vas_pseries_mutex lock vas_pseries_mutex return if nr_closed_windows == 0 // No DLPAR CPU or migration add window to the list // Window will be added to the // list after the setup is completed unlock vas_pseries_mutex return unlock vas_pseries_mutex Close VAS window // due to DLPAR CPU or migration return -EBUSY This patch resolves the issue with the following steps: - Set the migration_in_progress flag without holding mutex. - Introduce nr_open_wins_progress counter in VAS capabilities struct - This counter tracks the number of open windows are still in progress - The allocate setup window thread closes windows if the migration is set and decrements nr_open_window_progress counter - The migration handler waits for no in-progress open windows. The code flow with the fix is as follows: t1: Allocate and open VAS t2: Migration event window lock vas_pseries_mutex If migration_in_progress set unlock vas_pseries_mutex return open window HCALL nr_open_wins_progress++ // Window opened, but not // added to the list yet unlock vas_pseries_mutex Modify window HCALL migration_in_progress=true setup window lock vas_pseries_mutex Closes all windows from the list While nr_open_wins_progress { unlock vas_pseries_mutex lock vas_pseries_mutex sleep if nr_closed_windows == 0 // Wait if any open window in or migration is not started // progress. The open window // No DLPAR CPU or migration // thread closes the window without add window to the list // adding to the list and return if nr_open_wins_progress-- // the migration is in progress. unlock vas_pseries_mutex return Close VAS window nr_open_wins_progress-- unlock vas_pseries_mutex return -EBUSY lock vas_pseries_mutex } unlock vas_pseries_mutex return Fixes: 37e6764895ef ("powerpc/pseries/vas: Add VAS migration handler") Signed-off-by: Haren Myneni <haren@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231125235104.3405008-1-haren@linux.ibm.com
2023-12-13Merge branch 'stmmac-bug-fixes'David S. Miller
Yanteng Si says: ==================== stmmac: Some bug fixes * Put Krzysztof's patch into my thread, pick Conor's Reviewed-by tag and Jiaxun's Acked-by tag.(prev version is RFC patch) * I fixed an Oops related to mdio, mainly to ensure that mdio is initialized before use, because it will be used in a series of patches I am working on. see <https://lore.kernel.org/loongarch/cover.1699533745.git.siyanteng@loongson.cn/T/#t> ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-13MIPS: dts: loongson: drop incorrect dwmac fallback compatibleKrzysztof Kozlowski
Device binds to proper PCI ID (LOONGSON, 0x7a03), already listed in DTS, so checking for some other compatible does not make sense. It cannot be bound to unsupported platform. Drop useless, incorrect (space in between) and undocumented compatible. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Yanteng Si <siyanteng@loongson.cn> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Acked-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-13stmmac: dwmac-loongson: drop useless check for compatible fallbackKrzysztof Kozlowski
Device binds to proper PCI ID (LOONGSON, 0x7a03), already listed in DTS, so checking for some other compatible does not make sense. It cannot be bound to unsupported platform. Drop useless, incorrect (space in between) and undocumented compatible. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Yanteng Si <siyanteng@loongson.cn> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Acked-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-13stmmac: dwmac-loongson: Make sure MDIO is initialized before useYanteng Si
Generic code will use mdio. If it is not initialized before use, the kernel will Oops. Fixes: 30bba69d7db4 ("stmmac: pci: Add dwmac support for Loongson") Signed-off-by: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-13dt-bindings: panel-simple-dsi: move LG 5" HD TFT LCD panel into DSI yamlDavid Heidelberg
Originally was in the panel-simple, but belongs to panel-simple-dsi. See arch/arm/boot/dts/nvidia/tegra114-roth.dts for more details. Resolves the following warning: ``` arch/arm/boot/dts/tegra114-roth.dt.yaml: panel@0: 'reg' does not match any of the regexes: 'pinctrl-[0-9]+' From schema: Documentation/devicetree/bindings/display/panel/panel-simple.yaml ``` Fixes: 310abcea76e9 ("dt-bindings: display: convert simple lg panels to DT Schema") Signed-off-by: David Heidelberg <david@ixit.cz> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Acked-by: Jessica Zhang <quic_jesszhan@quicinc.com> Link: https://lore.kernel.org/r/20231212200934.99262-1-david@ixit.cz Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org> Link: https://patchwork.freedesktop.org/patch/msgid/20231212200934.99262-1-david@ixit.cz
2023-12-13tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is setSalvatore Dipietro
Based on the tcp man page, if TCP_NODELAY is set, it disables Nagle's algorithm and packets are sent as soon as possible. However in the `tcp_push` function where autocorking is evaluated the `nonagle` value set by TCP_NODELAY is not considered which can trigger unexpected corking of packets and induce delays. For example, if two packets are generated as part of a server's reply, if the first one is not transmitted on the wire quickly enough, the second packet can trigger the autocorking in `tcp_push` and be delayed instead of sent as soon as possible. It will either wait for additional packets to be coalesced or an ACK from the client before transmitting the corked packet. This can interact badly if the receiver has tcp delayed acks enabled, introducing 40ms extra delay in completion times. It is not always possible to control who has delayed acks set, but it is possible to adjust when and how autocorking is triggered. Patch prevents autocorking if the TCP_NODELAY flag is set on the socket. Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and Apache Tomcat 9.0.83 running the basic servlet below: import java.io.IOException; import java.io.OutputStreamWriter; import java.io.PrintWriter; import javax.servlet.ServletException; import javax.servlet.http.HttpServlet; import javax.servlet.http.HttpServletRequest; import javax.servlet.http.HttpServletResponse; public class HelloWorldServlet extends HttpServlet { @Override protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { response.setContentType("text/html;charset=utf-8"); OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8"); String s = "a".repeat(3096); osw.write(s,0,s.length()); osw.flush(); } } Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS c6i.8xlarge instance. With the current auto-corking behavior and TCP_NODELAY set an additional 40ms latency from P99.99+ values are observed. With the patch applied we see no occurrences of 40ms latencies. The patch has also been tested with iperf and uperf benchmarks and no regression was observed. # No patch with tcp_autocorking=1 and TCP_NODELAY set on all sockets ./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.49.177:8080/hello/hello' ... 50.000% 0.91ms 75.000% 1.12ms 90.000% 1.46ms 99.000% 1.73ms 99.900% 1.96ms 99.990% 43.62ms <<< 40+ ms extra latency 99.999% 48.32ms 100.000% 49.34ms # With patch ./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.49.177:8080/hello/hello' ... 50.000% 0.89ms 75.000% 1.13ms 90.000% 1.44ms 99.000% 1.67ms 99.900% 1.78ms 99.990% 2.27ms <<< no 40+ ms extra latency 99.999% 3.71ms 100.000% 4.57ms Fixes: f54b311142a9 ("tcp: auto corking") Signed-off-by: Salvatore Dipietro <dipiets@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-13Merge tag 'mediatek-drm-fixes-20231211' of ↵Dave Airlie
https://git.kernel.org/pub/scm/linux/kernel/git/chunkuang.hu/linux into drm-fixes Mediatek DRM Fixes - 20231211 1. mtk_disp_gamma: Fix breakage due to merge issue 2. fix kernel oops if no crtc is found 3. Add spinlock for setting vblank event in atomic_begin 4. Fix access violation in mtk_drm_crtc_dma_dev_get Signed-off-by: Dave Airlie <airlied@redhat.com> From: Chun-Kuang Hu <chunkuang.hu@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20231211151510.6749-1-chunkuang.hu@kernel.org
2023-12-13ARM: dts: Fix occasional boot hang for am3 usbTony Lindgren
With subtle timings changes, we can now sometimes get an external abort on non-linefetch error booting am3 devices at sysc_reset(). This is because of a missing reset delay needed for the usb target module. Looks like we never enabled the delay earlier for am3, although a similar issue was seen earlier with a similar usb setup for dm814x as described in commit ebf244148092 ("ARM: OMAP2+: Use srst_udelay for USB on dm814x"). Cc: stable@vger.kernel.org Fixes: 0782e8572ce4 ("ARM: dts: Probe am335x musb with ti-sysc") Signed-off-by: Tony Lindgren <tony@atomide.com>
2023-12-13RISCV: KVM: update external interrupt atomically for IMSIC swfileYong-Xuan Wang
The emulated IMSIC update the external interrupt pending depending on the value of eidelivery and topei. It might lose an interrupt when it is interrupted before setting the new value to the pending status. For example, when VCPU0 sends an IPI to VCPU1 via IMSIC: VCPU0 VCPU1 CSRSWAP topei = 0 The VCPU1 has claimed all the external interrupt in its interrupt handler. topei of VCPU1's IMSIC = 0 set pending in VCPU1's IMSIC topei of VCPU1' IMSIC = 1 set the external interrupt pending of VCPU1 clear the external interrupt pending of VCPU1 When the VCPU1 switches back to VS mode, it exits the interrupt handler because the result of CSRSWAP topei is 0. If there are no other external interrupts injected into the VCPU1's IMSIC, VCPU1 will never know this pending interrupt unless it initiative read the topei. If the interruption occurs between updating interrupt pending in IMSIC and updating external interrupt pending of VCPU, it will not cause a problem. Suppose that the VCPU1 clears the IPI pending in IMSIC right after VCPU0 sets the pending, the external interrupt pending of VCPU1 will not be set because the topei is 0. But when the VCPU1 goes back to VS mode, the pending IPI will be reported by the CSRSWAP topei, it will not lose this interrupt. So we only need to make the external interrupt updating procedure as a critical section to avoid the problem. Fixes: db8b7e97d613 ("RISC-V: KVM: Add in-kernel virtualization of AIA IMSIC") Tested-by: Roy Lin <roy.lin@sifive.com> Tested-by: Wayling Chen <wayling.chen@sifive.com> Co-developed-by: Vincent Chen <vincent.chen@sifive.com> Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Signed-off-by: Yong-Xuan Wang <yongxuan.wang@sifive.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2023-12-13KVM: riscv: selftests: Fix get-reg-list print_reg defaultsAndrew Jones
print_reg() will print everything it knows when it encounters a register ID it's unfamiliar with in the default cases of its decoding switches. Fix several issues with these (until now, never tested) paths; missing newlines in printfs, missing complement operator in mask, and missing return in order to avoid continuing to decode. Fixes: 62d0c458f828 ("KVM: riscv: selftests: get-reg-list print_reg should never fail") Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Haibo Xu <haibo1.xu@intel.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2023-12-12Input: xpad - add Razer Wolverine V2 supportLuca Weiss
Add the VID and PID of Razer Wolverine V2 to xpad_device. Signed-off-by: Luca Weiss <luca@z3ntu.xyz> Link: https://lore.kernel.org/r/20231125-razer-wolverine-v2-v1-1-979fe9f9288e@z3ntu.xyz Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2023-12-12tracing: Add size check when printing trace_marker outputSteven Rostedt (Google)
If for some reason the trace_marker write does not have a nul byte for the string, it will overflow the print: trace_seq_printf(s, ": %s", field->buf); The field->buf could be missing the nul byte. To prevent overflow, add the max size that the buf can be by using the event size and the field location. int max = iter->ent_size - offsetof(struct print_entry, buf); trace_seq_printf(s, ": %*.s", max, field->buf); Link: https://lore.kernel.org/linux-trace-kernel/20231212084444.4619b8ce@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-12ring-buffer: Have saved event hold the entire eventSteven Rostedt (Google)
For the ring buffer iterator (non-consuming read), the event needs to be copied into the iterator buffer to make sure that a writer does not overwrite it while the user is reading it. If a write happens during the copy, the buffer is simply discarded. But the temp buffer itself was not big enough. The allocation of the buffer was only BUF_MAX_DATA_SIZE, which is the maximum data size that can be passed into the ring buffer and saved. But the temp buffer needs to hold the meta data as well. That would be BUF_PAGE_SIZE and not BUF_MAX_DATA_SIZE. Link: https://lore.kernel.org/linux-trace-kernel/20231212072558.61f76493@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 785888c544e04 ("ring-buffer: Have rb_iter_head_event() handle concurrent writer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-12ring-buffer: Do not update before stamp when switching sub-buffersSteven Rostedt (Google)
The ring buffer timestamps are synchronized by two timestamp placeholders. One is the "before_stamp" and the other is the "write_stamp" (sometimes referred to as the "after stamp" but only in the comments. These two stamps are key to knowing how to handle nested events coming in with a lockless system. When moving across sub-buffers, the before stamp is updated but the write stamp is not. There's an effort to put back the before stamp to something that seems logical in case there's nested events. But as the current event is about to cross sub-buffers, and so will any new nested event that happens, updating the before stamp is useless, and could even introduce new race conditions. The first event on a sub-buffer simply uses the sub-buffer's timestamp and keeps a "delta" of zero. The "before_stamp" and "write_stamp" are not used in the algorithm in this case. There's no reason to try to fix the before_stamp when this happens. As a bonus, it removes a cmpxchg() when crossing sub-buffers! Link: https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-12mm/mglru: reclaim offlined memcgs harderYu Zhao
In the effort to reduce zombie memcgs [1], it was discovered that the memcg LRU doesn't apply enough pressure on offlined memcgs. Specifically, instead of rotating them to the tail of the current generation (MEMCG_LRU_TAIL) for a second attempt, it moves them to the next generation (MEMCG_LRU_YOUNG) after the first attempt. Not applying enough pressure on offlined memcgs can cause them to build up, and this can be particularly harmful to memory-constrained systems. On Pixel 8 Pro, launching apps for 50 cycles: Before After Change Zombie memcgs 45 35 -22% [1] https://lore.kernel.org/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20231208061407.2125867-4-yuzhao@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: T.J. Mercier <tjmercier@google.com> Tested-by: T.J. Mercier <tjmercier@google.com> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mm/mglru: respect min_ttl_ms with memcgsYu Zhao
While investigating kswapd "consuming 100% CPU" [1] (also see "mm/mglru: try to stop at high watermarks"), it was discovered that the memcg LRU can breach the thrashing protection imposed by min_ttl_ms. Before the memcg LRU: kswapd() shrink_node_memcgs() mem_cgroup_iter() inc_max_seq() // always hit a different memcg lru_gen_age_node() mem_cgroup_iter() check the timestamp of the oldest generation After the memcg LRU: kswapd() shrink_many() restart: iterate the memcg LRU: inc_max_seq() // occasionally hit the same memcg if raced with lru_gen_rotate_memcg(): goto restart lru_gen_age_node() mem_cgroup_iter() check the timestamp of the oldest generation Specifically, when the restart happens in shrink_many(), it needs to stick with the (memcg LRU) generation it began with. In other words, it should neither re-read memcg_lru->seq nor age an lruvec of a different generation. Otherwise it can hit the same memcg multiple times without giving lru_gen_age_node() a chance to check the timestamp of that memcg's oldest generation (against min_ttl_ms). [1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/ Link: https://lkml.kernel.org/r/20231208061407.2125867-3-yuzhao@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <yuzhao@google.com> Tested-by: T.J. Mercier <tjmercier@google.com> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mm/mglru: try to stop at high watermarksYu Zhao
The initial MGLRU patchset didn't include the memcg LRU support, and it relied on should_abort_scan(), added by commit f76c83378851 ("mm: multi-gen LRU: optimize multiple memcgs"), to "backoff to avoid overshooting their aggregate reclaim target by too much". Later on when the memcg LRU was added, should_abort_scan() was deemed unnecessary, and the test results [1] showed no side effects after it was removed by commit a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard"). However, that test used memory.reclaim, which sets nr_to_reclaim to SWAP_CLUSTER_MAX. So it can overshoot only by SWAP_CLUSTER_MAX-1 pages, i.e., from nr_reclaimed=nr_to_reclaim-1 to nr_reclaimed=nr_to_reclaim+SWAP_CLUSTER_MAX-1. Compared with the batch size kswapd sets to nr_to_reclaim, SWAP_CLUSTER_MAX is tiny. Therefore that test isn't able to reproduce the worst case scenario, i.e., kswapd overshooting GBs on large systems and "consuming 100% CPU" (see the Closes tag). Bring back a simplified version of should_abort_scan() on top of the memcg LRU, so that kswapd stops when all eligible zones are above their respective high watermarks plus a small delta to lower the chance of KSWAPD_HIGH_WMARK_HIT_QUICKLY. Note that this only applies to order-0 reclaim, meaning compaction-induced reclaim can still run wild (which is a different problem). On Android, launching 55 apps sequentially: Before After Change pgpgin 838377172 802955040 -4% pgpgout 38037080 34336300 -10% [1] https://lore.kernel.org/20221222041905.2431096-1-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20231208061407.2125867-2-yuzhao@google.com Fixes: a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Closes: https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/ Tested-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: T.J. Mercier <tjmercier@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mm/mglru: fix underprotected page cacheYu Zhao
Unmapped folios accessed through file descriptors can be underprotected. Those folios are added to the oldest generation based on: 1. The fact that they are less costly to reclaim (no need to walk the rmap and flush the TLB) and have less impact on performance (don't cause major PFs and can be non-blocking if needed again). 2. The observation that they are likely to be single-use. E.g., for client use cases like Android, its apps parse configuration files and store the data in heap (anon); for server use cases like MySQL, it reads from InnoDB files and holds the cached data for tables in buffer pools (anon). However, the oldest generation can be very short lived, and if so, it doesn't provide the PID controller with enough time to respond to a surge of refaults. (Note that the PID controller uses weighted refaults and those from evicted generations only take a half of the whole weight.) In other words, for a short lived generation, the moving average smooths out the spike quickly. To fix the problem: 1. For folios that are already on LRU, if they can be beyond the tracking range of tiers, i.e., five accesses through file descriptors, move them to the second oldest generation to give them more time to age. (Note that tiers are used by the PID controller to statistically determine whether folios accessed multiple times through file descriptors are worth protecting.) 2. When adding unmapped folios to LRU, adjust the placement of them so that they are not too close to the tail. The effect of this is similar to the above. On Android, launching 55 apps sequentially: Before After Change workingset_refault_anon 25641024 25598972 0% workingset_refault_file 115016834 106178438 -8% Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: T.J. Mercier <tjmercier@google.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mm/shmem: fix race in shmem_undo_range w/THPDavid Stevens
Split folios during the second loop of shmem_undo_range. It's not sufficient to only split folios when dealing with partial pages, since it's possible for a THP to be faulted in after that point. Calling truncate_inode_folio in that situation can result in throwing away data outside of the range being targeted. [akpm@linux-foundation.org: tidy up comment layout] Link: https://lkml.kernel.org/r/20230418084031.3439795-1-stevensd@google.com Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios") Signed-off-by: David Stevens <stevensd@chromium.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12Revert "selftests: error out if kernel header files are not yet built"John Hubbard
This reverts commit 9fc96c7c19df ("selftests: error out if kernel header files are not yet built"). It turns out that requiring the kernel headers to be built as a prerequisite to building selftests, does not work in many cases. For example, Peter Zijlstra writes: "My biggest beef with the whole thing is that I simply do not want to use 'make headers', it doesn't work for me. I have a ton of output directories and I don't care to build tools into the output dirs, in fact some of them flat out refuse to work that way (bpf comes to mind)." [1] Therefore, stop erroring out on the selftests build. Additional patches will be required in order to change over to not requiring the kernel headers. [1] https://lore.kernel.org/20231208221007.GO28727@noisy.programming.kicks-ass.net Link: https://lkml.kernel.org/r/20231209020144.244759-1-jhubbard@nvidia.com Fixes: 9fc96c7c19df ("selftests: error out if kernel header files are not yet built") Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: David Hildenbrand <david@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Marcos Paulo de Souza <mpdesouza@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12crash_core: fix the check for whether crashkernel is from high memoryYuntao Wang
If crash_base is equal to CRASH_ADDR_LOW_MAX, it also indicates that the crashkernel memory is allocated from high memory. However, the current check only considers the case where crash_base is greater than CRASH_ADDR_LOW_MAX. Fix it. The runtime effects is that crashkernel high memory is successfully reserved, whereas the crashkernel low memory is bypassed in this case, then kdump kernel bootup will fail because of no low memory under 4G. This patch also includes some minor cleanups. Link: https://lkml.kernel.org/r/20231209141438.77233-1-ytcoode@gmail.com Fixes: 0ab97169aa05 ("crash_core: add generic function to do reservation") Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12x86, kexec: fix the wrong ifdeffery CONFIG_KEXECBaoquan He
With the current ifdeffery CONFIG_KEXEC, get_cmdline_acpi_rsdp() is only available when kexec_load interface is taken, while kexec_file_load interface can't make use of it. Now change it to CONFIG_KEXEC_CORE. Link: https://lkml.kernel.org/r/20231208073036.7884-6-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Eric DeVolder <eric_devolder@yahoo.com> Cc: Ignat Korchagin <ignat@cloudflare.com> Cc: kernel test robot <lkp@intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12sh, kexec: fix the incorrect ifdeffery and dependency of CONFIG_KEXECBaoquan He
The select of KEXEC for CRASH_DUMP in kernel/Kconfig.kexec will be dropped, then compiling errors will be triggered if below config items are set: === CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_CRASH_DUMP=y === Here, change the dependency of building kexec_core related object files, and the ifdeffery on SuperH from CONFIG_KEXEC to CONFIG_KEXEC_CORE. Link: https://lkml.kernel.org/r/20231208073036.7884-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Eric DeVolder <eric_devolder@yahoo.com> Cc: Ignat Korchagin <ignat@cloudflare.com> Cc: kernel test robot <lkp@intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mips, kexec: fix the incorrect ifdeffery and dependency of CONFIG_KEXECBaoquan He
The select of KEXEC for CRASH_DUMP in kernel/Kconfig.kexec will be dropped, then compiling errors will be triggered if below config items are set: === CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_CRASH_DUMP=y === -------------------------------------------------------------------- mipsel-linux-ld: kernel/kexec_core.o: in function `kimage_free': kernel/kexec_core.c:(.text+0x2200): undefined reference to `machine_kexec_cleanup' mipsel-linux-ld: kernel/kexec_core.o: in function `__crash_kexec': kernel/kexec_core.c:(.text+0x2480): undefined reference to `machine_crash_shutdown' mipsel-linux-ld: kernel/kexec_core.c:(.text+0x2488): undefined reference to `machine_kexec' mipsel-linux-ld: kernel/kexec_core.o: in function `kernel_kexec': kernel/kexec_core.c:(.text+0x29b8): undefined reference to `machine_shutdown' mipsel-linux-ld: kernel/kexec_core.c:(.text+0x29c0): undefined reference to `machine_kexec' -------------------------------------------------------------------- Here, change the dependency of building kexec_core related object files, and the ifdeffery in mips from CONFIG_KEXEC to CONFIG_KEXEC_CORE. Link: https://lkml.kernel.org/r/20231208073036.7884-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202311302042.sn8cDPIX-lkp@intel.com/ Cc: Eric DeVolder <eric_devolder@yahoo.com> Cc: Ignat Korchagin <ignat@cloudflare.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12m68k, kexec: fix the incorrect ifdeffery and build dependency of CONFIG_KEXECBaoquan He
The select of KEXEC for CRASH_DUMP in kernel/Kconfig.kexec will be dropped, then compiling errors will be triggered if below config items are set: === CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_CRASH_DUMP=y === Here, change the dependency of buinding machine_kexec.o relocate_kernel.o and the ifdeffery in asm/kexe.h to CONFIG_KEXEC_CORE. Link: https://lkml.kernel.org/r/20231208073036.7884-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Eric DeVolder <eric_devolder@yahoo.com> Cc: Ignat Korchagin <ignat@cloudflare.com> Cc: kernel test robot <lkp@intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12loongarch, kexec: change dependency of object filesBaoquan He
Patch series "kexec: fix the incorrect ifdeffery and dependency of CONFIG_KEXEC". The select of KEXEC for CRASH_DUMP in kernel/Kconfig.kexec will be dropped, then compiling errors will be triggered if below config items are set: === CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_CRASH_DUMP=y === E.g on mips, below link error are seen: -------------------------------------------------------------------- mipsel-linux-ld: kernel/kexec_core.o: in function `kimage_free': kernel/kexec_core.c:(.text+0x2200): undefined reference to `machine_kexec_cleanup' mipsel-linux-ld: kernel/kexec_core.o: in function `__crash_kexec': kernel/kexec_core.c:(.text+0x2480): undefined reference to `machine_crash_shutdown' mipsel-linux-ld: kernel/kexec_core.c:(.text+0x2488): undefined reference to `machine_kexec' mipsel-linux-ld: kernel/kexec_core.o: in function `kernel_kexec': kernel/kexec_core.c:(.text+0x29b8): undefined reference to `machine_shutdown' mipsel-linux-ld: kernel/kexec_core.c:(.text+0x29c0): undefined reference to `machine_kexec' -------------------------------------------------------------------- Here, change the incorrect dependency of building kexec_core related object files, and the ifdeffery on architectures from CONFIG_KEXEC to CONFIG_KEXEC_CORE. Testing: ======== Passed on mips and loognarch with the LKP reproducer. This patch (of 5): Currently, in arch/loongarch/kernel/Makefile, building machine_kexec.o relocate_kernel.o depends on CONFIG_KEXEC. Whereas, since we will drop the select of KEXEC for CRASH_DUMP in kernel/Kconfig.kexec, compiling error will be triggered if below config items are set: === CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_CRASH_DUMP=y === --------------------------------------------------------------- loongarch64-linux-ld: kernel/kexec_core.o: in function `.L209': >> kexec_core.c:(.text+0x1660): undefined reference to `machine_kexec_cleanup' loongarch64-linux-ld: kernel/kexec_core.o: in function `.L287': >> kexec_core.c:(.text+0x1c5c): undefined reference to `machine_crash_shutdown' >> loongarch64-linux-ld: kexec_core.c:(.text+0x1c64): undefined reference to `machine_kexec' loongarch64-linux-ld: kernel/kexec_core.o: in function `.L2^B5': >> kexec_core.c:(.text+0x2090): undefined reference to `machine_shutdown' loongarch64-linux-ld: kexec_core.c:(.text+0x20a0): undefined reference to `machine_kexec' --------------------------------------------------------------- Here, change the dependency of machine_kexec.o relocate_kernel.o to CONFIG_KEXEC_CORE can fix above building error. Link: https://lkml.kernel.org/r/20231208073036.7884-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20231208073036.7884-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202311300946.kHE9Iu71-lkp@intel.com/ Cc: Eric DeVolder <eric_devolder@yahoo.com> Cc: Ignat Korchagin <ignat@cloudflare.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mm/damon/core: make damon_start() waits until kdamond_fn() startsSeongJae Park
The cleanup tasks of kdamond threads including reset of corresponding DAMON context's ->kdamond field and decrease of global nr_running_ctxs counter is supposed to be executed by kdamond_fn(). However, commit 0f91d13366a4 ("mm/damon: simplify stop mechanism") made neither damon_start() nor damon_stop() ensure the corresponding kdamond has started the execution of kdamond_fn(). As a result, the cleanup can be skipped if damon_stop() is called fast enough after the previous damon_start(). Especially the skipped reset of ->kdamond could cause a use-after-free. Fix it by waiting for start of kdamond_fn() execution from damon_start(). Link: https://lkml.kernel.org/r/20231208175018.63880-1-sj@kernel.org Fixes: 0f91d13366a4 ("mm/damon: simplify stop mechanism") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Jakub Acs <acsjakub@amazon.de> Cc: Changbin Du <changbin.du@intel.com> Cc: Jakub Acs <acsjakub@amazon.de> Cc: <stable@vger.kernel.org> # 5.15.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12selftests/mm: cow: print ksft header before printing anything elseDavid Hildenbrand
Doing a ksft_print_msg() before the ksft_print_header() seems to confuse the ksft framework in a strange way: running the test on the cmdline results in the expected output. But piping the output somewhere else, results in some odd output, whereby we repeatedly get the same info printed: # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb page size: 2048 KiB # [INFO] detected hugetlb page size: 1048576 KiB # [INFO] huge zeropage is enabled TAP version 13 1..190 # [INFO] Anonymous memory tests in private mappings # [RUN] Basic COW after fork() ... with base page # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb page size: 2048 KiB # [INFO] detected hugetlb page size: 1048576 KiB # [INFO] huge zeropage is enabled TAP version 13 1..190 # [INFO] Anonymous memory tests in private mappings # [RUN] Basic COW after fork() ... with base page ok 1 No leak from parent into child # [RUN] Basic COW after fork() ... with swapped out base page # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb page size: 2048 KiB # [INFO] detected hugetlb page size: 1048576 KiB # [INFO] huge zeropage is enabled Doing the ksft_print_header() first seems to resolve that and gives us the output we expect: TAP version 13 # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb page size: 2048 KiB # [INFO] detected hugetlb page size: 1048576 KiB # [INFO] huge zeropage is enabled 1..190 # [INFO] Anonymous memory tests in private mappings # [RUN] Basic COW after fork() ... with base page ok 1 No leak from parent into child # [RUN] Basic COW after fork() ... with swapped out base page ok 2 No leak from parent into child # [RUN] Basic COW after fork() ... with THP ok 3 No leak from parent into child # [RUN] Basic COW after fork() ... with swapped-out THP ok 4 No leak from parent into child # [RUN] Basic COW after fork() ... with PTE-mapped THP ok 5 No leak from parent into child Link: https://lkml.kernel.org/r/20231206103558.38040-1-david@redhat.com Fixes: f4b5fd6946e2 ("selftests/vm: anon_cow: THP tests") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Nico Pache <npache@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mm: fix VMA heap bounds checkingKefeng Wang
After converting selinux to VMA heap check helper, the gcl triggers an execheap SELinux denial, which is caused by a changed logic check. Previously selinux only checked that the VMA range was within the VMA heap range, and the implementation checks the intersection between the two ranges, but the corner case (vm_end=start_brk, brk=vm_start) isn't handled correctly. Since commit 11250fd12eb8 ("mm: factor out VMA stack and heap checks") was only a function extraction, it seems that the issue was introduced by commit 0db0c01b53a1 ("procfs: fix /proc/<pid>/maps heap check"). Let's fix above corner cases, meanwhile, correct the wrong indentation of the stack and heap check helpers. Fixes: 11250fd12eb8 ("mm: factor out VMA stack and heap checks") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reported-by: Ondrej Mosnacek <omosnace@redhat.com> Closes: https://lore.kernel.org/selinux/CAFqZXNv0SVT0fkOK6neP9AXbj3nxJ61JAY4+zJzvxqJaeuhbFw@mail.gmail.com/ Tested-by: Ondrej Mosnacek <omosnace@redhat.com> Link: https://lkml.kernel.org/r/20231207152525.2607420-1-wangkefeng.wang@huawei.com Cc: David Hildenbrand <david@redhat.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Smalley <stephen.smalley.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12riscv: fix VMALLOC_START definitionBaoquan He
When below config items are set, compiler complained: -------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_CRASH_DUMP=y ...... ----------------------- ------------------------------------------------------------------- arch/riscv/kernel/crash_core.c: In function 'arch_crash_save_vmcoreinfo': arch/riscv/kernel/crash_core.c:11:58: warning: format '%lx' expects argument of type 'long unsigned int', but argument 2 has type 'int' [-Wformat=] 11 | vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START); | ~~^ | | | long unsigned int | %x ---------------------------------------------------------------------- This is because on riscv macro VMALLOC_START has different type when CONFIG_MMU is set or unset. arch/riscv/include/asm/pgtable.h: -------------------------------------------------- Changing it to _AC(0, UL) in case CONFIG_MMU=n can fix the warning. Link: https://lkml.kernel.org/r/ZW7OsX4zQRA3mO4+@MiWiFi-R3L-srv Signed-off-by: Baoquan He <bhe@redhat.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Cc: Eric DeVolder <eric_devolder@yahoo.com> Cc: Ignat Korchagin <ignat@cloudflare.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12kexec: drop dependency on ARCH_SUPPORTS_KEXEC from CRASH_DUMPIgnat Korchagin
In commit f8ff23429c62 ("kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP") we tried to fix a config regression, where CONFIG_CRASH_DUMP required CONFIG_KEXEC. However, it was not enough at least for arm64 platforms. While further testing the patch with our arm64 config I noticed that CONFIG_CRASH_DUMP is unavailable in menuconfig. This is because CONFIG_CRASH_DUMP still depends on the new CONFIG_ARCH_SUPPORTS_KEXEC introduced in commit 91506f7e5d21 ("arm64/kexec: refactor for kernel/Kconfig.kexec") and on arm64 CONFIG_ARCH_SUPPORTS_KEXEC requires CONFIG_PM_SLEEP_SMP=y, which in turn requires either CONFIG_SUSPEND=y or CONFIG_HIBERNATION=y neither of which are set in our config. Given that we already established that CONFIG_KEXEC (which is a switch for kexec system call itself) is not required for CONFIG_CRASH_DUMP drop CONFIG_ARCH_SUPPORTS_KEXEC dependency as well. The arm64 kernel builds just fine with CONFIG_CRASH_DUMP=y and with both CONFIG_KEXEC=n and CONFIG_KEXEC_FILE=n after f8ff23429c62 ("kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP") and this patch are applied given that the necessary shared bits are included via CONFIG_KEXEC_CORE dependency. [bhe@redhat.com: don't export some symbols when CONFIG_MMU=n] Link: https://lkml.kernel.org/r/ZW03ODUKGGhP1ZGU@MiWiFi-R3L-srv [bhe@redhat.com: riscv, kexec: fix dependency of two items] Link: https://lkml.kernel.org/r/ZW04G/SKnhbE5mnX@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20231129220409.55006-1-ignat@cloudflare.com Fixes: 91506f7e5d21 ("arm64/kexec: refactor for kernel/Kconfig.kexec") Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: <stable@vger.kernel.org> # 6.6+: f8ff234: kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP Cc: <stable@vger.kernel.org> # 6.6+ Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12Merge tag 'hid-for-linus-2023121201' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID fixes from Jiri Kosina: - Lenovo ThinkPad TrackPoint Keyboard II firmware-specific regression fix (Mikhail Khvainitski) - device-specific fixes (various authors) * tag 'hid-for-linus-2023121201' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: HID: apple: Add "hfd.cn" and "WKB603" to the list of non-apple keyboards HID: lenovo: Restrict detection of patched firmware only to USB cptkbd HID: Add quirk for Labtec/ODDOR/aikeec handbrake HID: i2c-hid: Add IDEA5002 to i2c_hid_acpi_blacklist[] mailmap: add address mapping for Jiri Kosina
2023-12-12dpll: sanitize possible null pointer dereference in dpll_pin_parent_pin_set()Jiri Pirko
User may not pass DPLL_A_PIN_STATE attribute in the pin set operation message. Sanitize that by checking if the attr pointer is not null and process the passed state attribute value only in that case. Reported-by: Xingyuan Mo <hdthky0@gmail.com> Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/r/20231211083758.1082853-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-12Merge branch 'ena-driver-xdp-bug-fixes'Jakub Kicinski
David Arinzon says: ==================== ENA driver XDP bug fixes This patchset contains multiple XDP-related bug fixes in the ENA driver. ==================== Link: https://lore.kernel.org/r/20231211062801.27891-1-darinzon@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-12net: ena: Fix XDP redirection errorDavid Arinzon
When sending TX packets, the meta descriptor can be all zeroes as no meta information is required (as in XDP). This patch removes the validity check, as when `disable_meta_caching` is enabled, such TX packets will be dropped otherwise. Fixes: 0e3a3f6dacf0 ("net: ena: support new LLQ acceleration mode") Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Link: https://lore.kernel.org/r/20231211062801.27891-5-darinzon@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-12net: ena: Fix DMA syncing in XDP path when SWIOTLB is onDavid Arinzon
This patch fixes two issues: Issue 1 ------- Description ``````````` Current code does not call dma_sync_single_for_cpu() to sync data from the device side memory to the CPU side memory before the XDP code path uses the CPU side data. This causes the XDP code path to read the unset garbage data in the CPU side memory, resulting in incorrect handling of the packet by XDP. Solution ```````` 1. Add a call to dma_sync_single_for_cpu() before the XDP code starts to use the data in the CPU side memory. 2. The XDP code verdict can be XDP_PASS, in which case there is a fallback to the non-XDP code, which also calls dma_sync_single_for_cpu(). To avoid calling dma_sync_single_for_cpu() twice: 2.1. Put the dma_sync_single_for_cpu() in the code in such a place where it happens before XDP and non-XDP code. 2.2. Remove the calls to dma_sync_single_for_cpu() in the non-XDP code for the first buffer only (rx_copybreak and non-rx_copybreak cases), since the new call that was added covers these cases. The call to dma_sync_single_for_cpu() for the second buffer and on stays because only the first buffer is handled by the newly added dma_sync_single_for_cpu(). And there is no need for special handling of the second buffer and on for the XDP path since currently the driver supports only single buffer packets. Issue 2 ------- Description ``````````` In case the XDP code forwarded the packet (ENA_XDP_FORWARDED), ena_unmap_rx_buff_attrs() is called with attrs set to 0. This means that before unmapping the buffer, the internal function dma_unmap_page_attrs() will also call dma_sync_single_for_cpu() on the whole buffer (not only on the data part of it). This sync is both wasteful (since a sync was already explicitly called before) and also causes a bug, which will be explained using the below diagram. The following diagram shows the flow of events causing the bug. The order of events is (1)-(4) as shown in the diagram. CPU side memory area (3)convert_to_xdp_frame() initializes the headroom with xdpf metadata || \/ ___________________________________ | | 0 | V 4K --------------------------------------------------------------------- | xdpf->data | other xdpf | < data > | tailroom ||...| | | fields | | GARBAGE || | --------------------------------------------------------------------- /\ /\ || || (4)ena_unmap_rx_buff_attrs() calls (2)dma_sync_single_for_cpu() dma_sync_single_for_cpu() on the copies data from device whole buffer page, overwriting side to CPU side memory the xdpf->data with GARBAGE. || 0 4K --------------------------------------------------------------------- | headroom | < data > | tailroom ||...| | GARBAGE | | GARBAGE || | --------------------------------------------------------------------- Device side memory area /\ || (1) device writes RX packet data After the call to ena_unmap_rx_buff_attrs() in (4), the xdpf->data becomes corrupted, and so when it is later accessed in ena_clean_xdp_irq()->xdp_return_frame(), it causes a page fault, crashing the kernel. Solution ```````` Explicitly tell ena_unmap_rx_buff_attrs() not to call dma_sync_single_for_cpu() by passing it the ENA_DMA_ATTR_SKIP_CPU_SYNC flag. Fixes: f7d625adeb7b ("net: ena: Add dynamic recycling mechanism for rx buffers") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Link: https://lore.kernel.org/r/20231211062801.27891-4-darinzon@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>