Age | Commit message | Author
2022-05-18 | mips: ingenic: Do not manually reference the CPU clock | Aidan MacDonald
It isn't necessary to manually walk the device tree and enable the CPU clock anymore. The CPU and other necessary clocks are now flagged as critical in the clock driver, which accomplishes the same thing in a more declarative fashion. Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com> Reviewed-by: Paul Cercueil <paul@crapouillou.net> Link: https://lore.kernel.org/r/20220428164454.17908-4-aidanmacdonald.0x0@gmail.com Tested-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com> # On X1000 and X1830 Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2022-05-18 | clk: ingenic: Mark critical clocks in Ingenic SoCs | Aidan MacDonald
Consider CPU, L2 cache, and memory clocks as critical to prevent them -- and the parent clocks -- from being automatically gated, since nothing calls clk_get() on these clocks. Gating the CPU clock hangs the processor, and gating memory makes external DRAM inaccessible. Normal kernel code can't hope to deal with either situation so those clocks have to be critical. The L2 cache is required only if caches are running, and could be gated if the kernel takes care to flush and disable caches before gating the clock. There's no mechanism to do this, and probably no reason to do it, so it's simpler to mark the L2 cache as critical. Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com> Reviewed-by: Paul Cercueil <paul@crapouillou.net> Link: https://lore.kernel.org/r/20220428164454.17908-3-aidanmacdonald.0x0@gmail.com Tested-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com> # On X1000 and X1830 Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2022-05-18 | clk: ingenic: Allow specifying common clock flags | Aidan MacDonald
Provide a flags field for clocks under the ingenic-cgu driver, which can be used to set generic common clock framework flags on the created clocks. For example, the CLK_IS_CRITICAL flag is needed for some clocks (such as CPU or memory) to stop them being automatically disabled. Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com> Reviewed-by: Paul Cercueil <paul@crapouillou.net> Link: https://lore.kernel.org/r/20220428164454.17908-2-aidanmacdonald.0x0@gmail.com Tested-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com> # On X1000 and X1830 Signed-off-by: Stephen Boyd <sboyd@kernel.org>
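As a rough sketch of how such a flag could be applied in a CGU clock table, assuming an entry layout like the existing ones (the index, macros and field placement below are illustrative, not the actual patch):

    /* Hypothetical ingenic-cgu table entry; index, registers and names are illustrative. */
    [CLK_CPU] = {
            "cpu", CGU_CLK_DIV | CGU_CLK_GATE,
            .flags = CLK_IS_CRITICAL,  /* the common clock framework will never auto-gate this */
            /* ...existing .parents, .div and .gate register descriptions unchanged... */
    },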
2022-05-18 | clk: ux500: fix a possible off-by-one in u8500_prcc_reset_base() | Hangyu Hua
Off-by-one will happen when index == ARRAY_SIZE(ur->base). Fixes: b14cbdfd467d ("clk: ux500: Add driver for the reset portions of PRCC") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Link: https://lore.kernel.org/r/20220518062537.17933-1-hbh25y@gmail.com Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
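The underlying pattern is the usual array bound check; a minimal sketch, not the driver's exact code ('ur' and its 'base' array are taken from the description above, the return value is illustrative):

    /* Valid indices are 0 .. ARRAY_SIZE(ur->base) - 1, so '>=' is required here;
     * a '>' comparison lets index == ARRAY_SIZE(ur->base) slip through. */
    if (index >= ARRAY_SIZE(ur->base))
            return NULL;
    return ur->base[index];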
2022-05-18 | drm/amd: Don't reset dGPUs if the system is going to s2idle | Mario Limonciello
An A+A configuration on ASUS ROG Strix G513QY proves that the ASIC reset for handling aborted suspend can't work with s2idle. This functionality was introduced in commit daf8de0874ab5b ("drm/amdgpu: always reset the asic in suspend (v2)"). A few other commits have gone on top of the ASIC reset, but this still doesn't work on the A+A configuration in s2idle. Avoid doing the reset on dGPUs specifically when using s2idle. Fixes: daf8de0874ab5b ("drm/amdgpu: always reset the asic in suspend (v2)") Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2008 Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2022-05-18 | libceph: fix misleading ceph_osdc_cancel_request() comment | Ilya Dryomov
cancel_request() never guaranteed that after its return the OSD client would be completely done with the OSD request. The callback (if specified) can still be invoked and a ref can still be held. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Xiubo Li <xiubli@redhat.com>
2022-05-18 | libceph: fix potential use-after-free on linger ping and resends | Ilya Dryomov
request_reinit() is not only ugly as the comment rightfully suggests, but also unsafe. Even though it is called with osdc->lock held for write in all cases, resetting the OSD request refcount can still race with handle_reply() and result in use-after-free. Taking linger ping as an example: handle_timeout thread handle_reply thread down_read(&osdc->lock) req = lookup_request(...) ... finish_request(req) # unregisters up_read(&osdc->lock) __complete_request(req) linger_ping_cb(req) # req->r_kref == 2 because handle_reply still holds its ref down_write(&osdc->lock) send_linger_ping(lreq) req = lreq->ping_req # same req # cancel_linger_request is NOT # called - handle_reply already # unregistered request_reinit(req) WARN_ON(req->r_kref != 1) # fires request_init(req) kref_init(req->r_kref) # req->r_kref == 1 after kref_init ceph_osdc_put_request(req) kref_put(req->r_kref) # req->r_kref == 0 after kref_put, req is freed <further req initialization/use> !!! This happens because send_linger_ping() always (re)uses the same OSD request for watch ping requests, relying on cancel_linger_request() to unregister it from the OSD client and rip its messages out from the messenger. send_linger() does the same for watch/notify registration and watch reconnect requests. Unfortunately cancel_request() doesn't guarantee that after it returns the OSD client would be completely done with the OSD request -- a ref could still be held and the callback (if specified) could still be invoked too. The original motivation for request_reinit() was inability to deal with allocation failures in send_linger() and send_linger_ping(). Switching to using osdc->req_mempool (currently only used by CephFS) respects that and allows us to get rid of request_reinit(). Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Acked-by: Jeff Layton <jlayton@kernel.org>
2022-05-18 | drm/amd: Don't reset dGPUs if the system is going to s2idle | Mario Limonciello
An A+A configuration on ASUS ROG Strix G513QY proves that the ASIC reset for handling aborted suspend can't work with s2idle. This functionality was introduced in commit daf8de0874ab5b ("drm/amdgpu: always reset the asic in suspend (v2)"). A few other commits have gone on top of the ASIC reset, but this still doesn't work on the A+A configuration in s2idle. Avoid doing the reset on dGPUs specifically when using s2idle. Fixes: daf8de0874ab5b ("drm/amdgpu: always reset the asic in suspend (v2)") Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2008 Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-18 | drm/amdgpu: Unmap legacy queue when MES is enabled | Luben Tuikov
This fixes a kernel oops when MES is not enabled. Reported-by: Kenny Ho <Kenny.Ho@amd.com> Suggested-by: Jack Xiao <Jack.Xiao@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Fixes: 18ee4ce63e0f32 ("drm/amdgpu: add mes unmap legacy queue routine") Fixes: 3d879e81f0f9ed ("drm/amdgpu: add init support for GFX11 (v2)") Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-18 | Merge tag 'devfreq-next-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux | Rafael J. Wysocki
Pull devfreq changes for 5.19-rc1 from Chanwoo Choi: "1. Update devfreq core - Add cpu based scaling support to the passive governor. Some devices, such as a cache, might require dynamic frequency scaling but are tied very tightly to the cpu frequency, so the passive governor is used to scale their frequency according to the current cpu frequency. To decide the frequency of the device, the governor does one of the following: : Derives the optimal devfreq device opp from the required-opps property of the parent cpu opp_table. : Scales the device frequency in proportion to the CPU frequency. So, if the CPUs are running at their max frequency, the device runs at its max frequency. If the CPUs are running at their min frequency, the device runs at its min frequency. It is interpolated for frequencies in between. 2. Update devfreq drivers - Update rk3399_dmc.c as follows: : Convert the dt-binding document to YAML and deprecate unused properties. : Use Hz units for the device-tree properties of rk3399_dmc. : rk3399_dmc is able to set the idle time before changing the dmc clock; specify the idle time parameters in nanoseconds in the dt binding. : Add new disable-freq properties to optimize the power-saving feature of rk3399_dmc. : Disable the devfreq-event device on remove() to fix an unbalanced enable-disable count. : Use devm_pm_opp_of_add_table(). : Block PMU (Power-Management Unit) transitions when the frequency is scaled by ARM Trusted Firmware, in order to fix the conflict between the PMU and the DMC (Dynamic Memory Controller)." * tag 'devfreq-next-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux: PM / devfreq: passive: Keep cpufreq_policy for possible cpus PM / devfreq: passive: Reduce duplicate code when passive_devfreq case PM / devfreq: Add cpu based scaling support to passive governor PM / devfreq: Export devfreq_get_freq_range symbol within devfreq PM / devfreq: rk3399_dmc: Block PMU during transitions soc: rockchip: power-domain: Manage resource conflicts with firmware PM / devfreq: rk3399_dmc: Avoid static (reused) profile PM / devfreq: rk3399_dmc: Use devm_pm_opp_of_add_table() PM / devfreq: rk3399_dmc: Disable edev on remove() PM / devfreq: rk3399_dmc: Support new *-ns properties PM / devfreq: rk3399_dmc: Support new disable-freq properties PM / devfreq: rk3399_dmc: Use bitfield macro definitions for ODT_PD PM / devfreq: rk3399_dmc: Drop excess timing properties PM / devfreq: rk3399_dmc: Drop undocumented ondemand DT props dt-bindings: devfreq: rk3399_dmc: Add more disable-freq properties dt-bindings: devfreq: rk3399_dmc: Specify idle params in nanoseconds dt-bindings: devfreq: rk3399_dmc: Fix Hz units dt-bindings: devfreq: rk3399_dmc: Deprecate unused/redundant properties dt-bindings: devfreq: rk3399_dmc: Convert to YAML
2022-05-18 | thermal: intel: hfi: remove NULL check after container_of() call | Haowen Bai
container_of() will never return NULL, so remove useless code. Signed-off-by: Haowen Bai <baihaowen@meizu.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-05-18 | io_uring: add fully sparse buffer registration | Pavel Begunkov
Honour IORING_RSRC_REGISTER_SPARSE not only for direct files but fixed buffers as well. It makes the rsrc API more consistent. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66f429e4912fe39fb3318217ff33a2853d4544be.1652879898.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 | powercap: intel_rapl: add support for ALDERLAKE_N | Zhang Rui
Add ALDERLAKE_N to the list of supported processor models in the Intel RAPL power capping driver. Signed-off-by: Zhang Rui <rui.zhang@intel.com> [ rjw: Changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-05-18 | x86/sev: Annotate stack change in the #VC handler | Lai Jiangshan
In idtentry_vc(), vc_switch_off_ist() determines a safe stack to switch to, off of the IST stack. Annotate the new stack switch with ENCODE_FRAME_POINTER in case UNWINDER_FRAME_POINTER is used. A stack walk before looks like this: CPU: 0 PID: 0 Comm: swapper Not tainted 5.18.0-rc7+ #2 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl dump_stack kernel_exc_vmm_communication asm_exc_vmm_communication ? native_read_msr ? __x2apic_disable.part.0 ? x2apic_setup ? cpu_init ? trap_init ? start_kernel ? x86_64_start_reservations ? x86_64_start_kernel ? secondary_startup_64_no_verify </TASK> and with the fix, the stack dump is exact: CPU: 0 PID: 0 Comm: swapper Not tainted 5.18.0-rc7+ #3 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl dump_stack kernel_exc_vmm_communication asm_exc_vmm_communication RIP: 0010:native_read_msr Code: ... < snipped regs > ? __x2apic_disable.part.0 x2apic_setup cpu_init trap_init start_kernel x86_64_start_reservations x86_64_start_kernel secondary_startup_64_no_verify </TASK> [ bp: Test in a SEV-ES guest and rewrite the commit message to explain what exactly this does. ] Fixes: a13644f3a53d ("x86/entry/64: Add entry code for #VC handler") Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20220316041612.71357-1-jiangshanlai@gmail.com
2022-05-18 | drm: msm: fix possible memory leak in mdp5_crtc_cursor_set() | Hangyu Hua
drm_gem_object_lookup will call drm_gem_object_get inside. So cursor_bo needs to be put when msm_gem_get_and_pin_iova fails. Fixes: e172d10a9c4a ("drm/msm/mdp5: Add hardware cursor support") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Link: https://lore.kernel.org/r/20220509061125.18585-1-hbh25y@gmail.com Signed-off-by: Rob Clark <robdclark@chromium.org>
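A minimal sketch of the kind of error path this implies (the variable names and surrounding function are assumptions based on the description, not the actual mdp5 code):

    cursor_bo = drm_gem_object_lookup(file, handle);   /* takes a reference */
    if (!cursor_bo)
            return -ENOENT;

    ret = msm_gem_get_and_pin_iova(cursor_bo, aspace, &cursor_addr);
    if (ret) {
            drm_gem_object_put(cursor_bo);   /* drop the lookup reference on failure */
            return -EINVAL;
    }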
2022-05-18 | fs-verity: remove unused parameter desc_size in fsverity_create_info() | Zhang Jianhua
The parameter desc_size in fsverity_create_info() is unused; it is not referenced anywhere. Its only purpose would be to indicate the size of struct fsverity_descriptor and from that calculate the size of the signature, but desc->sig_size already provides that, so remove it. Therefore there is also no need to acquire desc_size via fsverity_get_descriptor() in ensure_verity_info(), so remove the parameter desc_ret in fsverity_get_descriptor() too. Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220518132256.2297655-1-chris.zjh@huawei.com
2022-05-18 | drm/msm: Fix fb plane offset calculation | Rob Clark
The offset got dropped by accident. Fixes: d413e6f97134 ("drm/msm: Drop msm_gem_iova()") Signed-off-by: Rob Clark <robdclark@chromium.org> Reviewed-by: Abhinav Kumar <quic_abhinavk@quicinc.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Tested-by: Stephen Boyd <swboyd@chromium.org> # CoachZ Link: https://lore.kernel.org/r/20220510165216.3577068-1-robdclark@gmail.com Signed-off-by: Rob Clark <robdclark@chromium.org>
2022-05-18 | drm/msm/a6xx: Fix refcount leak in a6xx_gpu_init | Miaoqian Lin
of_parse_phandle() returns a node pointer with its refcount incremented, so we should call of_node_put() on it when it is no longer needed. a6xx_gmu_init() passes the node to of_find_device_by_node() and of_dma_configure(); of_find_device_by_node() takes its own reference, and of_dma_configure() doesn't need the node after use. Add the missing of_node_put() to avoid a refcount leak. Fixes: 4b565ca5a2cb ("drm/msm: Add A6XX device support") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Reviewed-by: Akhil P Oommen <quic_akhilpo@quicinc.com> Link: https://lore.kernel.org/r/20220512121955.56937-1-linmq006@gmail.com Signed-off-by: Rob Clark <robdclark@chromium.org>
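The general shape of the fix, as a hedged sketch (the property name and error label are illustrative, not the a6xx code itself):

    struct device_node *node = of_parse_phandle(pdev->dev.of_node, "example-phandle", 0);
    if (node) {
            ret = a6xx_gmu_init(a6xx_gpu, node);
            of_node_put(node);      /* balance the reference taken by of_parse_phandle() */
            if (ret)
                    goto fail;
    }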
2022-05-18 | drm/msm/dsi: don't powerup at modeset time for parade-ps8640 | Douglas Anderson
Commit 7d8e9a90509f ("drm/msm/dsi: move DSI host powerup to modeset time") caused sc7180 Chromebooks that use the parade-ps8640 bridge chip to fail to turn the display back on after it turns off. Unfortunately, it doesn't look easy to fix the parade-ps8640 driver to handle the new power sequence. The Linux driver has almost nothing in it and most of the logic for this bridge chip is in black-box firmware that the bridge chip uses. Also unfortunately, reverting the patch will break "tc358762". The long term solution here is probably Dave Stevenson's series [1] that would give more flexibility. However, that is likely not a quick fix. For the short term, we'll look at the compatible of the next bridge in the chain and go back to the old way for the Parade PS8640 bridge chip. If it's found that other bridge chips also need this workaround then we can add them to the list or consider inverting the condition. However, the hope is that the framework will not take too much longer to land and we won't have to add anything other than ps8640 here. [1] https://lore.kernel.org/r/cover.1646406653.git.dave.stevenson@raspberrypi.com Fixes: 7d8e9a90509f ("drm/msm/dsi: move DSI host powerup to modeset time") Suggested-by: Rob Clark <robdclark@gmail.com> Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Abhinav Kumar <quic_abhinavk@quicinc.com> Link: https://lore.kernel.org/r/20220513131504.v5.1.Ia196e35ad985059e77b038a41662faae9e26f411@changeid Signed-off-by: Rob Clark <robdclark@chromium.org>
2022-05-18 | cgroup: Make cgroup_debug static | Xiu Jianfeng
Make cgroup_debug static since it's only used in cgroup.c Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-05-18 | nvme: add support for TP4084 - Time-to-Ready Enhancements | Christoph Hellwig
Add support for using longer timeouts during controller initialization and letting the controller come up with namespaces that are not ready for I/O yet. We skip these not-ready namespaces during scanning and only bring them online once another scan is kicked off by the AEN that is set when the NRDY bit gets set in the I/O Command Set Independent Identify Namespace Data Structure. This asynchronous probing avoids blocking the kernel boot when controllers take a very long time to recover after unclean shutdowns (up to minutes). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de>
2022-05-18 | Fix double fget() in vhost_net_set_backend() | Al Viro
The descriptor table is a shared resource; two fget() calls on the same descriptor may return different struct file references. get_tap_ptr_ring() is called after we'd found (and pinned) the socket we'll be using, and it tries to find the private tun/tap data structures associated with it. Redoing the lookup by the same file descriptor we'd used to get the socket is racy - we need to use the same struct file. Thanks to Jason for spotting a braino in the original variant of the patch - I'd missed the use of fd == -1 for disabling the backend, and in that case we can end up with sock == NULL and sock != oldsock. Cc: stable@kernel.org Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-18 | vdpa/mlx5: Use consistent RQT size | Eli Cohen
The current code evaluates RQT size based on the configured number of virtqueues. This can raise an issue in the following scenario: Assume MQ was negotiated. 1. mlx5_vdpa_set_map() gets called. 2. handle_ctrl_mq() is called setting cur_num_vqs to some value, lower than the configured max VQs. 3. A second set_map gets called, but now a smaller number of VQs is used to evaluate the size of the RQT. 4. handle_ctrl_mq() is called with a value larger than what the RQT can hold. This will emit errors and the driver state is compromised. To fix this, we use a new field in struct mlx5_vdpa_net to hold the required number of entries in the RQT. This value is evaluated in mlx5_vdpa_set_driver_features() where we have the negotiated features all set up. In addition to that, we take into consideration the max capability of RQT entries early when the device is added, so we don't need to consider it again when creating the RQT. Last, we remove the use of mlx5_vdpa_max_qps(), which just returns max_vqs / 2, and make the code clearer. Fixes: 52893733f2c5 ("vdpa/mlx5: Add multiqueue support") Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2022-05-18 | PCI: microchip: Fix potential race in interrupt handling | Daire McNamara
Clear the MSI bit in ISTATUS_LOCAL register after reading it, but before reading and handling individual MSI bits from the ISTATUS_MSI register. This avoids a potential race where new MSI bits may be set on the ISTATUS_MSI register after it was read and be missed when the MSI bit in the ISTATUS_LOCAL register is cleared. ISTATUS_LOCAL is a read/write/clear register; the register's bits are set when the corresponding interrupt source is activated. Each source is independent and thus multiple sources may be active simultaneously. The processor can monitor and clear status bits. If one or more ISTATUS_LOCAL interrupt sources are active, the RootPort issues an interrupt towards the processor (on the AXI domain). Bit 28 of this register reports an MSI has been received by the RootPort. ISTATUS_MSI is a read/write/clear register. Bits 31-0 are asserted when an MSI with message number 31-0 is received by the RootPort. The processor must monitor and clear these bits. Effectively, Bit 28 of ISTATUS_LOCAL informs the processor that an MSI has arrived at the RootPort and ISTATUS_MSI informs the processor which MSI (in the range 0 - 31) needs handling. Reported by: Bjorn Helgaas <bhelgaas@google.com> Link: https://lore.kernel.org/linux-pci/20220127202000.GA126335@bhelgaas/ Link: https://lore.kernel.org/r/20220517141622.145581-1-daire.mcnamara@microchip.com Fixes: 6f15a9c9f941 ("PCI: microchip: Add Microchip PolarFire PCIe controller driver") Signed-off-by: Daire McNamara <daire.mcnamara@microchip.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
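The required ordering, as an illustrative sketch only (the register offset macros, mask and helper names below are assumptions based on the description, not the driver's exact code):

    static void sketch_handle_msi(void __iomem *bridge_base)
    {
            u32 status, msi;

            status = readl_relaxed(bridge_base + ISTATUS_LOCAL);
            if (!(status & MSI_INT_MASK))
                    return;

            /* Ack the aggregate MSI bit first: an MSI arriving from this point on
             * re-asserts ISTATUS_LOCAL instead of being silently lost. */
            writel_relaxed(MSI_INT_MASK, bridge_base + ISTATUS_LOCAL);

            msi = readl_relaxed(bridge_base + ISTATUS_MSI);
            /* ...dispatch each set bit 0-31 of 'msi' to its handler and clear it... */
    }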
2022-05-18 | Merge tag 'sound-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound | Linus Torvalds
Pull sound fixes from Takashi Iwai: "A collection of last-minute HD- and USB-audio quirks in addition to a fix for the legacy ISA wavefront driver. All look small and easy" * tag 'sound-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: usb-audio: Restore Rane SL-1 quirk ALSA: hda/realtek: fix right sounds and mute/micmute LEDs for HP machine ALSA: hda/realtek: Add quirk for TongFang devices with pop noise ALSA: hda/realtek: Add quirk for the Framework Laptop ALSA: wavefront: Proper check of get_user() error ALSA: hda/realtek: Add quirk for Dell Latitude 7520 ALSA: hda - fix unused Realtek function when PM is not enabled ALSA: usb-audio: Don't get sample rate for MCT Trigger 5 USB-to-HDMI
2022-05-18 | netfilter: nf_tables: disable expression reduction infra | Pablo Neira Ayuso
Either userspace or kernelspace would need to pre-fetch keys unconditionally before comparisons for this to work. Otherwise, the register tracking data is misleading and it might result in reducing expressions whose data is not yet in the registers. The first expression is also always guaranteed to be evaluated; however, certain expressions break before writing data to the registers and before comparing the data, leaving the register in an undetermined state. This patch disables this infrastructure for now. Fixes: b2d306542ff9 ("netfilter: nf_tables: do not reduce read-only expressions") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-18 | netfilter: flowtable: move dst_check to packet path | Ritaro Takenaka
Fixes sporadic IPv6 packet loss when flow offloading is enabled. IPv6 route GC and flowtable GC are not synchronized. When the dst_cache becomes stale and a packet passes through the flow before the flowtable GC tears it down, the packet can be dropped. So it is necessary to check the dst every time in the packet path. Fixes: 227e1e4d0d6c ("netfilter: nf_flowtable: skip device lookup from interface index") Signed-off-by: Ritaro Takenaka <ritarot634@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-18 | netfilter: flowtable: fix TCP flow teardown | Pablo Neira Ayuso
This patch addresses three possible problems: 1. ct gc may race to undo the timeout adjustment of the packet path, leaving the conntrack entry in place with the internal offload timeout (one day). 2. ct gc removes the ct because the IPS_OFFLOAD_BIT is not set and the CLOSE timeout is reached before the flow offload del. 3. tcp ct is always set to ESTABLISHED with a very long timeout in flow offload teardown/delete even though the state might already be CLOSED. Also, as a remark, we cannot assume that the FIN or RST packet is hitting flow table teardown, as the packet might get bumped to the slow path in nftables. This patch resets IPS_OFFLOAD_BIT from flow_offload_teardown(), so conntrack handles the tcp rst/fin packet which triggers the CLOSE/FIN state transition. Moreover, return the connection's ownership to conntrack upon teardown by clearing the offload flag and fixing the established timeout value. The flow table GC thread will asynchronously free the flow table and hardware offload entries. Before this patch, the IPS_OFFLOAD_BIT remained set for expired flows, which is also misleading since the flow is back on the classic conntrack path. If nf_ct_delete() removes the entry from the conntrack table, then it calls nf_ct_put() which decrements the refcnt. This is not a problem because the flowtable holds a reference to the conntrack object from the flow_offload_alloc() path which is released via flow_offload_free(). This patch also updates nft_flow_offload to skip packets in SYN_RECV state. Since we might miss or bump packets to the slow path, we do not know what will happen there while we are still in SYN_RECV, so this patch postpones offload until the next packet, which also aligns with the existing behaviour in tc-ct. flow_offload_teardown() does not reset the existing tcp state from flow_offload_fixup_tcp() to ESTABLISHED anymore; packets bumped to the slow path might have already updated the state to CLOSE/FIN. Joint work with Oz and Sven. Fixes: 1e5b2471bcc4 ("netfilter: nf_flow_table: teardown flow timeout race") Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-18 | ext4: fix memory leak in parse_apply_sb_mount_options() | Eric Biggers
If processing the on-disk mount options fails after any memory was allocated in the ext4_fs_context, e.g. s_qf_names, then this memory is leaked. Fix this by calling ext4_fc_free() instead of kfree() directly. Reproducer: mkfs.ext4 -F /dev/vdc tune2fs /dev/vdc -E mount_opts=usrjquota=file echo clear > /sys/kernel/debug/kmemleak mount /dev/vdc /vdc echo scan > /sys/kernel/debug/kmemleak sleep 5 echo scan > /sys/kernel/debug/kmemleak cat /sys/kernel/debug/kmemleak Fixes: 7edfd85b1ffd ("ext4: Completely separate options parsing and sb setup") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220513231605.175121-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
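A minimal sketch of the shape of the fix; the error label and local names are assumptions, the point is only the kfree() to ext4_fc_free() substitution:

    out_free:
            /* A bare kfree() of the context leaks anything the option parser attached
             * to it, e.g. s_qf_names; ext4_fc_free() releases those sub-allocations
             * before freeing the context itself. */
            ext4_fc_free(fc);
            return ret;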
2022-05-18 | ext4: reject the 'commit' option on ext2 filesystems | Eric Biggers
The 'commit' option is only applicable for ext3 and ext4 filesystems, and has never been accepted by the ext2 filesystem driver, so the ext4 driver shouldn't allow it on ext2 filesystems. This fixes a failure in xfstest ext4/053. Fixes: 8dc0aa8cf0f7 ("ext4: check incompatible mount options while mounting ext2/3") Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20220510183232.172615-1-ebiggers@kernel.org
2022-05-18 | ext4: remove duplicated #include of dax.h in inode.c | Yang Li
Fix following includecheck warning: ./fs/ext4/inode.c: linux/dax.h is included more than once. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20220504225025.44753-1-yang.lee@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-19 | KVM: PPC: Book3S HV: Fix vcore_blocked tracepoint | Fabiano Rosas
We removed most of the vcore logic from the P9 path but there's still a tracepoint that tried to dereference vc->runner. Fixes: ecb6a7207f92 ("KVM: PPC: Book3S HV P9: Remove most of the vcore logic") Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220328215831.320409-1-farosas@linux.ibm.com
2022-05-19 | KVM: PPC: Book3s: Remove real mode interrupt controller hcalls handlers | Alexey Kardashevskiy
Currently we have two sets of interrupt controller hypercall handlers, for real and virtual modes; this dates from POWER8 times when switching the MMU on was considered an expensive operation. POWER9, however, does not have dependent threads and the MMU is enabled for handling hcalls, so the XIVE native or XICS-on-XIVE real mode handlers never execute on real P9 and later CPUs. This untemplates the handlers, keeps only the real mode handlers for native XICS (up to POWER8), and removes the rest of the dead code. Changes in the functions are mechanical except for a few missing empty lines added to make checkpatch.pl happy. The default implemented hcalls list already contains the XICS hcalls so no change there. This should not cause any behavioral change. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220509071150.181250-1-aik@ozlabs.ru
2022-05-19 | KVM: PPC: Book3s: PR: Enable default TCE hypercalls | Alexey Kardashevskiy
When KVM_CAP_PPC_ENABLE_HCALL was introduced, H_GET_TCE and H_PUT_TCE were already implemented and enabled by default; however H_GET_TCE was missed out on PR KVM (probably because the handler was in the real mode code at the time). This enables H_GET_TCE by default. While at it, this wraps the checks in #ifdef CONFIG_SPAPR_TCE_IOMMU just like HV KVM. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220506073737.3823347-1-aik@ozlabs.ru
2022-05-19 | KVM: PPC: Book3s: Retire H_PUT_TCE/etc real mode handlers | Alexey Kardashevskiy
LoPAPR defines a guest-visible IOMMU with hypercalls to use it - H_PUT_TCE etc. This was first implemented on POWER7, where hypercalls would trap into KVM in real mode (with the MMU off). The problem with real mode is that some memory is not available and some API usage crashed the host, but enabling the MMU was an expensive operation. The problems with the real mode handlers are: 1. Occasionally these cannot complete the request, so the code is copied+modified to work in the virtual mode and very little is shared; 2. The real mode handlers have to be linked into vmlinux to work; 3. An exception in real mode immediately reboots the machine. If the small DMA window is used, the real mode handlers bring better performance. However, since POWER8 there has always been a bigger DMA window which VMs use to map the entire VM memory to avoid calling H_PUT_TCE. Such a 1:1 mapping happens once and uses H_PUT_TCE_INDIRECT (a bulk version of H_PUT_TCE), whose virtual mode handler is even closer to its real mode version. On POWER9 hypercalls trap straight into virtual mode, so the real mode handlers never execute on POWER9 and later CPUs. So, with the current use of the DMA windows and the MMU improvements in POWER9 and later, there is no point in duplicating the code. 32-bit passed-through devices may slow down, but we do not have many of these in practice. For example, with this applied, a 1Gbit ethernet adapter still demonstrates above 800Mbit/s of actual throughput. This removes the real mode handlers from KVM and the related code from the powernv platform. This updates the list of implemented hcalls in KVM-HV as the realmode handlers are removed. This changes the ABI - kvmppc_h_get_tce() moves to the KVM module and kvmppc_find_table() is static now. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220506053755.3820702-1-aik@ozlabs.ru
2022-05-19 | Merge branch 'fixes' into topic/ppc-kvm | Michael Ellerman
Merge our fixes branch. In particular this brings in the KVM TCE handling fix, which is a prerequisite for a subsequent patch.
2022-05-19 | KVM: PPC: Book3S HV: Initialize AMOR in nested entry | Fabiano Rosas
The hypervisor always sets AMOR to ~0, but let's ensure we're not passing stale values around. Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220425142151.1495142-1-farosas@linux.ibm.com
2022-05-19 | Merge branch 'fixes' into next | Michael Ellerman
Merge our fixes branch from this cycle. In particular this brings in a papr_scm.c change which a subsequent patch has a dependency on.
2022-05-18 | random: credit architectural init the exact amount | Jason A. Donenfeld
RDRAND and RDSEED can fail sometimes, which is fine. We currently initialize the RNG with 512 bits of RDRAND/RDSEED. We only need 256 bits of those to succeed in order to initialize the RNG. Instead of the current "all or nothing" approach, actually credit these contributions the amount that is actually contributed. Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-18 | random: handle latent entropy and command line from random_init() | Jason A. Donenfeld
Currently, start_kernel() adds latent entropy and the command line to the entropy pool *after* the RNG has been initialized, deferring when it's actually used by things like stack canaries until the next time the pool is seeded. This surely is not intended. Rather than splitting up which entropy gets added where and when between start_kernel() and random_init(), just do everything in random_init(), which should eliminate these kinds of bugs in the future. While we're at it, rename the awkwardly titled "rand_initialize()" to the more standard "random_init()" nomenclature. Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-18 | random: use proper jiffies comparison macro | Jason A. Donenfeld
This expands to exactly the same code that it replaces, but makes things consistent by using the same macro for jiffy comparisons throughout. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
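For illustration only (not the replaced hunk), the macro form is wraparound-safe where a direct comparison is not:

    /* Illustrative helper: time_after() copes with jiffies wrapping around,
     * while a raw 'jiffies > deadline' comparison breaks at the wrap point. */
    static bool deadline_passed(unsigned long deadline)
    {
            return time_after(jiffies, deadline);
    }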
2022-05-18 | random: remove ratelimiting for in-kernel unseeded randomness | Jason A. Donenfeld
The CONFIG_WARN_ALL_UNSEEDED_RANDOM debug option controls whether the kernel warns about all unseeded randomness or just the first instance. There's some complicated rate limiting and comparison to the previous caller, such that even with CONFIG_WARN_ALL_UNSEEDED_RANDOM enabled, developers still don't see all the messages or even an accurate count of how many were missed. This is the result of basically parallel mechanisms aimed at accomplishing more or less the same thing, added at different points in random.c history, which sort of compete with the first-instance-only limiting we have now. It turns out, however, that nobody cares about the first unseeded randomness instance of in-kernel users. The same first user has been there for ages now, and nobody is doing anything about it. It isn't even clear that anybody _can_ do anything about it. Most places that can do something about it have switched over to using get_random_bytes_wait() or wait_for_random_bytes(), which is the right thing to do, but there is still much code that needs randomness sometimes during init, and as a general rule, if you're not using one of the _wait functions or the readiness notifier callback, you're bound to be doing it wrong just based on that fact alone. So warning about this same first user that can't easily change is simply not an effective mechanism for anything at all. Users can't do anything about it, as the Kconfig text points out -- the problem isn't in userspace code -- and kernel developers don't or more often can't react to it. Instead, show the warning for all instances when CONFIG_WARN_ALL_UNSEEDED_RANDOM is set, so that developers can debug things as need be, or if it isn't set, don't show a warning at all. At the same time, CONFIG_WARN_ALL_UNSEEDED_RANDOM now implies setting random.ratelimit_disable=1 by default, since if you care about one you probably care about the other too. And we can clean up usage around the related urandom_warning ratelimiter as well (whose behavior isn't changing), so that it properly counts missed messages after the 10 message threshold is reached. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-18 | random: move initialization out of reseeding hot path | Jason A. Donenfeld
Initialization happens once -- by way of credit_init_bits() -- and then it never happens again. Therefore, it doesn't need to be in crng_reseed(), which is a hot path that is called multiple times. It also doesn't make sense to have it there, as initialization activity is better associated with initialization routines. After the prior commit, crng_reseed() now won't be called by multiple concurrent callers, which means that we can safely move the "finalize_init" logic into crng_init_bits() unconditionally. Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-18 | random: avoid initializing twice in credit race | Jason A. Donenfeld
Since all changes of crng_init now go through credit_init_bits(), we can fix a long standing race in which two concurrent callers of credit_init_bits() have the new bit count >= some threshold, but are doing so with crng_init as a lower threshold, checked outside of a lock, resulting in crng_reseed() or similar being called twice. In order to fix this, we can use the original cmpxchg value of the bit count, and only change crng_init when the bit count transitions from below a threshold to meeting the threshold. Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
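A simplified sketch of the idea; the pool and threshold names here follow the description above and the real function differs in detail:

    unsigned int orig, new;

    do {
            orig = READ_ONCE(input_pool.init_bits);
            new = min_t(unsigned int, POOL_BITS, orig + bits);
    } while (cmpxchg(&input_pool.init_bits, orig, new) != orig);

    /* Only the caller whose update crossed the threshold performs the transition,
     * so crng_reseed() cannot be run twice for the same crossing. */
    if (orig < POOL_READY_BITS && new >= POOL_READY_BITS)
            crng_reseed();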
2022-05-18 | random: use symbolic constants for crng_init states | Jason A. Donenfeld
crng_init represents a state machine, with three states, and various rules for transitions. For the longest time, we've been managing these with "0", "1", and "2", and expecting people to figure it out. To make the code more obvious, replace these with proper enum values representing the transition, and then redocument what each of these states mean. Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Joe Perches <joe@perches.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
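Roughly what the symbolic states look like (a sketch based on the description; exact names and thresholds may differ):

    enum {
            CRNG_EMPTY = 0, /* little to no entropy collected yet */
            CRNG_EARLY = 1, /* at least POOL_EARLY_BITS collected */
            CRNG_READY = 2  /* fully initialized with POOL_READY_BITS collected */
    };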
2022-05-18 | random32: use real rng for non-deterministic randomness | Jason A. Donenfeld
random32.c has two random number generators in it: one that is meant to be used deterministically, with some predefined seed, and one that does the same exact thing as random.c, except does it poorly. The first one has some use cases. The second one no longer does and can be replaced with calls to random.c's proper random number generator. The relatively recent siphash-based bad random32.c code was added in response to concerns that the prior random32.c was too deterministic. Out of fears that random.c was (at the time) too slow, this code was anonymously contributed. Then out of that emerged a kind of shadow entropy gathering system, with its own tentacles throughout various net code, added willy-nilly. Stop👏making👏bespoke👏random👏number👏generators👏. Fortunately, recent advances in random.c mean that we can stop playing with this sketchiness, and just use get_random_u32(), which is now fast enough. In micro benchmarks using RDPMC, I'm seeing the same median cycle count between the two functions, with the mean being _slightly_ higher due to batches refilling (which we can optimize further if need be). However, when doing *real* benchmarks of the net functions that actually use these random numbers, the mean cycles actually *decreased* slightly (with the median still staying the same), likely because the additional prandom code means icache misses and complexity, whereas random.c is generally already being used by something else nearby. The biggest benefit of this is that there are many users of prandom who probably should be using cryptographically secure random numbers. This makes all of those accidental cases become secure by just flipping a switch. Later on, we can do a tree-wide cleanup to remove the static inline wrapper functions that this commit adds. There are also some low-ish hanging fruits for making this even faster in the future: a get_random_u16() function for use in the networking stack will give a 2x performance boost there, using SIMD for ChaCha20 will let us compute 4 or 8 or 16 blocks of output in parallel, instead of just one, giving us large buffers for cheap, and introducing a get_random_*_bh() function that assumes irqs are already disabled will shave off a few cycles for ordinary calls. These are things we can chip away at down the road. Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
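The static inline wrappers mentioned above would look roughly like this (a sketch of the approach, not the exact header contents):

    static inline u32 prandom_u32(void)
    {
            return get_random_u32();
    }

    static inline void prandom_bytes(void *buf, size_t nbytes)
    {
            get_random_bytes(buf, nbytes);
    }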
2022-05-18 | siphash: use one source of truth for siphash permutations | Jason A. Donenfeld
The SipHash family of permutations is currently used in three places: - siphash.c itself, used in the ordinary way it was intended. - random32.c, in a construction from an anonymous contributor. - random.c, as part of its fast_mix function. Each one of these places reinvents the wheel with the same C code, same rotation constants, and same symmetry-breaking constants. This commit tidies things up a bit by placing macros for the permutations and constants into siphash.h, where each of the three .c users can access them. It also leaves a note dissuading more users of them from emerging. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
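The shared permutation, written as a macro all three users could include; this is a sketch of the idea and the in-tree macro may differ in name or form:

    #define SIPHASH_PERMUTATION(a, b, c, d) ( \
            (a) += (b), (b) = rol64((b), 13), (b) ^= (a), (a) = rol64((a), 32), \
            (c) += (d), (d) = rol64((d), 16), (d) ^= (c), \
            (a) += (d), (d) = rol64((d), 21), (d) ^= (a), \
            (c) += (b), (b) = rol64((b), 17), (b) ^= (c), (c) = rol64((c), 32))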
2022-05-18 | random: help compiler out with fast_mix() by using simpler arguments | Jason A. Donenfeld
Now that fast_mix() has more than one caller, gcc no longer inlines it. That's fine. But it also doesn't handle the compound literal argument we pass it very efficiently, nor does it handle the loop as well as it could. So just expand the code to spell out this function so that it generates the same code as it did before. Performance-wise, this now behaves as it did before the last commit. The difference in actual code size on x86 is 45 bytes, which is less than a cache line. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-18 | random: do not use input pool from hard IRQs | Jason A. Donenfeld
Years ago, a separate fast pool was added for interrupts, so that the cost associated with taking the input pool spinlocks and mixing into it would be avoided in places where latency is critical. However, one oversight was that add_input_randomness() and add_disk_randomness() still sometimes are called directly from the interrupt handler, rather than being deferred to a thread. This means that some unlucky interrupts will be caught doing a blake2s_compress() call and potentially spinning on input_pool.lock, which can also be taken by unprivileged users by writing into /dev/urandom. In order to fix this, add_timer_randomness() now checks whether it is being called from a hard IRQ and if so, just mixes into the per-cpu IRQ fast pool using fast_mix(), which is much faster and can be done lock-free. A nice consequence of this, as well, is that it means hard IRQ context FPU support is likely no longer useful. The entropy estimation algorithm used by add_timer_randomness() is also somewhat different than the one used for add_interrupt_randomness(). The former looks at deltas of deltas of deltas, while the latter just waits for 64 interrupts for one bit or for one second since the last bit. In order to bridge these, and since add_interrupt_randomness() runs after an add_timer_randomness() that's called from hard IRQ, we add to the fast pool credit the related amount, and then subtract one to account for add_interrupt_randomness()'s contribution. A downside of this, however, is that the num argument is potentially attacker controlled, which puts a bit more pressure on the fast_mix() sponge to do more than it's really intended to do. As a mitigating factor, the first 96 bits of input aren't attacker controlled (a cycle counter followed by zeros), which means it's essentially two rounds of siphash rather than one, which is somewhat better. It's also not that much different from add_interrupt_randomness()'s use of the irq stack instruction pointer register. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Filipe Manana <fdmanana@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-18 | arm64/sve: Move sve_free() into SVE code section | Geert Uytterhoeven
If CONFIG_ARM64_SVE is not set: arch/arm64/kernel/fpsimd.c:294:13: warning: ‘sve_free’ defined but not used [-Wunused-function] Fix this by moving sve_free() and __sve_free() into the existing section protected by "#ifdef CONFIG_ARM64_SVE", now the last user outside that section has been removed. Fixes: a1259dd80719 ("arm64/sve: Delay freeing memory in fpsimd_flush_thread()") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/cd633284683c24cb9469f8ff429915aedf67f868.1652798894.git.geert+renesas@glider.be Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
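The shape of the fix, sketched from the warning and description (the function bodies shown here are illustrative):

    #ifdef CONFIG_ARM64_SVE

    static void __sve_free(struct task_struct *task)
    {
            kfree(task->thread.sve_state);
            task->thread.sve_state = NULL;
    }

    static void sve_free(struct task_struct *task)
    {
            WARN_ON(test_tsk_thread_flag(task, TIF_SVE));
            __sve_free(task);
    }

    #endif /* CONFIG_ARM64_SVE */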