summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-08-10Merge tag 'efi-urgent-for-v5.14-rc4' of ↵Thomas Gleixner
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi into efi/urgent Pull EFI fixes from Ard Biesheuvel: A batch of fixes for the arm64 stub image loader: - fix a logic bug that can make the random page allocator fail spuriously - force reallocation of the Image when it overlaps with firmware reserved memory regions - fix an oversight that defeated on optimization introduced earlier where images loaded at a suitable offset are never moved if booting without randomization - complain about images that were not loaded at the right offset by the firmware image loader. Link: https://lore.kernel.org/r/20210803091215.2566-1-ardb@kernel.org
2021-08-10ovl: prevent private clone if bind mount is not allowedMiklos Szeredi
Add the following checks from __do_loopback() to clone_private_mount() as well: - verify that the mount is in the current namespace - verify that there are no locked children Reported-by: Alois Wohlschlager <alois1@gmx-topmail.de> Fixes: c771d683a62e ("vfs: introduce clone_private_mount()") Cc: <stable@vger.kernel.org> # v3.18 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10ovl: fix uninitialized pointer read in ovl_lookup_real_one()Miklos Szeredi
One error path can result in release_dentry_name_snapshot() being called before "name" was initialized by take_dentry_name_snapshot(). Fix by moving the release_dentry_name_snapshot() to immediately after the only use. Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10ovl: fix deadlock in splice writeMiklos Szeredi
There's possibility of an ABBA deadlock in case of a splice write to an overlayfs file and a concurrent splice write to a corresponding real file. The call chain for splice to an overlay file: -> do_splice [takes sb_writers on overlay file] -> do_splice_from -> iter_file_splice_write [takes pipe->mutex] -> vfs_iter_write ... -> ovl_write_iter [takes sb_writers on real file] And the call chain for splice to a real file: -> do_splice [takes sb_writers on real file] -> do_splice_from -> iter_file_splice_write [takes pipe->mutex] Syzbot successfully bisected this to commit 82a763e61e2b ("ovl: simplify file splice"). Fix by reverting the write part of the above commit and by adding missing bits from ovl_write_iter() into ovl_splice_write(). Fixes: 82a763e61e2b ("ovl: simplify file splice") Reported-and-tested-by: syzbot+579885d1a9a833336209@syzkaller.appspotmail.com Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10ovl: skip stale entries in merge dir cache iterationAmir Goldstein
On the first getdents call, ovl_iterate() populates the readdir cache with a list of entries, but for upper entries with origin lower inode, p->ino remains zero. Following getdents calls traverse the readdir cache list and call ovl_cache_update_ino() for entries with zero p->ino to lookup the entry in the overlay and return d_ino that is consistent with st_ino. If the upper file was unlinked between the first getdents call and the getdents call that lists the file entry, ovl_cache_update_ino() will not find the entry and fall back to setting d_ino to the upper real st_ino, which is inconsistent with how this object was presented to users. Instead of listing a stale entry with inconsistent d_ino, simply skip the stale entry, which is better for users. xfstest overlay/077 is failing without this patch. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/fstests/CAOQ4uxgR_cLnC_vdU5=seP3fwqVkuZM_-WfD6maFTMbMYq=a9w@mail.gmail.com/ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10bpf: Add missing bpf_read_[un]lock_trace() for syscall programYonghong Song
Commit 79a7f8bdb159d ("bpf: Introduce bpf_sys_bpf() helper and program type.") added support for syscall program, which is a sleepable program. But the program run missed bpf_read_lock_trace()/bpf_read_unlock_trace(), which is needed to ensure proper rcu callback invocations. This patch adds bpf_read_[un]lock_trace() properly. Fixes: 79a7f8bdb159d ("bpf: Introduce bpf_sys_bpf() helper and program type.") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210809235151.1663680-1-yhs@fb.com
2021-08-10bpf: Add lockdown check for probe_write_user helperDaniel Borkmann
Back then, commit 96ae52279594 ("bpf: Add bpf_probe_write_user BPF helper to be called in tracers") added the bpf_probe_write_user() helper in order to allow to override user space memory. Its original goal was to have a facility to "debug, divert, and manipulate execution of semi-cooperative processes" under CAP_SYS_ADMIN. Write to kernel was explicitly disallowed since it would otherwise tamper with its integrity. One use case was shown in cf9b1199de27 ("samples/bpf: Add test/example of using bpf_probe_write_user bpf helper") where the program DNATs traffic at the time of connect(2) syscall, meaning, it rewrites the arguments to a syscall while they're still in userspace, and before the syscall has a chance to copy the argument into kernel space. These days we have better mechanisms in BPF for achieving the same (e.g. for load-balancers), but without having to write to userspace memory. Of course the bpf_probe_write_user() helper can also be used to abuse many other things for both good or bad purpose. Outside of BPF, there is a similar mechanism for ptrace(2) such as PTRACE_PEEK{TEXT,DATA} and PTRACE_POKE{TEXT,DATA}, but would likely require some more effort. Commit 96ae52279594 explicitly dedicated the helper for experimentation purpose only. Thus, move the helper's availability behind a newly added LOCKDOWN_BPF_WRITE_USER lockdown knob so that the helper is disabled under the "integrity" mode. More fine-grained control can be implemented also from LSM side with this change. Fixes: 96ae52279594 ("bpf: Add bpf_probe_write_user BPF helper to be called in tracers") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org>
2021-08-10drm/meson: fix colour distortion from HDR set during vendor u-bootChristian Hewitt
Add support for the OSD1 HDR registers so meson DRM can handle the HDR properties set by Amlogic u-boot on G12A and newer devices which result in blue/green/pink colour distortion to display output. This takes the original patch submissions from Mathias [0] and [1] with corrections for formatting and the missing description and attribution needed for merge. [0] https://lore.kernel.org/linux-amlogic/59dfd7e6-fc91-3d61-04c4-94e078a3188c@baylibre.com/T/ [1] https://lore.kernel.org/linux-amlogic/CAOKfEHBx_fboUqkENEMd-OC-NSrf46nto+vDLgvgttzPe99kXg@mail.gmail.com/T/#u Fixes: 728883948b0d ("drm/meson: Add G12A Support for VIU setup") Suggested-by: Mathias Steiger <mathias.steiger@googlemail.com> Signed-off-by: Christian Hewitt <christianshewitt@gmail.com> Tested-by: Neil Armstrong <narmstrong@baylibre.com> Tested-by: Philip Milev <milev.philip@gmail.com> [narmsrong: adding missing space on second tested-by tag] Signed-off-by: Neil Armstrong <narmstrong@baylibre.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210806094005.7136-1-christianshewitt@gmail.com
2021-08-10Revert "usb: dwc3: gadget: Use list_replace_init() before traversing lists"Greg Kroah-Hartman
This reverts commit d25d85061bd856d6be221626605319154f9b5043 as it is reported to cause problems on many different types of boards. Reported-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com> Reported-by: John Stultz <john.stultz@linaro.org> Cc: Ray Chi <raychi@google.com> Link: https://lore.kernel.org/r/CANcMJZCEVxVLyFgLwK98hqBEdc0_n4P0x_K6Gih8zNH3ouzbJQ@mail.gmail.com Fixes: d25d85061bd8 ("usb: dwc3: gadget: Use list_replace_init() before traversing lists") Cc: stable <stable@vger.kernel.org> Cc: Felipe Balbi <balbi@kernel.org> Cc: Wesley Cheng <wcheng@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-10Merge tag 'iio-fixes-5.14a' of ↵Greg Kroah-Hartman
https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus Jonathan writes: First set of fixes for IIO in the 5.14 cycle adi,adis: - Ensure GPIO pin direction set explicitly in driver. fxls8952af: - Fix use of ret when not initialized. - Fix issue with use of module symbol from built in. hdc100x: - Add a margin to conversion time as some parts run to slowly. palmas-adc: - Fix a wrong exit condition that leads to adc period always being set to maximum value. st,sensors: - Drop a wrong restriction on number of interrupts in dt binding. ti-ads7950: - Ensure CS deasserted after channel read. * tag 'iio-fixes-5.14a' of https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio: iio: adc: Fix incorrect exit of for-loop iio: humidity: hdc100x: Add margin to the conversion time dt-bindings: iio: st: Remove wrong items length check iio: accel: fxls8962af: fix i2c dependency iio: adis: set GPIO reset pin direction iio: adc: ti-ads7950: Ensure CS is deasserted after reading channels iio: accel: fxls8962af: fix potential use of uninitialized symbol
2021-08-10locking/rtmutex: Use the correct rtmutex debugging config optionZhen Lei
It's CONFIG_DEBUG_RT_MUTEXES not CONFIG_DEBUG_RT_MUTEX. Fixes: f7efc4799f81 ("locking/rtmutex: Inline chainwalk depth check") Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Will Deacon <will@kernel.org> Acked-by: Boqun Feng <boqun.feng@gmail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210731123011.4555-1-thunder.leizhen@huawei.com
2021-08-10can: m_can: m_can_set_bittiming(): fix setting M_CAN_DBTP registerHussein Alasadi
This patch fixes the setting of the M_CAN_DBTP register contents: - use DBTP_ (the data bitrate macros) instead of NBTP_ which area used for the nominal bitrate - do not overwrite possibly-existing DBTP_TDC flag by ORing reg_btp instead of overwriting Link: https://lore.kernel.org/r/FRYP281MB06140984ABD9994C0AAF7433D1F69@FRYP281MB0614.DEUP281.PROD.OUTLOOK.COM Fixes: 20779943a080 ("can: m_can: use bits.h macros for all regmasks") Cc: Torin Cooper-Bennun <torin@maxiluxsystems.com> Cc: Chandrasekar Ramakrishnan <rcsekar@samsung.com> Signed-off-by: Hussein Alasadi <alasadi@arecs.eu> [mkl: update patch description, update indention] Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-08-10MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typoBaruch Siach
This patch fixes the abbreviated name of the Microchip CAN BUS Analyzer Tool. Fixes: 8a7b46fa7902 ("MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver") Link: https://lore.kernel.org/r/cc4831cb1c8759c15fb32c21fd326e831183733d.1627876781.git.baruch@tkos.co.il Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-08-09net/mlx5: Fix return value from tracer initializationAya Levin
Check return value of mlx5_fw_tracer_start(), set error path and fix return value of mlx5_fw_tracer_init() accordingly. Fixes: c71ad41ccb0c ("net/mlx5: FW tracer, events handling") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-09net/mlx5: Synchronize correct IRQ when destroying CQShay Drory
The CQ destroy is performed based on the IRQ number that is stored in cq->irqn. That number wasn't set explicitly during CQ creation and as expected some of the API users of mlx5_core_create_cq() forgot to update it. This caused to wrong synchronization call of the wrong IRQ with a number 0 instead of the real one. As a fix, set the IRQ number directly in the mlx5_core_create_cq() and update all users accordingly. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Fixes: ef1659ade359 ("IB/mlx5: Add DEVX support for CQ events") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-09net/mlx5e: TC, Fix error handling memory leakChris Mi
Free the offload sample action on error. Fixes: f94d6389f6a8 ("net/mlx5e: TC, Add support to offload sample action") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-09net/mlx5: Destroy pool->mutexShay Drory
Destroy pool->mutex when we destroy the pool. Fixes: c36326d38d93 ("net/mlx5: Round-Robin EQs over IRQs") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-09net/mlx5: Set all field of mlx5_irq before inserting it to the xarrayShay Drory
Currently irq->pool is set after the irq is insert to the xarray. Set irq->pool before the irq is inserted to the xarray. Fixes: 71e084e26414 ("net/mlx5: Allocating a pool of MSI-X vectors for SFs") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-09net/mlx5: Fix order of functions in mlx5_irq_detach_nb()Shay Drory
Change order of functions in mlx5_irq_detach_nb() so it will be a mirror of mlx5_irq_attach_nb. Fixes: 71e084e26414 ("net/mlx5: Allocating a pool of MSI-X vectors for SFs") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-09net/mlx5: Block switchdev mode while devlink traps are activeAya Levin
Since switchdev mode can't support devlink traps, verify there are no active devlink traps before moving eswitch to switchdev mode. If there are active traps, prevent the switchdev mode configuration. Fixes: eb3862a0525d ("net/mlx5e: Enable traps according to link state") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-09net/mlx5e: Destroy page pool after XDP SQ to fix use-after-freeMaxim Mikityanskiy
mlx5e_close_xdpsq does the cleanup: it calls mlx5e_free_xdpsq_descs to free the outstanding descriptors, which relies on mlx5e_page_release_dynamic and page_pool_release_page. However, page_pool_destroy is already called by this point, because mlx5e_close_rq runs before mlx5e_close_xdpsq. This commit fixes the use-after-free by swapping mlx5e_close_xdpsq and mlx5e_close_rq. The commit cited below started calling page_pool_destroy directly from the driver. Previously, the page pool was destroyed under a call_rcu from xdp_rxq_info_unreg_mem_model, which would defer the deallocation until after the XDPSQ is cleaned up. Fixes: 1da4bbeffe41 ("net: core: page_pool: add user refcnt and reintroduce page_pool_destroy") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-09net/mlx5: Bridge, fix ageing timeVlad Buslov
Ageing time is not converted from clock_t to jiffies which results incorrect ageing timeout calculation in workqueue update task. Fix it by applying clock_t_to_jiffies() to provided value. Fixes: c636a0f0f3f0 ("net/mlx5: Bridge, dynamic entry ageing") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-09net/mlx5e: Avoid creating tunnel headers for local routeRoi Dayan
It could be local and remote are on the same machine and the route result will be a local route which will result in creating encap id with src/dst mac address of 0. Fixes: a54e20b4fcae ("net/mlx5e: Add basic TC tunnel set action for SRIOV offloads") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-09net/mlx5: DR, Add fail on error check on decapAlex Vesker
While processing encapsulated packet on RX, one of the fields that is checked is the inner packet length. If the length as specified in the header doesn't match the actual inner packet length, the packet is invalid and should be dropped. However, such packet caused the NIC to hang. This patch turns on a 'fail_on_error' HW bit which allows HW to drop such an invalid packet while processing RX packet and trying to decap it. Fixes: ad17dc8cf910 ("net/mlx5: DR, Move STEv0 action apply logic") Signed-off-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-09net/mlx5: Don't skip subfunction cleanup in case of error in module initLeon Romanovsky
Clean SF resources if mlx5 eth failed to initialize. Fixes: 1958fc2f0712 ("net/mlx5: SF, Add auxiliary device driver") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-09scsi: mpt3sas: Fix incorrectly assigned error return and checkColin Ian King
Currently the call to _base_static_config_pages() is assigning the error return to variable 'rc' but checking the error return in error 'r'. Fix this by assigning the error return to variable 'r' instead of 'rc'. Link: https://lore.kernel.org/r/20210804134940.114011-1-colin.king@canonical.com Fixes: 19a622c39a9d ("scsi: mpt3sas: Handle firmware faults during first half of IOC init") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Addresses-Coverity: ("Unused value")
2021-08-09scsi: storvsc: Log TEST_UNIT_READY errors as warningsMichael Kelley
Commit 08f76547f08d ("scsi: storvsc: Update error logging") added more robust logging of errors, particularly those reported as Hyper-V errors. But this change produces extra logging noise in that TEST_UNIT_READY may report errors during the normal course of detecting device adds and removes. Fix this by logging TEST_UNIT_READY errors as warnings, so that log lines are produced only if the storvsc log level is changed to WARN level on the kernel boot line. Link: https://lore.kernel.org/r/1628269970-87876-1-git-send-email-mikelley@microsoft.com Fixes: 08f76547f08d ("scsi: storvsc: Update error logging") Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-09scsi: lpfc: Move initialization of phba->poll_list earlier to avoid crashEwan D. Milne
The phba->poll_list is traversed in case of an error in lpfc_sli4_hba_setup(), so it must be initialized earlier in case the error path is taken. [ 490.030738] lpfc 0000:65:00.0: 0:1413 Failed to init iocb list. [ 490.036661] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [ 490.044485] PGD 0 P4D 0 [ 490.047027] Oops: 0000 [#1] SMP PTI [ 490.050518] CPU: 0 PID: 7 Comm: kworker/0:1 Kdump: loaded Tainted: G I --------- - - 4.18. [ 490.060511] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 1.4.8 05/22/2018 [ 490.067994] Workqueue: events work_for_cpu_fn [ 490.072371] RIP: 0010:lpfc_sli4_cleanup_poll_list+0x20/0xb0 [lpfc] [ 490.078546] Code: cf e9 04 f7 fe ff 0f 1f 40 00 0f 1f 44 00 00 41 57 49 89 ff 41 56 41 55 41 54 4d 8d a79 [ 490.097291] RSP: 0018:ffffbd1a463dbcc8 EFLAGS: 00010246 [ 490.102518] RAX: 0000000000008200 RBX: ffff945cdb8c0000 RCX: 0000000000000000 [ 490.109649] RDX: 0000000000018200 RSI: ffff9468d0e16818 RDI: 0000000000000000 [ 490.116783] RBP: ffff945cdb8c1740 R08: 00000000000015c5 R09: 0000000000000042 [ 490.123915] R10: 0000000000000000 R11: ffffbd1a463dbab0 R12: ffff945cdb8c25c0 [ 490.131049] R13: 00000000fffffff4 R14: 0000000000001800 R15: ffff945cdb8c0000 [ 490.138182] FS: 0000000000000000(0000) GS:ffff9468d0e00000(0000) knlGS:0000000000000000 [ 490.146267] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 490.152013] CR2: 0000000000000000 CR3: 000000042ca10002 CR4: 00000000007706f0 [ 490.159146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 490.166277] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 490.173409] PKRU: 55555554 [ 490.176123] Call Trace: [ 490.178598] lpfc_sli4_queue_destroy+0x7f/0x3c0 [lpfc] [ 490.183745] lpfc_sli4_hba_setup+0x1bc7/0x23e0 [lpfc] [ 490.188797] ? kernfs_activate+0x63/0x80 [ 490.192721] ? kernfs_add_one+0xe7/0x130 [ 490.196647] ? __kernfs_create_file+0x80/0xb0 [ 490.201020] ? lpfc_pci_probe_one_s4.isra.48+0x46f/0x9e0 [lpfc] [ 490.206944] lpfc_pci_probe_one_s4.isra.48+0x46f/0x9e0 [lpfc] [ 490.212697] lpfc_pci_probe_one+0x179/0xb70 [lpfc] [ 490.217492] local_pci_probe+0x41/0x90 [ 490.221246] work_for_cpu_fn+0x16/0x20 [ 490.224994] process_one_work+0x1a7/0x360 [ 490.229009] ? create_worker+0x1a0/0x1a0 [ 490.232933] worker_thread+0x1cf/0x390 [ 490.236687] ? create_worker+0x1a0/0x1a0 [ 490.240612] kthread+0x116/0x130 [ 490.243846] ? kthread_flush_work_fn+0x10/0x10 [ 490.248293] ret_from_fork+0x35/0x40 [ 490.251869] Modules linked in: lpfc(+) xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4i [ 490.332609] CR2: 0000000000000000 Link: https://lore.kernel.org/r/20210809150947.18104-1-emilne@redhat.com Fixes: 93a4d6f40198 ("scsi: lpfc: Add registration for CPU Offline/Online events") Cc: stable@vger.kernel.org Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-09blk-iocost: fix lockdep warning on blkcg->lockMing Lei
blkcg->lock depends on q->queue_lock which may depend on another driver lock required in irq context, one example is dm-thin: Chain exists of: &pool->lock#3 --> &q->queue_lock --> &blkcg->lock Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&blkcg->lock); local_irq_disable(); lock(&pool->lock#3); lock(&q->queue_lock); <Interrupt> lock(&pool->lock#3); Fix the issue by using spin_lock_irq(&blkcg->lock) in ioc_weight_write(). Cc: Tejun Heo <tj@kernel.org> Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Link: https://lore.kernel.org/linux-block/CA+QYu4rzz6079ighEanS3Qq_Dmnczcf45ZoJoHKVLVATTo1e4Q@mail.gmail.com/T/#u Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210803070608.1766400-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09io_uring: fix ctx-exit io_rsrc_put_work() deadlockPavel Begunkov
__io_rsrc_put_work() might need ->uring_lock, so nobody should wait for rsrc nodes holding the mutex. However, that's exactly what io_ring_ctx_free() does with io_wait_rsrc_data(). Split it into rsrc wait + dealloc, and move the first one out of the lock. Cc: stable@vger.kernel.org Fixes: b60c8dce33895 ("io_uring: preparation for rsrc tagging") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0130c5c2693468173ec1afab714e0885d2c9c363.1628559783.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09io_uring: drop ctx->uring_lock before flushing work itemJens Axboe
Ammar reports that he's seeing a lockdep splat on running test/rsrc_tags from the regression suite: ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Tainted: G OE ------------------------------------------------------ kworker/2:4/2684 is trying to acquire lock: ffff88814bb1c0a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_rsrc_put_work+0x13d/0x1a0 but task is already holding lock: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}: __flush_work+0x31b/0x490 io_rsrc_ref_quiesce.part.0.constprop.0+0x35/0xb0 __do_sys_io_uring_register+0x45b/0x1060 do_syscall_64+0x35/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #0 (&ctx->uring_lock){+.+.}-{3:3}: __lock_acquire+0x119a/0x1e10 lock_acquire+0xc8/0x2f0 __mutex_lock+0x86/0x740 io_rsrc_put_work+0x13d/0x1a0 process_one_work+0x236/0x530 worker_thread+0x52/0x3b0 kthread+0x135/0x160 ret_from_fork+0x1f/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock((work_completion)(&(&ctx->rsrc_put_work)->work)); lock(&ctx->uring_lock); lock((work_completion)(&(&ctx->rsrc_put_work)->work)); lock(&ctx->uring_lock); *** DEADLOCK *** 2 locks held by kworker/2:4/2684: #0: ffff88810004d938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 #1: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 stack backtrace: CPU: 2 PID: 2684 Comm: kworker/2:4 Tainted: G OE 5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Hardware name: Acer Aspire ES1-421/OLVIA_BE, BIOS V1.05 07/02/2015 Workqueue: events io_rsrc_put_work Call Trace: dump_stack_lvl+0x6a/0x9a check_noncircular+0xfe/0x110 __lock_acquire+0x119a/0x1e10 lock_acquire+0xc8/0x2f0 ? io_rsrc_put_work+0x13d/0x1a0 __mutex_lock+0x86/0x740 ? io_rsrc_put_work+0x13d/0x1a0 ? io_rsrc_put_work+0x13d/0x1a0 ? io_rsrc_put_work+0x13d/0x1a0 ? process_one_work+0x1ce/0x530 io_rsrc_put_work+0x13d/0x1a0 process_one_work+0x236/0x530 worker_thread+0x52/0x3b0 ? process_one_work+0x530/0x530 kthread+0x135/0x160 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 which is due to holding the ctx->uring_lock when flushing existing pending work, while the pending work flushing may need to grab the uring lock if we're using IOPOLL. Fix this by dropping the uring_lock a bit earlier as part of the flush. Cc: stable@vger.kernel.org Link: https://github.com/axboe/liburing/issues/404 Tested-by: Ammar Faizi <ammarfaizi2@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09io-wq: fix IO_WORKER_F_FIXED issue in create_io_worker()Hao Xu
There may be cases like: A B spin_lock(wqe->lock) nr_workers is 0 nr_workers++ spin_unlock(wqe->lock) spin_lock(wqe->lock) nr_wokers is 1 nr_workers++ spin_unlock(wqe->lock) create_io_worker() acct->worker is 1 create_io_worker() acct->worker is 1 There should be one worker marked IO_WORKER_F_FIXED, but no one is. Fix this by introduce a new agrument for create_io_worker() to indicate if it is the first worker. Fixes: 3d4e4face9c1 ("io-wq: fix no lock protection of acct->nr_worker") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210808135434.68667-3-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09io-wq: fix bug of creating io-wokers unconditionallyHao Xu
The former patch to add check between nr_workers and max_workers has a bug, which will cause unconditionally creating io-workers. That's because the result of the check doesn't affect the call of create_io_worker(), fix it by bringing in a boolean value for it. Fixes: 21698274da5b ("io-wq: fix lack of acct->nr_workers < acct->max_workers judgement") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210808135434.68667-2-haoxu@linux.alibaba.com [axboe: drop hunk that isn't strictly needed] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09io_uring: rsrc ref lock needs to be IRQ safeJens Axboe
Nadav reports running into the below splat on re-enabling softirqs: WARNING: CPU: 2 PID: 1777 at kernel/softirq.c:364 __local_bh_enable_ip+0xaa/0xe0 Modules linked in: CPU: 2 PID: 1777 Comm: umem Not tainted 5.13.1+ #161 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020 RIP: 0010:__local_bh_enable_ip+0xaa/0xe0 Code: a9 00 ff ff 00 74 38 65 ff 0d a2 21 8c 7a e8 ed 1a 20 00 fb 66 0f 1f 44 00 00 5b 41 5c 5d c3 65 8b 05 e6 2d 8c 7a 85 c0 75 9a <0f> 0b eb 96 e8 2d 1f 20 00 eb a5 4c 89 e7 e8 73 4f 0c 00 eb ae 65 RSP: 0018:ffff88812e58fcc8 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000201 RCX: dffffc0000000000 RDX: 0000000000000007 RSI: 0000000000000201 RDI: ffffffff8898c5ac RBP: ffff88812e58fcd8 R08: ffffffff8575dbbf R09: ffffed1028ef14f9 R10: ffff88814778a7c3 R11: ffffed1028ef14f8 R12: ffffffff85c9e9ae R13: ffff88814778a000 R14: ffff88814778a7b0 R15: ffff8881086db890 FS: 00007fbcfee17700(0000) GS:ffff8881e0300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000c0402a5008 CR3: 000000011c1ac003 CR4: 00000000003706e0 Call Trace: _raw_spin_unlock_bh+0x31/0x40 io_rsrc_node_ref_zero+0x13e/0x190 io_dismantle_req+0x215/0x220 io_req_complete_post+0x1b8/0x720 __io_complete_rw.isra.0+0x16b/0x1f0 io_complete_rw+0x10/0x20 where it's clear we end up calling the percpu count release directly from the completion path, as it's in atomic mode and we drop the last ref. For file/block IO, this can be from IRQ context already, and the softirq locking for rsrc isn't enough. Just make the lock fully IRQ safe, and ensure we correctly safe state from the release path as we don't know the full context there. Reported-by: Nadav Amit <nadav.amit@gmail.com> Tested-by: Nadav Amit <nadav.amit@gmail.com> Link: https://lore.kernel.org/io-uring/C187C836-E78B-4A31-B24C-D16919ACA093@gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09Merge branch 'for-5.14-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "One commit to fix a possible A-A deadlock around u64_stats_sync on 32bit machines caused by updating it without disabling IRQ when it may be read from IRQ context" * 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync
2021-08-09Merge branch 'add-frag-page-support-in-page-pool'Jakub Kicinski
Yunsheng Lin says: ==================== add frag page support in page pool This patchset adds frag page support in page pool and enable skb's page frag recycling based on page pool in hns3 drvier. ==================== Link: https://lore.kernel.org/r/1628217982-53533-1-git-send-email-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09net: hns3: support skb's frag page recycling based on page poolYunsheng Lin
This patch adds skb's frag page recycling support based on the frag page support in page pool. The performance improves above 10~20% for single thread iperf TCP flow with IOMMU disabled when iperf server and irq/NAPI have a different CPU. The performance improves about 135%(14Gbit to 33Gbit) for single thread iperf TCP flow when IOMMU is in strict mode and iperf server shares the same cpu with irq/NAPI. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09page_pool: add frag page recycling support in page poolYunsheng Lin
Currently page pool only support page recycling when there is only one user of the page, and the split page reusing implemented in the most driver can not use the page pool as bing-pong way of reusing requires the multi user support in page pool. Those reusing or recycling has below limitations: 1. page from page pool can only be used be one user in order for the page recycling to happen. 2. Bing-pong way of reusing in most driver does not support multi desc using different part of the same page in order to save memory. So add multi-users support and frag page recycling in page pool to overcome the above limitation. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09page_pool: add interface to manipulate frag count in page poolYunsheng Lin
For 32 bit systems with 64 bit dma, dma_addr[1] is used to store the upper 32 bit dma addr, those system should be rare those days. For normal system, the dma_addr[1] in 'struct page' is not used, so we can reuse dma_addr[1] for storing frag count, which means how many frags this page might be splited to. In order to simplify the page frag support in the page pool, the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate the 32 bit systems with 64 bit dma, and the page frag support in page pool is disabled for such system. The newly added page_pool_set_frag_count() is called to reserve the maximum frag count before any page frag is passed to the user. The page_pool_atomic_sub_frag_count_return() is called when user is done with the page frag. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09page_pool: keep pp info as long as page pool owns the pageYunsheng Lin
Currently, page->pp is cleared and set everytime the page is recycled, which is unnecessary. So only set the page->pp when the page is added to the page pool and only clear it when the page is released from the page pool. This is also a preparation to support allocating frag page in page pool. Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09bareudp: Fix invalid read beyond skb's linear dataGuillaume Nault
Data beyond the UDP header might not be part of the skb's linear data. Use skb_copy_bits() instead of direct access to skb->data+X, so that we read the correct bytes even on a fragmented skb. Fixes: 4b5f67232d95 ("net: Special handling for IP & MPLS.") Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://lore.kernel.org/r/7741c46545c6ef02e70c80a9b32814b22d9616b3.1628264975.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09net: openvswitch: fix kernel-doc warnings in flow.cRandy Dunlap
Repair kernel-doc notation in a few places to make it conform to the expected format. Fixes the following kernel-doc warnings: flow.c:296: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Parse vlan tag from vlan header. flow.c:296: warning: missing initial short description on line: * Parse vlan tag from vlan header. flow.c:537: warning: No description found for return value of 'key_extract_l3l4' flow.c:769: warning: No description found for return value of 'key_extract' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pravin B Shelar <pshelar@ovn.org> Cc: dev@openvswitch.org Link: https://lore.kernel.org/r/20210808190834.23362-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09psample: Add a fwd declaration for skbuffRoi Dayan
Without this there is a warning if source files include psample.h before skbuff.h or doesn't include it at all. Fixes: 6ae0a6286171 ("net: Introduce psample, a new genetlink channel for packet sampling") Signed-off-by: Roi Dayan <roid@nvidia.com> Link: https://lore.kernel.org/r/20210808065242.1522535-1-roid@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09MAINTAINERS: update Vineet's email addressVineet Gupta
I'll be leaving Synopsys shortly, but will continue to handle maintenance for the transition period. Signed-off-by: Vineet Gupta <vgupta@kernel.org>
2021-08-09selftests/bpf: Add tests for XDP bondingJussi Maki
Add a test suite to test XDP bonding implementation over a pair of veth devices. Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210731055738.16820-8-joamaki@gmail.com
2021-08-09selftests/bpf: Fix xdp_tx.c prog section nameJussi Maki
The program type cannot be deduced from 'tx' which causes an invalid argument error when trying to load xdp_tx.o using the skeleton. Rename the section name to "xdp" so that libbpf can deduce the type. Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210731055738.16820-7-joamaki@gmail.com
2021-08-09net, core: Allow netdev_lower_get_next_private_rcu in bh contextJussi Maki
For the XDP bonding slave lookup to work in the NAPI poll context in which the redudant rcu_read_lock() has been removed we have to follow the same approach as in 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context") and modify the WARN_ON to also check rcu_read_lock_bh_held(). Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210731055738.16820-6-joamaki@gmail.com
2021-08-09bpf, devmap: Exclude XDP broadcast to master deviceJussi Maki
If the ingress device is bond slave, do not broadcast back through it or the bond master. Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210731055738.16820-5-joamaki@gmail.com
2021-08-09net, bonding: Add XDP support to the bonding driverJussi Maki
XDP is implemented in the bonding driver by transparently delegating the XDP program loading, removal and xmit operations to the bonding slave devices. The overall goal of this work is that XDP programs can be attached to a bond device *without* any further changes (or awareness) necessary to the program itself, meaning the same XDP program can be attached to a native device but also a bonding device. Semantics of XDP_TX when attached to a bond are equivalent in such setting to the case when a tc/BPF program would be attached to the bond, meaning transmitting the packet out of the bond itself using one of the bond's configured xmit methods to select a slave device (rather than XDP_TX on the slave itself). Handling of XDP_TX to transmit using the configured bonding mechanism is therefore implemented by rewriting the BPF program return value in bpf_prog_run_xdp. To avoid performance impact this check is guarded by a static key, which is incremented when a XDP program is loaded onto a bond device. This approach was chosen to avoid changes to drivers implementing XDP. If the slave device does not match the receive device, then XDP_REDIRECT is transparently used to perform the redirection in order to have the network driver release the packet from its RX ring. The bonding driver hashing functions have been refactored to allow reuse with xdp_buff's to avoid code duplication. The motivation for this change is to enable use of bonding (and 802.3ad) in hairpinning L4 load-balancers such as [1] implemented with XDP and also to transparently support bond devices for projects that use XDP given most modern NICs have dual port adapters. An alternative to this approach would be to implement 802.3ad in user-space and implement the bonding load-balancing in the XDP program itself, but is rather a cumbersome endeavor in terms of slave device management (e.g. by watching netlink) and requires separate programs for native vs bond cases for the orchestrator. A native in-kernel implementation overcomes these issues and provides more flexibility. Below are benchmark results done on two machines with 100Gbit Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and 16-core 3950X on receiving machine. 64 byte packets were sent with pktgen-dpdk at full rate. Two issues [2, 3] were identified with the ice driver, so the tests were performed with iommu=off and patch [2] applied. Additionally the bonding round robin algorithm was modified to use per-cpu tx counters as high CPU load (50% vs 10%) and high rate of cache misses were caused by the shared rr_tx_counter (see patch 2/3). The statistics were collected using "sar -n dev -u 1 10". On top of that, for ice, further work is in progress on improving the XDP_TX numbers [4]. -----------------------| CPU |--| rxpck/s |--| txpck/s |---- without patch (1 dev): XDP_DROP: 3.15% 48.6Mpps XDP_TX: 3.12% 18.3Mpps 18.3Mpps XDP_DROP (RSS): 9.47% 116.5Mpps XDP_TX (RSS): 9.67% 25.3Mpps 24.2Mpps ----------------------- with patch, bond (1 dev): XDP_DROP: 3.14% 46.7Mpps XDP_TX: 3.15% 13.9Mpps 13.9Mpps XDP_DROP (RSS): 10.33% 117.2Mpps XDP_TX (RSS): 10.64% 25.1Mpps 24.0Mpps ----------------------- with patch, bond (2 devs): XDP_DROP: 6.27% 92.7Mpps XDP_TX: 6.26% 17.6Mpps 17.5Mpps XDP_DROP (RSS): 11.38% 117.2Mpps XDP_TX (RSS): 14.30% 28.7Mpps 27.4Mpps -------------------------------------------------------------- RSS: Receive Side Scaling, e.g. the packets were sent to a range of destination IPs. [1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb [2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t [3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/ [4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#t Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Cc: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com
2021-08-09net, core: Add support for XDP redirection to slave deviceJussi Maki
This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX into XDP_REDIRECT after BPF program run when the ingress device is a bond slave. The dev_xdp_prog_count is exposed so that slave devices can be checked for loaded XDP programs in order to avoid the situation where both bond master and slave have programs loaded according to xdp_state. Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com