Age | Commit message (Collapse) | Author |
|
ASO query can be scheduled in atomic context as such it can't use usleep.
Use udelay as recommended in Documentation/timers/timers-howto.rst.
Fixes: 76e463f6508b ("net/mlx5e: Overcome slow response for first IPsec ASO WQE")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
XFRM state which is changed to be XFRM_STATE_EXPIRED doesn't really
need to hold lock while modifying flow steering rules to drop traffic.
That state can be deleted only and as such mlx5e_ipsec_handle_tx_limit()
work will be canceled anyway and won't run in parallel.
Fixes: b2f7b01d36a9 ("net/mlx5e: Simulate missing IPsec TX limits hardware functionality")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Previously during mlx5e_ipsec_handle_event the driver tried to execute
an operation that could sleep, while holding a spinlock, which caused
the kernel panic mentioned below.
Move the function call that can sleep outside of the spinlock context.
Call Trace:
<TASK>
dump_stack_lvl+0x49/0x6c
__schedule_bug.cold+0x42/0x4e
schedule_debug.constprop.0+0xe0/0x118
__schedule+0x59/0x58a
? __mod_timer+0x2a1/0x3ef
schedule+0x5e/0xd4
schedule_timeout+0x99/0x164
? __pfx_process_timeout+0x10/0x10
__wait_for_common+0x90/0x1da
? __pfx_schedule_timeout+0x10/0x10
wait_func+0x34/0x142 [mlx5_core]
mlx5_cmd_invoke+0x1f3/0x313 [mlx5_core]
cmd_exec+0x1fe/0x325 [mlx5_core]
mlx5_cmd_do+0x22/0x50 [mlx5_core]
mlx5_cmd_exec+0x1c/0x40 [mlx5_core]
mlx5_modify_ipsec_obj+0xb2/0x17f [mlx5_core]
mlx5e_ipsec_update_esn_state+0x69/0xf0 [mlx5_core]
? wake_affine+0x62/0x1f8
mlx5e_ipsec_handle_event+0xb1/0xc0 [mlx5_core]
process_one_work+0x1e2/0x3e6
? __pfx_worker_thread+0x10/0x10
worker_thread+0x54/0x3ad
? __pfx_worker_thread+0x10/0x10
kthread+0xda/0x101
? __pfx_kthread+0x10/0x10
ret_from_fork+0x29/0x37
</TASK>
BUG: workqueue leaked lock or atomic: kworker/u256:4/0x7fffffff/189754#012 last function: mlx5e_ipsec_handle_event [mlx5_core]
CPU: 66 PID: 189754 Comm: kworker/u256:4 Kdump: loaded Tainted: G W 6.2.0-2596.20230309201517_5.el8uek.rc1.x86_64 #2
Hardware name: Oracle Corporation ORACLE SERVER X9-2/ASMMBX9-2, BIOS 61070300 08/17/2022
Workqueue: mlx5e_ipsec: eth%d mlx5e_ipsec_handle_event [mlx5_core]
Call Trace:
<TASK>
dump_stack_lvl+0x49/0x6c
process_one_work.cold+0x2b/0x3c
? __pfx_worker_thread+0x10/0x10
worker_thread+0x54/0x3ad
? __pfx_worker_thread+0x10/0x10
kthread+0xda/0x101
? __pfx_kthread+0x10/0x10
ret_from_fork+0x29/0x37
</TASK>
BUG: scheduling while atomic: kworker/u256:4/189754/0x00000000
Fixes: cee137a63431 ("net/mlx5e: Handle ESN update events")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
XFRM core provides two callbacks to release resources, one is .xdo_dev_policy_delete()
and another is .xdo_dev_policy_free(). This separation allows delayed release so
"ip xfrm policy free" commands won't starve. Unfortunately, mlx5 command interface
can't run in .xdo_dev_policy_free() callbacks as the latter runs in ATOMIC context.
BUG: scheduling while atomic: swapper/7/0/0x00000100
Modules linked in: act_mirred act_tunnel_key cls_flower sch_ingress vxlan mlx5_vdpa vringh vhost_iotlb vdpa rpcrdma rdma_ucm ib_iser libiscsi ib_umad scsi_transport_iscsi rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay mlx5_core zram zsmalloc fuse
CPU: 7 PID: 0 Comm: swapper/7 Not tainted 6.3.0+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl+0x33/0x50
__schedule_bug+0x4e/0x60
__schedule+0x5d5/0x780
? __mod_timer+0x286/0x3d0
schedule+0x50/0x90
schedule_timeout+0x7c/0xf0
? __bpf_trace_tick_stop+0x10/0x10
__wait_for_common+0x88/0x190
? usleep_range_state+0x90/0x90
cmd_exec+0x42e/0xb40 [mlx5_core]
mlx5_cmd_do+0x1e/0x40 [mlx5_core]
mlx5_cmd_exec+0x18/0x30 [mlx5_core]
mlx5_cmd_delete_fte+0xa8/0xd0 [mlx5_core]
del_hw_fte+0x60/0x120 [mlx5_core]
mlx5_del_flow_rules+0xec/0x270 [mlx5_core]
? default_send_IPI_single_phys+0x26/0x30
mlx5e_accel_ipsec_fs_del_pol+0x1a/0x60 [mlx5_core]
mlx5e_xfrm_free_policy+0x15/0x20 [mlx5_core]
xfrm_policy_destroy+0x5a/0xb0
xfrm4_dst_destroy+0x7b/0x100
dst_destroy+0x37/0x120
rcu_core+0x2d6/0x540
__do_softirq+0xcd/0x273
irq_exit_rcu+0x82/0xb0
sysvec_apic_timer_interrupt+0x72/0x90
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x16/0x20
RIP: 0010:default_idle+0x13/0x20
Code: c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 72 ff ff ff cc cc cc cc 8b 05 7a 4d ee 00 85 c0 7e 07 0f 00 2d 2f 98 2e 00 fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 65 48 8b 04 25 40 b4 02 00
RSP: 0018:ffff888100843ee0 EFLAGS: 00000242
RAX: 0000000000000001 RBX: ffff888100812b00 RCX: 4000000000000000
RDX: 0000000000000001 RSI: 0000000000000083 RDI: 000000000002d2ec
RBP: 0000000000000007 R08: 00000021daeded59 R09: 0000000000000001
R10: 0000000000000000 R11: 000000000000000f R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
default_idle_call+0x30/0xb0
do_idle+0x1c1/0x1d0
cpu_startup_entry+0x19/0x20
start_secondary+0xfe/0x120
secondary_startup_64_no_verify+0xf3/0xfb
</TASK>
bad: scheduling from the idle thread!
Fixes: a5b8ca9471d3 ("net/mlx5e: Add XFRM policy offload logic")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The kernel IRQ system needs the irq affinity notifier to be clear
before attempting to free the irq, see WARN_ON log below.
On a normal driver unload we don't have this issue since we do the
complete cleanup of the irq resources.
To fix this, put the important resources cleanup in a helper function
and use it in both normal driver unload and shutdown flows.
[ 4497.498434] ------------[ cut here ]------------
[ 4497.498726] WARNING: CPU: 0 PID: 9 at kernel/irq/manage.c:2034 free_irq+0x295/0x340
[ 4497.499193] Modules linked in:
[ 4497.499386] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W 6.4.0-rc4+ #10
[ 4497.499876] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 4497.500518] Workqueue: events do_poweroff
[ 4497.500849] RIP: 0010:free_irq+0x295/0x340
[ 4497.501132] Code: 85 c0 0f 84 1d ff ff ff 48 89 ef ff d0 0f 1f 00 e9 10 ff ff ff 0f 0b e9 72 ff ff ff 49 8d 7f 28 ff d0 0f 1f 00 e9 df fd ff ff <0f> 0b 48 c7 80 c0 008
[ 4497.502269] RSP: 0018:ffffc90000053da0 EFLAGS: 00010282
[ 4497.502589] RAX: ffff888100949600 RBX: ffff88810330b948 RCX: 0000000000000000
[ 4497.503035] RDX: ffff888100949600 RSI: ffff888100400490 RDI: 0000000000000023
[ 4497.503472] RBP: ffff88810330c7e0 R08: ffff8881004005d0 R09: ffffffff8273a260
[ 4497.503923] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881009ae000
[ 4497.504359] R13: ffff8881009ae148 R14: 0000000000000000 R15: ffff888100949600
[ 4497.504804] FS: 0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[ 4497.505302] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4497.505671] CR2: 00007fce98806298 CR3: 000000000262e005 CR4: 0000000000370ef0
[ 4497.506104] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4497.506540] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4497.507002] Call Trace:
[ 4497.507158] <TASK>
[ 4497.507299] ? free_irq+0x295/0x340
[ 4497.507522] ? __warn+0x7c/0x130
[ 4497.507740] ? free_irq+0x295/0x340
[ 4497.507963] ? report_bug+0x171/0x1a0
[ 4497.508197] ? handle_bug+0x3c/0x70
[ 4497.508417] ? exc_invalid_op+0x17/0x70
[ 4497.508662] ? asm_exc_invalid_op+0x1a/0x20
[ 4497.508926] ? free_irq+0x295/0x340
[ 4497.509146] mlx5_irq_pool_free_irqs+0x48/0x90
[ 4497.509421] mlx5_irq_table_free_irqs+0x38/0x50
[ 4497.509714] mlx5_core_eq_free_irqs+0x27/0x40
[ 4497.509984] shutdown+0x7b/0x100
[ 4497.510184] pci_device_shutdown+0x30/0x60
[ 4497.510440] device_shutdown+0x14d/0x240
[ 4497.510698] kernel_power_off+0x30/0x70
[ 4497.510938] process_one_work+0x1e6/0x3e0
[ 4497.511183] worker_thread+0x49/0x3b0
[ 4497.511407] ? __pfx_worker_thread+0x10/0x10
[ 4497.511679] kthread+0xe0/0x110
[ 4497.511879] ? __pfx_kthread+0x10/0x10
[ 4497.512114] ret_from_fork+0x29/0x50
[ 4497.512342] </TASK>
Fixes: 9c2d08010963 ("net/mlx5: Free irqs only on shutdown callback")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
|
|
When TUNNEL_L3_TO_L2 decap action was created, a pointer to a local
variable was passed as its HW action data, resulting in attempt to
free invalid address:
BUG: KASAN: invalid-free in mlx5dr_action_destroy+0x318/0x410 [mlx5_core]
Fixes: 4781df92f4da ("net/mlx5: DR, Move STEv0 modify header logic")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
In some cases, steering might need to use SW-created action in
FW table, which results in wrong packet reformat being used:
mlx5_core 0000:81:00.1: mlx5_cmd_check:756:(pid 1154):
SET_FLOW_TABLE_ENTRY(0×936) op_mod(0×0) failed,
status bad resource(0×5), syndrome (0xf2ff71)
This patch adds support for usage of SW-created packet reformat (encap)
actions in FW tables, and adds clear error flow for attempt to use
SW-created modify header on FW tables.
Fixes: 6a48faeeca10 ("net/mlx5: Add direct rule fs_cmd implementation")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The cited commit removes special handling of CT action. But it
removes too much. Pre ct/ct_nat tables and some other resources
are not destroyed due to the cited commit.
Fix it by adding it back.
Fixes: 08fe94ec5f77 ("net/mlx5e: TC, Remove special handling of CT action")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The cited commits add hardware miss support to tc action. But if
the rules can't be offloaded, the pointers are null and system
will panic when accessing them.
Fix it by checking null pointer.
Fixes: 08fe94ec5f77 ("net/mlx5e: TC, Remove special handling of CT action")
Fixes: 6702782845a5 ("net/mlx5e: TC, Set CT miss to the specific ct action instance")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
When a PCI device has just one msix vector available, we want to share
this vector between async and completion events. Current code fails to
do that assuming it will always have at least one dedicated vector for
completion events. Fix this by detecting when the pool contains just a
single vector.
Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The cited commit missed setting napi_id on XSK RQs, it only affected
regular RQs. Add the missing part to support socket busy polling on XSK
RQs.
Fixes: a2740f529da2 ("net/mlx5e: xsk: Set napi_id to support busy polling")
Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The cited commits missed passing frag_size to __xdp_rxq_info_reg, which
is required by bpf_xdp_adjust_tail to support growing the tail pointer
in fragmented packets. Pass the missing parameter when the current RQ
mode allows XDP multi buffer.
Fixes: ea5d49bdae8b ("net/mlx5e: Add XDP multi buffer support to the non-linear legacy RQ")
Fixes: 9cb9482ef10e ("net/mlx5e: Use fragments of the same size in non-linear legacy RQ with XDP")
Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Cc: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two fixes for NOCOW files, a regression fix in scrub and an assertion
fix:
- NOCOW fixes:
- keep length of iomap direct io request in case of a failure
- properly pass mode of extent reference checking, this can break
some cases for swapfile
- fix error value confusion when scrubbing a stripe
- convert assertion to a proper error handling when loading global
roots, reported by syzbot"
* tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: scrub: fix a return value overwrite in scrub_stripe()
btrfs: do not ASSERT() on duplicated global roots
btrfs: can_nocow_file_extent should pass down args->strict from callers
btrfs: fix iomap_begin length for nocow writes
|
|
Pull block fix from Jens Axboe:
"Just a single fix for blk-cg stats flushing"
* tag 'block-6.4-2023-06-15' of git://git.kernel.dk/linux:
blk-cgroup: Flush stats before releasing blkcg_gq
|
|
Pull io_uring fixes from Jens Axboe:
"A fix for sendmsg with CMSG, and the followup fix discussed for
avoiding touching task->worker_private after the worker has started
exiting"
* tag 'io_uring-6.4-2023-06-15' of git://git.kernel.dk/linux:
io_uring/io-wq: clear current->worker_private on exit
io_uring/net: save msghdr->msg_control for retries
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Just a few small fixes. The only change to the core code is for a
minor race in ALSA OSS sequencer, and the rest are all device-specific
fixes (regression fixes and a usual quirk)"
* tag 'sound-6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Add quirk flag for HEM devices to enable native DSD playback
ALSA: usb-audio: Fix broken resume due to UAC3 power state
ALSA: seq: oss: Fix racy open/close of MIDI devices
ASoC: tegra: Fix Master Volume Control
ALSA: hda/realtek: Add a quirk for Compaq N14JP6
firmware: cs_dsp: Log correct region name in bin error messages
|
|
"ecpu" field in struct mlx5_sf_table is not used anywhere. Remove it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Separate the event API defined in the generic mlx5.h header into
a dedicated header. And remove the TODO comment in commit
69c1280b1f3b ("net/mlx5: Device events, Use async events chain").
Signed-off-by: Juhee Kang <claudiajkang@gmail.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
This change is needed to use EC VFs with metadata based steering.
There was an assumption that vport was equal to function ID. That's
not the case for EC VF functions. Adjust to function ID and set the
ec_vf_function bit accordingly.
Fixes: 9ac0b128248e ("net/mlx5: Update vport caps query/set for EC VFs")
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The last value is not set correctly. This results in representors not
being created for all EC VFs when the base value is higher than 0.
Fixes: a7719b29a821 ("net/mlx5: Add management of EC VF vports")
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add counter for number of unicast, multicast and broadcast packets/
octets that were loopback.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add needed HW bits for querying local loopback counter and the
HCA capability for it.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The msglvl support was implemented using the mlx5e_dbg() macro which is
rarely used in the driver, and is not very useful when you can just use
dynamic debug instead.
Remove mlx5e_dbg() and convert its usages to netdev_dbg().
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
These else statement blocks are redundant since the if block already
jumps to the function abort label.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
|
|
For debugging purposes expose offloaded FDB state (flags, counters, etc.)
via debugfs inside 'esw' root directory. Example debugfs file output:
$ cat mlx5/0000\:08\:00.0/esw/bridge/bridge1/fdb
DEV MAC VLAN PACKETS BYTES LASTUSE FLAGS
enp8s0f0_1 e4:0a:05:08:00:06 2 2 204 4295567112 0x0
enp8s0f0_0 e4:0a:05:08:00:03 2 3 278 4295567112 0x0
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Following patch requires access to additional data in bridge net_device.
Pass the whole structure down the stack instead of adding necessary fields
as function arguments one-by-one.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Following patch in series uses the new directory for bridge FDB debugfs.
The new directory is intended for all future eswitch-specific debugfs
files.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Added a new event handler to firmware sync reset, which is used to
support firmware sync reset flow on smart NIC. Adding this new stage to
the flow enables the firmware to ensure host PFs unload before ECPFs
unload, to avoid race of PFs recovery.
If firmware sends sync_reset_unload event to driver the driver should
unload and close all HW resources of the function. Once the driver
finishes unloading part, it can't get any more events from firmware as
event queues are closed, so it polls the reset state field to know when
to continue to next stage of the sync reset flow.
Added capability bit for supporting sync_reset_unload event.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The Default Timeout Register (DTOR) provides timeout values to driver
for flows that are device dependent. Zero value for DTOR entry is not
valid and should not be used. In case of reading zero value from DTOR,
the driver should use the hard coded SW default value instead.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Expose new timoueout in Default Timeouts Register to be used on sync
reset flow running on smart NIC. In this flow the driver should know how
much time to wait from getting unload request till firmware will ask the
PF to continue to next stage of the flow.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Verify at reset_request stage that PF is capable to do reset_now. In
case PF is not capable, notify the firmware that the sync reset can not
happen and so firmware will abort the sync reset at early stage and will
not send reset_now event to any PF.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
KPTI keeps around two PGDs: one for userspace and another for the
kernel. Among other things, set_pgd() contains infrastructure to
ensure that updates to the kernel PGD are reflected in the user PGD
as well.
One side-effect of this is that set_pgd() expects to be passed whole
pages. Unfortunately, init_trampoline_kaslr() passes in a single entry:
'trampoline_pgd_entry'.
When KPTI is on, set_pgd() will update 'trampoline_pgd_entry' (an
8-Byte globally stored [.bss] variable) and will then proceed to
replicate that value into the non-existent neighboring user page
(located +4k away), leading to the corruption of other global [.bss]
stored variables.
Fix it by directly assigning 'trampoline_pgd_entry' and avoiding
set_pgd().
[ dhansen: tweak subject and changelog ]
Fixes: 0925dda5962e ("x86/mm/KASLR: Use only one PUD entry for real mode trampoline")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lee Jones <lee@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/all/20230614163859.924309-1-lee@kernel.org/g
|
|
The tick period is aligned very early while the first clock_event_device is
registered. At that point the system runs in periodic mode and switches
later to one-shot mode if possible.
The next wake-up event is programmed based on the aligned value
(tick_next_period) but the delta value, that is used to program the
clock_event_device, is computed based on ktime_get().
With the subtracted offset, the device fires earlier than the exact time
frame. With a large enough offset the system programs the timer for the
next wake-up and the remaining time left is too small to make any boot
progress. The system hangs.
Move the alignment later to the setup of tick_sched timer. At this point
the system switches to oneshot mode and a high resolution clocksource is
available. At this point it is safe to align tick_next_period because
ktime_get() will now return accurate (not jiffies based) time.
[bigeasy: Patch description + testing].
Fixes: e9523a0d81899 ("tick/common: Align tick period with the HZ tick.")
Reported-by: Mathias Krause <minipli@grsecurity.net>
Reported-by: "Bhatnagar, Rishabh" <risbhat@amazon.com>
Suggested-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Acked-by: SeongJae Park <sj@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/5a56290d-806e-b9a5-f37c-f21958b5a8c0@grsecurity.net
Link: https://lore.kernel.org/12c6f9a3-d087-b824-0d05-0d18c9bc1bf3@amazon.com
Link: https://lore.kernel.org/r/20230615091830.RxMV2xf_@linutronix.de
|
|
Splicing to SOCK_RAW sockets may set MSG_SPLICE_PAGES, but in such a case,
__ip_append_data() will call skb_splice_from_iter() to access the 'from'
data, assuming it to point to a msghdr struct with an iter, instead of
using the provided getfrag function to access it.
In the case of raw_sendmsg(), however, this is not the case and 'from' will
point to a raw_frag_vec struct and raw_getfrag() will be the frag-getting
function. A similar issue may occur with rawv6_sendmsg().
Fix this by ignoring MSG_SPLICE_PAGES if getfrag != ip_generic_getfrag as
ip_generic_getfrag() expects "from" to be a msghdr*, but the other getfrags
don't. Note that this will prevent MSG_SPLICE_PAGES from being effective
for udplite.
This likely affects ping sockets too. udplite looks like it should be okay
as it expects "from" to be a msghdr.
Signed-off-by: David Howells <dhowells@redhat.com>
Reported-by: syzbot+d8486855ef44506fd675@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/000000000000ae4cbf05fdeb8349@google.com/
Fixes: 2dc334f1a63a ("splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()")
Tested-by: syzbot+d8486855ef44506fd675@syzkaller.appspotmail.com
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1410156.1686729856@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fix from Paul McKenney:
"This fixes a spinlock-initialization regression in SRCU that causes
the SRCU notifier to fail.
The fix simply adds the initialization, but introduces a #ifdef
because there is no spinlock to initialize for the Tiny SRCU used in
!SMP builds.
Yes, it would be nice to abstract this somehow in order to hide it in
SRCU, but I still don't see a good way of doing this"
* tag 'urgent-rcu.2023.06.11a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
notifier: Initialize new struct srcu_usage field
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fix from Palmer Dabbelt:
- A documentation patch describing how we use patchwork
* tag 'riscv-for-linus-6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
Documentation: RISC-V: patch-acceptance: mention patchwork's role
|
|
Drop struct acpi_thermal_flags which is not really used (only one
flag in it is ever set, but it is never read) and call
acpi_execute_simple_method() directly to evaluate _SCP instead of
using acpi_thermal_set_cooling_mode(), which has no callers after
that change, so drop it.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
Drop struct acpi_thermal_state which is not really used.
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
Because the only drivers that cared about button fixed events take care
of those events by themselves now, eliminate the code related to them
from acpi_device_install_notify_handler() and
acpi_device_remove_notify_handler().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Rework the ACPI tiny-power-button driver to install a notify handler or
a fixed event handler for the device it binds to by itself and drop its
notify callback.
This will allow acpi_device_install_notify_handler() and
acpi_device_remove_notify_handler() to be simplified going forward.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Since the lid handling in acpi_button_notify() is special, introduce
acpi_lid_notify() specifically for handling lid notifications.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Rework the ACPI button driver to install notify handlers or fixed
event handlers for the devices it binds to by itself, reduce the
indentation level in its notify handler routine and drop its
notify callback.
This will allow acpi_device_install_notify_handler() and
acpi_device_remove_notify_handler() to be simplified going forward
and it will allow the driver to use different notify handlers for the
lid and for the power and sleep buttons.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Michal Wilczynski <michal.wilczynski@intel.com>
|
|
Ensure there is no path where we might attempt to save SME state after we
flush a task by updating the SVCR register state as well as updating our
in memory state. I haven't seen a specific case where this is happening or
seen a path where it might happen but for the cost of a single low overhead
instruction it seems sensible to close the potential gap.
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20230607-arm64-flush-svcr-v2-1-827306001841@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Commit f38d1a6d0025 ("PM: domains: Allocate governor data dynamically
based on a genpd governor") started to use the in-parameters in
genpd_add_device(), without first doing a verification of them.
This isn't really a big problem, as most callers do a verification already.
Therefore, let's drop the verification from genpd_add_device() and make
sure all the callers take care of it instead.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: f38d1a6d0025 ("PM: domains: Allocate governor data dynamically based on a genpd governor")
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
amd-pstate passive mode driver is hyphenated. So make amd-pstate active
mode driver consistent with that rename "amd_pstate_epp" to
"amd-pstate-epp".
Fixes: ffa5096a7c33 ("cpufreq: amd-pstate: implement Pstate EPP support for the AMD processors")
Cc: All applicable <stable@vger.kernel.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Wyes Karny <wyes.karny@amd.com>
Acked-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Perry Yuan <Perry.Yuan@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Currently amd_pstate sets CPPC enable bit in MSR_AMD_CPPC_ENABLE only
for the CPU where the module_init happened. But MSR_AMD_CPPC_ENABLE is
per-socket. This causes CPPC enable bit to set for only one socket for
servers with more than one physical packages. To fix this write
MSR_AMD_CPPC_ENABLE per-socket.
Also, handle duplicate calls for cppc_enable, because it's called from
per-policy/per-core callbacks and can result in duplicate MSR writes.
Before the fix:
amd@amd:~$ sudo rdmsr -a 0xc00102b1 | uniq --count
192 0
192 1
After the fix:
amd@amd:~$ sudo rdmsr -a 0xc00102b1 | uniq --count
384 1
Suggested-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Wyes Karny <wyes.karny@amd.com>
Acked-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
In a typical VM guest, the mwait instruction is not available, leaving
only the 'hlt' instruction (which causes a VMEXIT to the host).
So for this common case, intel_idle will detect the lack of mwait, and
fail to initialize (after which another idle method would step in which
will just use hlt always).
Other (non-common) cases exist; the table below shows the before/after
for these:
+------------+--------------------------+-------------------------+
| Hypervisor | Idle method before patch | Idle method after patch |
| exposes | | |
+============+==========================+=========================+
| nothing | default_idle fallback | intel_idle VM table |
| (common) | (straight "hlt") | |
+------------+--------------------------+-------------------------+
| mwait | intel_idle mwait table | intel_idle mwait table |
+------------+--------------------------+-------------------------+
| ACPI | ACPI C1 state ("hlt") | intel_idle VM table |
+------------+--------------------------+-------------------------+
This is only applicable to CPUs known by intel_idle. For the bare metal
case, unknown CPU models will use the ACPI tables (when available) to
get estimates for latency and break even point for longer idle states.
In guests, the common case is that ACPI tables are not available, but
even when they are available, they can't and don't provide the latency
information for the longer (mwait based) states. For this scenario
(unknown CPU model), the default_idle mode (no ACPI) or ACPI C1 (ACPI
avaible) will be used.
By providing capability to do this with the intel_idle driver, we can
do better than the fallback or ACPI table methods. While this current
change only gets us to the existing behavior, later patches in this
series will add new capabilities such as optimized TLB flushing.
In order to do this, a simplified version of the initialization
function for VM guests is created, and this will be called if the CPU
is recognized, but mwait is not supported, and we're in a VM guest.
One thing to note is that the max latency (and break even) of this C1
state is higher than the typical bare metal C1 state. Because hlt causes
a vmexit, and the cost of vmexit + hypervisor overhead + vmenter is
typically in the order of upto 5 microseconds... even if the hypervisor
does not actually goes into a hardware power saving state.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
[ rjw: Dropped redundant checks from should_verify_mwait() ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
invoked within softirq context. By setting rcutree.use_softirq=0 boot
option the RCU callbacks will be invoked in a per-CPU kthread with
bottom halves disabled which implies a RCU read section.
On PREEMPT_RT the context remains fully preemptible. The RCU read
section however does not allow schedule() invocation. The latter happens
in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
from bpf_link_put().
It was pointed out that the bpf_link_put() invocation should not be
delayed if originated from close(). It was also pointed out that other
invocations from within a syscall should also avoid the workqueue.
Everyone else should use workqueue by default to remain safe in the
future (while auditing the code, every caller was preemptible except for
the RCU case).
Let bpf_link_put() use the worker unconditionally. Add
bpf_link_put_direct() which will directly free the resources and is used
by close() and from within __sys_bpf().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230614083430.oENawF8f@linutronix.de
|
|
This is an existing optional property that ieee80211.yaml/cfg80211
provides. It's useful to further restrict supported frequencies
for a specified device through device-tree.
For testing the addition, I added the ieee80211-freq-limit
property with values from an OpenMesh A62 device. This is
because the OpenMesh A62 has "special filters in front of
the RX+TX paths to the 5GHz PHYs. These filtered channel
can in theory still be used by the hardware but the signal
strength is reduced so much that it makes no sense."
The driver supported this since ~2018 by
commit 34d5629d2ca8 ("ath10k: limit available channels via DT ieee80211-freq-limit")
Link: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=e3b8ae2b09e137ce2eae33551923daf302293a0c
Signed-off-by: Christian Lamparter <chunkeey@gmail.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com>
Link: https://lore.kernel.org/r/c33c928b7c6c9bb4e7abe84eb8df9f440add275b.1686486464.git.chunkeey@gmail.com
|
|
After grabbing q->sysfs_lock, q->elevator may become NULL because of
elevator switch.
Fix the NULL dereference on q->elevator by checking it with lock.
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230616132354.415109-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|