Age | Commit message (Collapse) | Author |
|
Refactor __set_ptes(), set_pmd_at() and set_pud_at() so that they are
all a thin wrapper around a new common __set_ptes_anysz(), which takes
pgsize parameter. Additionally, refactor __ptep_get_and_clear() and
pmdp_huge_get_and_clear() to use a new common
__ptep_get_and_clear_anysz() which also takes a pgsize parameter.
These changes will permit the huge_pte API to efficiently batch-set
pgtable entries and take advantage of the future barrier optimizations.
Additionally since the new *_anysz() helpers call the correct
page_table_check_*_set() API based on pgsize, this means that huge_ptes
will be able to get proper coverage. Currently the huge_pte API always
uses the pte API which assumes an entry only covers a single page.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-5-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Convert page_table_check_p[mu]d_set(...) to
page_table_check_p[mu]ds_set(..., nr) to allow checking a contiguous set
of pmds/puds in single batch. We retain page_table_check_p[mu]d_set(...)
as macros that call new batch functions with nr=1 for compatibility.
arm64 is about to reorganise its pte/pmd/pud helpers to reuse more code
and to allow the implementation for huge_pte to more efficiently set
ptes/pmds/puds in batches. We need these batch-helpers to make the
refactoring possible.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-4-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When operating on contiguous blocks of ptes (or pmds) for some hugetlb
sizes, we must honour break-before-make requirements and clear down the
block to invalid state in the pgtable then invalidate the relevant tlb
entries before making the pgtable entries valid again.
However, the tlb maintenance is currently always done assuming the worst
case stride (PAGE_SIZE), last_level (false) and tlb_level
(TLBI_TTL_UNKNOWN). We can do much better with the hinting; In reality,
we know the stride from the huge_pte pgsize, we are always operating
only on the last level, and we always know the tlb_level, again based on
pgsize. So let's start providing these hints.
Additionally, avoid tlb maintenace in set_huge_pte_at().
Break-before-make is only required if we are transitioning the
contiguous pte block from valid -> valid. So let's elide the
clear-and-flush ("break") if the pte range was previously invalid.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-3-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Not all huge_pte helper APIs explicitly provide the size of the
huge_pte. So the helpers have to depend on various methods to determine
the size of the huge_pte. Some of these methods are dubious.
Let's clean up the code to use preferred methods and retire the dubious
ones. The options in order of preference:
- If size is provided as parameter, use it together with
num_contig_ptes(). This is explicit and works for both present and
non-present ptes.
- If vma is provided as a parameter, retrieve size via
huge_page_size(hstate_vma(vma)) and use it together with
num_contig_ptes(). This is explicit and works for both present and
non-present ptes.
- If the pte is present and contiguous, use find_num_contig() to walk
the pgtable to find the level and infer the number of ptes from
level. Only works for *present* ptes.
- If the pte is present and not contiguous and you can infer from this
that only 1 pte needs to be operated on. This is ok if you don't care
about the absolute size, and just want to know the number of ptes.
- NEVER rely on resolving the PFN of a present pte to a folio and
getting the folio's size. This is fragile at best, because there is
nothing to stop the core-mm from allocating a folio twice as big as
the huge_pte then mapping it across 2 consecutive huge_ptes. Or just
partially mapping it.
Where we require that the pte is present, add warnings if not-present.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Luiz Capitulino <luizcap@redhat.com>
Link: https://lore.kernel.org/r/20250422081822.1836315-2-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
This value is never read, so it's not needed. Remove it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-16-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
There are a number of MAC parameters that are in the iwl_cfg
(which is the last config matched to the MAC/RF combination).
This isn't necessary, there are many more of those than MACs,
so move (most of) the data into the MAC family config struct.
Note that DCCM information remains for use by older devices,
and on 9000 series it'll be in struct iwl_cfg but be ignored
when the CRF is in a Qu/So platform.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-15-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
This is only used with old-style debug dump, which isn't
supported on newer devices, so remove the data.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-14-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Since 22000 series, the data is read by the firmware and the
driver doesn't need to know, remove the useless setting.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-13-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
These are (going to be) base MAC parameters that are identical
even for different platforms with the same MAC, so rename the
structure accordingly, calling it iwl_family_base_params.
Also rename the pointer to it so the dereferencing is a bit
shorter.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-12-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
This field is always set for >= 9000 series, but then we
already check that, so it's not needed. Remove it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-11-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
This field is unused, remove it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-10-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Since 9000 series devices, the devices are split into MAC and
CRF parts. Currently, "struct iwl_cfg" reflects some MAC and
some RF parameters, but we want to clean this up and move the
MAC data to what's now "struct iwl_cfg_trans_params". As the
first step, to reflect the intent, rename this structure.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-9-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
There's no need to pass various different pointers when
the transport is already established, so just pass that
instead.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-8-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
This value hasn't been used since unified firmware in 22000
series, so there's no need to set the value for that or
newer devices. Remove it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-7-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Instead of using fw_name_pre, handle the cc firmware file
name specially in iwl_drv_get_fwname_pre() for the cc MAC
type.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-6-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Add support for the TY MAC (discrete device) and GF4 RF to
the list of MAC/RF types, and use that to remove fw_name_pre
for the ax210 family devices.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-5-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
This is never used, so remove it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-4-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Since JF RF always uses b0 step and QuZ MAC always uses
a0 step for firmware, there's no need for these configs
that just force the steps to those values. Remove them.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-3-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
This is more maintainable than the fw_name_pre prefix, and
requires fewer duplicate structs as well. Since only b0 FW
exists, override the MAC/RF steps for these devices.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250508121306.1277801-2-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
This is needed, otherwise we don't know what to do and will
not find the correct firmware.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-16-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
These are test chips, not available to users, and not even listed
in the PCI IDs table (so the driver won't bind them). There's no
reason to list specific devices with them in the driver.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-15-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
With just a handful of values in two bytes, the params are
smaller than the pointer to them. Inline them and save some
space.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-14-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Since there's no HT on 6 GHz, only HE, the HT capabilities
are never initialized, and so the ht40_bands value is never
checked for the 6 GHz band. Remove the misleading value.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-13-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
The driver must not hold the wiphy mutex when unregistering the thermal
devices. Do not hold the lock for the call to iwl_mld_thermal_exit and
only do a lock/unlock to cancel the ct_kill_exit_wk work.
The problem is that iwl_mld_tzone_get_temp needs to take the wiphy lock
while the thermal code is holding its own locks already. When
unregistering the device, the reverse would happen as the driver was
calling thermal_cooling_device_unregister with the wiphy mutex already
held.
It is not likely to trigger this deadlock as it can only happen if the
thermal code is polling the temperature while the driver is being
unloaded. However, lockdep reported it so fix it.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-12-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
rx_omi::finished_work is initialized when the containing link is.
If the worker was queued and then an error happened, we will get to
iwl_mld_init_link from the reconfig and initialize the work after it was
queued.
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-11-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Different hardware has a different maximum power consumption and the
BIOS can also define a power limit for the device. Add code to select
an appropriate maximum power budget for the device and configure that
instead of using a hardcoded table.
This removes the old table. It does not work with the variable upper
limit and the there should be no consumer that requires these exact step
values.
This considerably increases the power budget for some devices and can
prevent throttling in high traffic situations.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-10-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Different hardware has a different maximum power consumption and the
BIOS can also define a power limit for the device. Add code to select
an appropriate maximum power budget for the device and configure that
instead of using a hardcoded table.
This removes the old table. It does not work with the variable upper
limit and the there should be no consumer that requires these exact step
values.
This considerably increases the power budget for some devices and can
prevent throttling in high traffic situations.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-9-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
The compare_temps function in both mvm and mld dropped the const
qualifier in a cast in a way that makes -Werror=cast-qual unhappy. Add
the const to the cast to fix this.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-8-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
This function is not implemented nor called. Remove its declaration.
Link: https://patch.msgid.link/20250506194102.3407967-7-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Add support for a new version of link configuration command
which includes NPCA and high priority TX traffic support for wifi8.
Signed-off-by: Yedidya Benshimol <yedidya.ben.shimol@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-6-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Add a new version of mac configuration command
which includes UHR support indication.
Signed-off-by: Yedidya Benshimol <yedidya.ben.shimol@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-5-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Add a new version of sta configuration command
which includes these wifi8 features:
1. LDPC X2 CW size support indication
2. Indication if ICF frame is needed instead of RTS
3. support for MIC padding delays for protected control frames
Signed-off-by: Yedidya Benshimol <yedidya.ben.shimol@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-4-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Range response version 10 removes the rx and tx rates fields.
These fields aren't used by the driver anyway, so no change is
needed to support it.
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-3-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Since the FW is the one to assign an ID to a BA, it can happen that
the FW sends a bar_frame_release_notif before the driver had the chance to
allocate the BAID.
Convert the IWL_FW_CHECK into a regular debug print.
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250506194102.3407967-2-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
|
|
Cong Wang says:
====================
net_sched: Fix gso_skb flushing during qdisc change
This patchset contains a bug fix and its test cases, please check each
patch description for more details. To keep the bug fix minimum, I
intentionally limit the code changes to the cases reported here.
---
v2: added a missing qlen--
fixed the new boolean parameter for two qdiscs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Added new test cases for FQ, FQ_CODEL, FQ_PIE, and HHF qdiscs to verify queue
trimming behavior when the qdisc limit is dynamically reduced.
Each test injects packets, reduces the qdisc limit, and checks that the new
limit is enforced. This is still best effort since timing qdisc backlog
is not easy.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Previously, when reducing a qdisc's limit via the ->change() operation, only
the main skb queue was trimmed, potentially leaving packets in the gso_skb
list. This could result in NULL pointer dereference when we only check
sch->limit against sch->q.qlen.
This patch introduces a new helper, qdisc_dequeue_internal(), which ensures
both the gso_skb list and the main queue are properly flushed when trimming
excess packets. All relevant qdiscs (codel, fq, fq_codel, fq_pie, hhf, pie)
are updated to use this helper in their ->change() routines.
Fixes: 76e3cc126bb2 ("codel: Controlled Delay AQM")
Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM")
Fixes: afe4fd062416 ("pkt_sched: fq: Fair Queue packet scheduler")
Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler")
Fixes: 10239edf86f1 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc")
Fixes: d4b36210c2e6 ("net: pkt_sched: PIE AQM scheme")
Reported-by: Will <willsroot@protonmail.com>
Reported-by: Savy <savy@syst3mfailure.io>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
meson_ddr_pmu_create()
The Amlogic DDR PMU driver meson_ddr_pmu_create() function incorrectly uses
smp_processor_id(), which assumes disabled preemption. This leads to kernel
warnings during module loading because meson_ddr_pmu_create() can be called
in a preemptible context.
Following kernel warning and stack trace:
[ 31.745138] [ T2289] BUG: using smp_processor_id() in preemptible [00000000] code: (udev-worker)/2289
[ 31.745154] [ T2289] caller is debug_smp_processor_id+0x28/0x38
[ 31.745172] [ T2289] CPU: 4 UID: 0 PID: 2289 Comm: (udev-worker) Tainted: GW 6.14.0-0-MANJARO-ARM #1 59519addcbca6ba8de735e151fd7b9e97aac7ff0
[ 31.745181] [ T2289] Tainted: [W]=WARN
[ 31.745183] [ T2289] Hardware name: Hardkernel ODROID-N2Plus (DT)
[ 31.745188] [ T2289] Call trace:
[ 31.745191] [ T2289] show_stack+0x28/0x40 (C)
[ 31.745199] [ T2289] dump_stack_lvl+0x4c/0x198
[ 31.745205] [ T2289] dump_stack+0x20/0x50
[ 31.745209] [ T2289] check_preemption_disabled+0xec/0xf0
[ 31.745213] [ T2289] debug_smp_processor_id+0x28/0x38
[ 31.745216] [ T2289] meson_ddr_pmu_create+0x200/0x560 [meson_ddr_pmu_g12 8095101c49676ad138d9961e3eddaee10acca7bd]
[ 31.745237] [ T2289] g12_ddr_pmu_probe+0x20/0x38 [meson_ddr_pmu_g12 8095101c49676ad138d9961e3eddaee10acca7bd]
[ 31.745246] [ T2289] platform_probe+0x98/0xe0
[ 31.745254] [ T2289] really_probe+0x144/0x3f8
[ 31.745258] [ T2289] __driver_probe_device+0xb8/0x180
[ 31.745261] [ T2289] driver_probe_device+0x54/0x268
[ 31.745264] [ T2289] __driver_attach+0x11c/0x288
[ 31.745267] [ T2289] bus_for_each_dev+0xfc/0x160
[ 31.745274] [ T2289] driver_attach+0x34/0x50
[ 31.745277] [ T2289] bus_add_driver+0x160/0x2b0
[ 31.745281] [ T2289] driver_register+0x78/0x120
[ 31.745285] [ T2289] __platform_driver_register+0x30/0x48
[ 31.745288] [ T2289] init_module+0x30/0xfe0 [meson_ddr_pmu_g12 8095101c49676ad138d9961e3eddaee10acca7bd]
[ 31.745298] [ T2289] do_one_initcall+0x11c/0x438
[ 31.745303] [ T2289] do_init_module+0x68/0x228
[ 31.745311] [ T2289] load_module+0x118c/0x13a8
[ 31.745315] [ T2289] __arm64_sys_finit_module+0x274/0x390
[ 31.745320] [ T2289] invoke_syscall+0x74/0x108
[ 31.745326] [ T2289] el0_svc_common+0x90/0xf8
[ 31.745330] [ T2289] do_el0_svc+0x2c/0x48
[ 31.745333] [ T2289] el0_svc+0x60/0x150
[ 31.745337] [ T2289] el0t_64_sync_handler+0x80/0x118
[ 31.745341] [ T2289] el0t_64_sync+0x1b8/0x1c0
Changes replaces smp_processor_id() with raw_smp_processor_id() to
ensure safe CPU ID retrieval in preemptible contexts.
Cc: Jiucheng Xu <jiucheng.xu@amlogic.com>
Fixes: 2016e2113d35 ("perf/amlogic: Add support for Amlogic meson G12 SoC DDR PMU driver")
Signed-off-by: Anand Moon <linux.amoon@gmail.com>
Link: https://lore.kernel.org/r/20250407063206.5211-1-linux.amoon@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Somehow the encodings for REQ2/SNP2 channels in XP events
got mixed up... Unmix them.
CC: stable@vger.kernel.org
Fixes: 23760a014417 ("perf/arm-cmn: Add CMN-700 support")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/087023e9737ac93d7ec7a841da904758c254cb01.1746717400.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Joel Savitz <jsavitz@redhat.com> says:
The two patches are independent of each other. The first patch removes
unnecssary NULL guards from free_nsproxy() and create_new_namespaces()
in line with other usage of the put_*_ns() call sites. The second patch
slightly reduces the size of the kernel when CONFIG_CGROUPS is not
selected.
* patches from https://lore.kernel.org/20250508184930.183040-1-jsavitz@redhat.com:
include/cgroup: separate {get,put}_cgroup_ns no-op case
kernel/nsproxy: remove unnecessary guards
Link: https://lore.kernel.org/20250508184930.183040-1-jsavitz@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When CONFIG_CGROUPS is not selected, {get,put}_cgroup_ns become no-ops
and therefore it is not necessary to compile in the code for changing
the reference count.
When CONFIG_CGROUP is selected, there is no valid case where
either of {get,put}_cgroup_ns() will be called with a NULL argument.
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Link: https://lore.kernel.org/20250508184930.183040-3-jsavitz@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In free_nsproxy() and the error path of create_new_namesapces() the
put_*_ns() calls are guarded by unnecessary NULL checks.
put_pid_ns(), put_ipc_ns(), put_uts_ns(), and put_time_ns() will never
receive a NULL argument unless their namespace type is disabled, and in
this case all four become no-ops at compile time anyway. put_mnt_ns()
will never receive a null argument at any time.
This unguarded usage is in line with other call sites of put_*_ns().
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Link: https://lore.kernel.org/20250508184930.183040-2-jsavitz@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Using FREEZE_HOLDER_USERSPACE has two consequences:
(1) If userspace freezes the filesystem after mnt_drop_write_file() but
before freeze_super() was called filesystem resizing will fail
because the freeze isn't marked as nestable.
(2) If the kernel has successfully frozen the filesystem via
FREEZE_HOLDER_USERSPACE userspace can simply undo it by using the
FITHAW ioctl.
Fix both issues by using FREEZE_HOLDER_KERNEL. It will nest with
FREEZE_HOLDER_USERSPACE and cannot be undone by userspace.
And it is the correct thing to do because the kernel temporarily freezes
the filesystem.
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Christian Brauner <brauner@kernel.org> says:
Now all the pieces are in place to actually allow the power subsystem to
freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.
Othwerwise it risks thawing filesystems it didn't own. This could be
done differently be e.g., keeping the filesystems that were actually
frozen on a list and then unfreezing them from that list. This is
disgustingly unclean though and reeks of an ugly hack.
If the filesystem is already frozen by the time we've frozen all
userspace processes we don't care to freeze it again. That's userspace's
job once the process resumes. We only actually freeze filesystems if we
absolutely have to and we ignore other failures to freeze.
We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.
What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.
* patches from https://lore.kernel.org/r/20250402-work-freeze-v2-0-6719a97b52ac@kernel.org:
kernfs: add warning about implementing freeze/thaw
power: freeze filesystems during suspend/resume
Link: https://lore.kernel.org/r/20250402-work-freeze-v2-0-6719a97b52ac@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Christian Brauner <brauner@kernel.org> says:
Allow efivarfs to partake to resync variable state during system
hibernation and suspend. Add freeze/thaw support.
This is a pretty straightforward implementation. We simply add regular
freeze/thaw support for both userspace and the kernel. This works
without any big issues and congrats afaict efivars is the first
pseudofilesystem that adds support for filesystem freezing and thawing.
The simplicity comes from the fact that we simply always resync variable
state after efivarfs has been frozen. It doesn't matter whether that's
because of suspend, userspace initiated freeze or hibernation. Efivars
is simple enough that it doesn't matter that we walk all dentries. There
are no directories and there aren't insane amounts of entries and both
freeze/thaw are already heavy-handed operations. If userspace initiated
a freeze/thaw cycle they would need CAP_SYS_ADMIN in the initial user
namespace (as that's where efivarfs is mounted) so it can't be triggered
by random userspace. IOW, we really really don't care.
* patches from https://lore.kernel.org/r/20250331-work-freeze-v1-0-6dfbe8253b9f@kernel.org:
efivarfs: support freeze/thaw
libfs: export find_next_child()
Link: https://lore.kernel.org/r/20250331-work-freeze-v1-0-6dfbe8253b9f@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Sysfs is built on top of kernfs and sysfs provides the power management
infrastructure to support suspend/hibernate by writing to various files
in /sys/power/. As filesystems may be automatically frozen during
suspend/hibernate implementing freeze/thaw support for kernfs
generically will cause deadlocks as the suspending/hibernation
initiating task will hold a VFS lock that it will then wait upon to be
released. If freeze/thaw for kernfs is needed talk to the VFS.
Link: https://lore.kernel.org/r/20250402-work-freeze-v2-4-6719a97b52ac@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Allow efivarfs to partake to resync variable state during system
hibernation and suspend. Add freeze/thaw support.
This is a pretty straightforward implementation. We simply add regular
freeze/thaw support for both userspace and the kernel. This works
without any big issues and congrats afaict efivars is the first
pseudofilesystem that adds support for filesystem freezing and thawing.
The simplicity comes from the fact that we simply always resync variable
state after efivarfs has been frozen. It doesn't matter whether that's
because of suspend, userspace initiated freeze or hibernation. Efivars
is simple enough that it doesn't matter that we walk all dentries. There
are no directories and there aren't insane amounts of entries and both
freeze/thaw are already heavy-handed operations. We really really don't
need to care.
Link: https://lore.kernel.org/r/20250331-work-freeze-v1-2-6dfbe8253b9f@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Now all the pieces are in place to actually allow the power subsystem
to freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.
We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.
What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.
We add a new sysctl know /sys/power/freeze_filesystems that will allow
userspace to freeze filesystems during suspend/hibernate. For now it
defaults to off. The thaw logic doesn't require checking whether
freezing is enabled because the power subsystem exclusively owns frozen
filesystems for the duration of suspend/hibernate and is able to skip
filesystems it doesn't need to freeze.
Also it is technically possible that filesystem
filesystem_freeze_enabled is true and power freezes the filesystems but
before freezing all processes another process disables
filesystem_freeze_enabled. If power were to place the filesystems_thaw()
call under filesystems_freeze_enabled it would fail to thaw the
fileystems it frozw. The exclusive holder mechanism makes it possible to
iterate through the list without any concern making sure that no
filesystems are left frozen.
Link: https://lore.kernel.org/r/20250402-work-freeze-v2-3-6719a97b52ac@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Export find_next_child() so it can be used by efivarfs.
Keep it internal for now. There's no reason to advertise this
kernel-wide.
Link: https://lore.kernel.org/r/20250331-work-freeze-v1-1-6dfbe8253b9f@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Christian Brauner <brauner@kernel.org> says:
Add the necessary infrastructure changes to support freezing for suspend
and hibernate. This should all that's needed to wire up power.
* patches from https://lore.kernel.org/r/20250329-work-freeze-v2-0-a47af37ecc3d@kernel.org:
super: add filesystem freezing helpers for suspend and hibernate
gfs2: pass through holder from the VFS for freeze/thaw
super: use common iterator (Part 2)
super: use a common iterator (Part 1)
super: skip dying superblocks early
super: simplify user_get_super()
super: remove pointless s_root checks
Link: https://lore.kernel.org/r/20250329-work-freeze-v2-0-a47af37ecc3d@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|