Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Documentation updates. This is the second in a series from an ongoing
review of the RCU documentation.
- Miscellaneous fixes.
- Introduce a default-off Kconfig option that depends on RCU_NOCB_CPU
that, on CPUs mentioned in the nohz_full or rcu_nocbs boot-argument
CPU lists, causes call_rcu() to introduce delays.
These delays result in significant power savings on nearly idle
Android and ChromeOS systems. These savings range from a few percent
to more than ten percent.
This series also includes several commits that change call_rcu() to a
new call_rcu_hurry() function that avoids these delays in a few
cases, for example, where timely wakeups are required. Several of
these are outside of RCU and thus have acks and reviews from the
relevant maintainers.
- Create an srcu_read_lock_nmisafe() and an srcu_read_unlock_nmisafe()
for architectures that support NMIs, but which do not provide
NMI-safe this_cpu_inc(). These NMI-safe SRCU functions are required
by the upcoming lockless printk() work by John Ogness et al.
- Changes providing minor but important increases in torture test
coverage for the new RCU polled-grace-period APIs.
- Changes to torturescript that avoid redundant kernel builds, thus
providing about a 30% speedup for the torture.sh acceptance test.
* tag 'rcu.2022.12.02a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (49 commits)
net: devinet: Reduce refcount before grace period
net: Use call_rcu_hurry() for dst_release()
workqueue: Make queue_rcu_work() use call_rcu_hurry()
percpu-refcount: Use call_rcu_hurry() for atomic switch
scsi/scsi_error: Use call_rcu_hurry() instead of call_rcu()
rcu/rcutorture: Use call_rcu_hurry() where needed
rcu/rcuscale: Use call_rcu_hurry() for async reader test
rcu/sync: Use call_rcu_hurry() instead of call_rcu
rcuscale: Add laziness and kfree tests
rcu: Shrinker for lazy rcu
rcu: Refactor code a bit in rcu_nocb_do_flush_bypass()
rcu: Make call_rcu() lazy to save power
rcu: Implement lockdep_rcu_enabled for !CONFIG_DEBUG_LOCK_ALLOC
srcu: Debug NMI safety even on archs that don't require it
srcu: Explain the reason behind the read side critical section on GP start
srcu: Warn when NMI-unsafe API is used in NMI
arch/s390: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option
arch/loongarch: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option
rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()
rcu-tasks: Make grace-period-age message human-readable
...
|
|
Merge devfreq updates and cpupower utility updates for 6.2-rc1:
- Add a private governor_data for devfreq governors (Kant Fan).
- Reorganize devfreq code to use device_match_of_node() and
devm_platform_get_and_ioremap_resource() instead of open coding
them (ye xingchen, Minghao Chi).
- Make cpupower choose base_cpu to display default cpupower details
instead of picking CPU 0 (Saket Kumar Bhaskar).
- Add Georgian translation to cpupower documentation (Zurab
Kargareteli).
- Introduce powercap intel-rapl library, powercap-info command, and
RAPL monitor into cpupower (Thomas Renninger).
* pm-devfreq:
PM / devfreq: event: use devm_platform_get_and_ioremap_resource()
PM / devfreq: event: Use device_match_of_node()
PM / devfreq: Use device_match_of_node()
PM/devfreq: governor: Add a private governor_data for governor
* pm-tools:
cpupower: rapl monitor - shows the used power consumption in uj for each rapl domain
cpupower: Introduce powercap intel-rapl library and powercap-info command
cpupower: Add Georgian translation
tools/cpupower: Choose base_cpu to display default cpupower details
|
|
Merge cpufreq changes for 6.2-rc1:
- Generalize of_perf_domain_get_sharing_cpumask phandle format (Hector
Martin).
- Add new cpufreq driver for Apple SoC CPU P-states (Hector Martin).
- Update Qualcomm cpufreq driver, including:
* CPU clock provider support,
* Generic cleanups or reorganization.
* Potential memleak fix.
* Fix of the return value of cpufreq_driver->get().
(Manivannan Sadhasivam, Chen Hui).
- Update Qualcomm cpufreq driver's DT bindings, including:
* Support for CPU clock provider.
* Missing cache-related properties fixes.
* Support for QDU1000/QRU1000.
(Manivannan Sadhasivam, Rob Herring, Melody Olvera).
- Add support for ti,am625 SoC and enable build of ti-cpufreq for
ARCH_K3 (Dave Gerlach, and Vibhore Vardhan).
- Use flexible array to simplify memory allocation in the tegra186
cpufreq driver (Christophe JAILLET).
- Convert cpufreq statistics code to use sysfs_emit_at() (ye xingchen).
- Allow intel_pstate to use no-HWP mode on Sapphire Rapids (Giovanni
Gherdovich).
- Add missing pci_dev_put() to the amd_freq_sensitivity cpufreq driver
(Xiongfeng Wang).
- Initialize the kobj_unregister completion before calling
kobject_init_and_add() in the cpufreq core code (Yongqiang Liu).
- Defer setting boost MSRs in the ACPI cpufreq driver (Stuart Hayes,
Nathan Chancellor).
- Make intel_pstate accept initial EPP value of 0x80 (Srinivas
Pandruvada).
- Make read-only array sys_clk_src in the SPEAr cpufreq driver static
(Colin Ian King).
- Make array speeds in the longhaul cpufreq driver static (Colin Ian
King).
- Use str_enabled_disabled() helper in the ACPI cpufreq driver (Andy
Shevchenko).
- Drop a reference to CVS from cpufreq documentation (Conghui Wang).
* pm-cpufreq: (30 commits)
cpufreq: Remove CVS version control contents from documentation
cpufreq: stats: Convert to use sysfs_emit_at() API
cpufreq: ACPI: Only set boost MSRs on supported CPUs
dt-bindings: cpufreq: cpufreq-qcom-hw: Add QDU1000/QRU1000 cpufreq
cpufreq: tegra186: Use flexible array to simplify memory allocation
cpufreq: intel_pstate: Add Sapphire Rapids support in no-HWP mode
cpufreq: amd_freq_sensitivity: Add missing pci_dev_put()
cpufreq: Init completion before kobject_init_and_add()
cpufreq: apple-soc: Add new driver to control Apple SoC CPU P-states
cpufreq: qcom-hw: Add CPU clock provider support
dt-bindings: cpufreq: cpufreq-qcom-hw: Add cpufreq clock provider
cpufreq: qcom-hw: Fix the frequency returned by cpufreq_driver->get()
cpufreq: ACPI: Remove unused variables 'acpi_cpufreq_online' and 'ret'
cpufreq: qcom-hw: Fix memory leak in qcom_cpufreq_hw_read_lut()
arm64: dts: ti: k3-am625-sk: Add 1.4GHz OPP
cpufreq: ti: Enable ti-cpufreq for ARCH_K3
arm64: dts: ti: k3-am625: Introduce operating-points table
cpufreq: dt-platdev: Blacklist ti,am625 SoC
cpufreq: ti-cpufreq: Add support for AM625
dt-bindings: cpufreq: qcom: Add missing cache related properties
...
|
|
Make ACPI power management changes, ACPI processor driver updates, ACPI
EC driver quirk and ACPI backlight driver updates for 6.2-rc1:
- Print full name paths of ACPI power resources objects during
enumeration (Kane Chen).
- Eliminate a compiler warning regarding a missing function prototype
in the ACPI power management code (Sudeep Holla).
- Fix and clean up the ACPI processor driver (Rafael Wysocki, Li Zhong,
Colin Ian King, Sudeep Holla).
- Add quirk for the HP Pavilion Gaming 15-cx0041ur to the ACPI EC
driver (Mia Kanashi).
- Add some mew ACPI backlight handling quirks and update some existing
ones (Hans de Goede).
- Make the ACPI backlight driver prefer the native backlight control
over vendor backlight control when possible (Hans de Goede).
* acpi-pm:
ACPI: PM: Silence missing prototype warning
ACPI: PM: Print full name path while adding power resource
* acpi-processor:
ACPI: processor: perflib: Adjust acpi_processor_notify_smm() return value
ACPI: processor: perflib: Rearrange acpi_processor_notify_smm()
ACPI: processor: perflib: Rearrange unregistration routine
ACPI: processor: perflib: Drop redundant parentheses
ACPI: processor: perflib: Adjust white space
ACPI: processor: idle: Drop unnecessary statements and parens
ACPI: processor: Silence missing prototype warnings
ACPI: processor_idle: Silence missing prototype warnings
ACPI: processor: throttling: remove variable count
ACPI: processor: idle: Check acpi_fetch_acpi_dev() return value
* acpi-ec:
ACPI: EC: Add quirk for the HP Pavilion Gaming 15-cx0041ur
* acpi-video:
ACPI: video: Prefer native over vendor
ACPI: video: Simplify __acpi_video_get_backlight_type()
ACPI: video: Add force_native quirk for Sony Vaio VPCY11S1E
ACPI: video: Add force_vendor quirk for Sony Vaio PCG-FRV35
ACPI: video: Change Sony Vaio VPCEH3U1E quirk to force_native
ACPI: video: Change GIGABYTE GB-BXBT-2807 quirk to force_none
ACPI: video: Add a few bugtracker links to DMI quirks
|
|
Merge ACPI changes related to device enumeration, device object
managenet, operation region handling, table parsing and sysfs
interface:
- Use ZERO_PAGE(0) instead of empty_zero_page in the ACPI device
enumeration code (Giulio Benetti).
- Change the return type of the ACPI driver remove callback to void and
update its users accordingly (Dawei Li).
- Add general support for FFH address space type and implement the low-
level part of it for ARM64 (Sudeep Holla).
- Fix stale comments in the ACPI tables parsing code and make it print
more messages related to MADT (Hanjun Guo, Huacai Chen).
- Replace invocations of generic library functions with more kernel-
specific counterparts in the ACPI sysfs interface (Christophe JAILLET,
Xu Panda).
* acpi-scan:
ACPI: scan: substitute empty_zero_page with helper ZERO_PAGE(0)
* acpi-bus:
ACPI: FFH: Silence missing prototype warnings
ACPI: make remove callback of ACPI driver void
ACPI: bus: Fix the _OSC capability check for FFH OpRegion
arm64: Add architecture specific ACPI FFH Opregion callbacks
ACPI: Implement a generic FFH Opregion handler
* acpi-tables:
ACPI: tables: Fix the stale comments for acpi_locate_initial_tables()
ACPI: tables: Print CORE_PIC information when MADT is parsed
* acpi-sysfs:
ACPI: sysfs: use sysfs_emit() to instead of scnprintf()
ACPI: sysfs: Use kstrtobool() instead of strtobool()
|
|
Merge ACPICA changes, including bug fixes and cleanups as well as support
for some recently defined data structures, for 6.2-rc1:
- Make acpi_ex_load_op() match upstream implementation (Rafael Wysocki).
- Add support for loong_arch-specific APICs in MADT (Huacai Chen).
- Add support for fixed PCIe wake event (Huacai Chen).
- Add EBDA pointer sanity checks (Vit Kabele).
- Avoid accessing VGA memory when EBDA < 1KiB (Vit Kabele).
- Add CCEL table support to both compiler/disassembler (Kuppuswamy
Sathyanarayanan).
- Add a couple of new UUIDs to the known UUID list (Bob Moore).
- Add support for FFH Opregion special context data (Sudeep Holla).
- Improve warning message for "invalid ACPI name" (Bob Moore).
- Add support for CXL 3.0 structures (CXIMS & RDPAS) in the CEDT table
(Alison Schofield).
- Prepare IORT support for revision E.e (Robin Murphy).
- Finish support for the CDAT table (Bob Moore).
- Fix error code path in acpi_ds_call_control_method() (Rafael Wysocki).
- Fix use-after-free in acpi_ut_copy_ipackage_to_ipackage() (Li Zetao).
- Update the version of the ACPICA code in the kernel (Bob Moore).
* acpica:
ACPICA: Fix use-after-free in acpi_ut_copy_ipackage_to_ipackage()
ACPICA: Fix error code path in acpi_ds_call_control_method()
ACPICA: Update version to 20221020
ACPICA: Add utcksum.o to the acpidump Makefile
Revert "LoongArch: Provisionally add ACPICA data structures"
ACPICA: Finish support for the CDAT table
ACPICA: IORT: Update for revision E.e
ACPICA: Add CXL 3.0 structures (CXIMS & RDPAS) to the CEDT table
ACPICA: Improve warning message for "invalid ACPI name"
ACPICA: Add support for FFH Opregion special context data
ACPICA: Add a couple of new UUIDs to the known UUID list
ACPICA: iASL: Add CCEL table to both compiler/disassembler
ACPICA: Do not touch VGA memory when EBDA < 1ki_b
ACPICA: Check that EBDA pointer is in valid memory
ACPICA: Events: Support fixed PCIe wake event
ACPICA: MADT: Add loong_arch-specific APICs support
ACPICA: Make acpi_ex_load_op() match upstream
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
linux-can-next-for-6.2-20221212
this is a pull request of 39 patches for net-next/master.
The first 2 patches are by me fix a warning and coding style in the
kvaser_usb driver.
Vivek Yadav's patch sorts the includes of the m_can driver.
Biju Das contributes 5 patches for the rcar_canfd driver improve the
support for different IP core variants.
Jean Delvare's patch for the ctucanfd drops the dependency on
COMPILE_TEST.
Vincent Mailhol's patch sorts the includes of the etas_es58x driver.
Haibo Chen's contributes 2 patches that add i.MX93 support to the
flexcan driver.
Lad Prabhakar's patch updates the dt-bindings documentation of the
rcar_canfd driver.
Minghao Chi's patch converts the c_can platform driver to
devm_platform_get_and_ioremap_resource().
In the next 7 patches Vincent Mailhol adds devlink support to the
etas_es58x driver to report firmware, bootloader and hardware version.
Xu Panda's patch converts a strncpy() -> strscpy() in the ucan driver.
Ye Bin's patch removes a useless parameter from the AF_CAN protocol.
The next 2 patches by Vincent Mailhol and remove unneeded or unused
pointers to struct usb_interface in device's priv struct in the ucan
and gs_usb driver.
Vivek Yadav's patch cleans up the usage of the RAM initialization in
the m_can driver.
A patch by me add support for SO_MARK to the AF_CAN protocol.
Geert Uytterhoeven's patch fixes the number of CAN channels in the
rcan_canfd bindings documentation.
In the last 11 patches Markus Schneider-Pargmann optimizes the
register access in the t_can driver and cleans up the tcan glue
driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As discussed in [1], abbreviating the bootloader to "bl" might not be
well understood. Instead, a bootloader technically being a firmware,
name it "fw.bootloader".
Add a new macro to devlink.h to formalize this new info attribute name
and update the documentation.
[1] https://lore.kernel.org/netdev/20221128142723.2f826d20@kernel.org/
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Link: https://lore.kernel.org/all/20221130174658.29282-5-mailhol.vincent@wanadoo.fr
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
usb_cache_string() can also be useful for the drivers so export it.
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/all/20221130174658.29282-4-mailhol.vincent@wanadoo.fr
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
There are two nat functions are nearly the same in both OVS and
TC code, (ovs_)ct_nat_execute() and ovs_ct_nat/tcf_ct_act_nat().
This patch creates nf_nat_ovs.c under netfilter and moves them
there then exports nf_ct_nat() so that it can be shared by both
OVS and TC, and keeps the nat (type) check and nat flag update
in OVS and TC's own place, as these parts are different between
OVS and TC.
Note that in OVS nat function it was using skb->protocol to get
the proto as it already skips vlans in key_extract(), while it
doesn't in TC, and TC has to call skb_protocol() to get proto.
So in nf_ct_nat_execute(), we keep using skb_protocol() which
works for both OVS and TC contrack.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Now, it's possible to convert USO vnet packets from/to skb.
Added support for GSO_UDP_L4 offload.
Signed-off-by: Andrew Melnychenko <andrew@daynix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Added new GSO type for USO: VIRTIO_NET_HDR_GSO_UDP_L4.
Feature VIRTIO_NET_F_HOST_USO allows to enable NETIF_F_GSO_UDP_L4.
Separated VIRTIO_NET_F_GUEST_USO4 & VIRTIO_NET_F_GUEST_USO6 features
required for Windows guests.
Signed-off-by: Andrew Melnychenko <andrew@daynix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Added 2 additional offlloads for USO(IPv4 & IPv6).
Separate offloads are required for Windows VM guests,
g.e. Windows may set USO rx only for IPv4.
Signed-off-by: Andrew Melnychenko <andrew@daynix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix the typo of 'suport' in kcov.h
Link: https://lkml.kernel.org/r/tencent_922CA94B789587D79FD154445D035AA19E07@qq.com
Signed-off-by: Rong Tao <rongtao@cestc.cn>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It is spurious to have some code out-side the include guard in a .h file.
Fix it.
Link: https://lkml.kernel.org/r/4dbaf427d4300edba6c6bbfaf4d57493b9bec6ee.1669565241.git.christophe.jaillet@wanadoo.fr
Fixes: 1fbaf8fc12a0 ("mm: add a io_mapping_map_user helper")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit ee62c6b2dc93 ("eventfd: change int to __u64 in eventfd_signal()")
forgot to change int to __u64 in the CONFIG_EVENTFD=n stub function.
Link: https://lkml.kernel.org/r/20221124140154.104680-1-zhangqilong3@huawei.com
Fixes: ee62c6b2dc93 ("eventfd: change int to __u64 in eventfd_signal()")
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Cc: Dylan Yudaken <dylany@fb.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
change "stat" to "start".
Link: https://lkml.kernel.org/r/20221207074011.GA151242@cloud
Fixes: c959924b0dc5 ("memory tiering: adjust hot threshold automatically")
Signed-off-by: Wang Yong <yongw.kernel@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The nodes= arg instructs the kernel to only scan the given nodes for
proactive reclaim. For example use cases, consider a 2 tier memory
system:
nodes 0,1 -> top tier
nodes 2,3 -> second tier
$ echo "1m nodes=0" > memory.reclaim
This instructs the kernel to attempt to reclaim 1m memory from node 0.
Since node 0 is a top tier node, demotion will be attempted first. This
is useful to direct proactive reclaim to specific nodes that are under
pressure.
$ echo "1m nodes=2,3" > memory.reclaim
This instructs the kernel to attempt to reclaim 1m memory in the second
tier, since this tier of memory has no demotion targets the memory will be
reclaimed.
$ echo "1m nodes=0,1" > memory.reclaim
Instructs the kernel to reclaim memory from the top tier nodes, which can
be desirable according to the userspace policy if there is pressure on the
top tiers. Since these nodes have demotion targets, the kernel will
attempt demotion first.
Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim""), the proactive reclaim interface memory.reclaim does both
reclaim and demotion. Reclaim and demotion incur different latency costs
to the jobs in the cgroup. Demoted memory would still be addressable by
the userspace at a higher latency, but reclaimed memory would need to
incur a pagefault.
The 'nodes' arg is useful to allow the userspace to control demotion and
reclaim independently according to its policy: if the memory.reclaim is
called on a node with demotion targets, it will attempt demotion first; if
it is called on a node without demotion targets, it will only attempt
reclaim.
Link: https://lkml.kernel.org/r/20221202223533.1785418-1-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: zefan li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: memcg: fix protection of reclaim target memcg", v3.
This series fixes a bug in calculating the protection of the reclaim
target memcg where we end up using stale effective protection values from
the last reclaim operation, instead of completely ignoring the protection
of the reclaim target as intended. More detailed explanation and examples
in patch 1, which includes the fix. Patches 2 & 3 introduce a selftest
case that catches the bug.
This patch (of 3):
When we are doing memcg reclaim, the intended behavior is that we
ignore any protection (memory.min, memory.low) of the target memcg (but
not its children). Ever since the patch pointed to by the "Fixes" tag,
we actually read a stale value for the target memcg protection when
deciding whether to skip the memcg or not because it is protected. If
the stale value happens to be high enough, we don't reclaim from the
target memcg.
Essentially, in some cases we may falsely skip reclaiming from the
target memcg of reclaim because we read a stale protection value from
last time we reclaimed from it.
During reclaim, mem_cgroup_calculate_protection() is used to determine the
effective protection (emin and elow) values of a memcg. The protection of
the reclaim target is ignored, but we cannot set their effective
protection to 0 due to a limitation of the current implementation (see
comment in mem_cgroup_protection()). Instead, we leave their effective
protection values unchaged, and later ignore it in
mem_cgroup_protection().
However, mem_cgroup_protection() is called later in
shrink_lruvec()->get_scan_count(), which is after the
mem_cgroup_below_{min/low}() checks in shrink_node_memcgs(). As a result,
the stale effective protection values of the target memcg may lead us to
skip reclaiming from the target memcg entirely, before calling
shrink_lruvec(). This can be even worse with recursive protection, where
the stale target memcg protection can be higher than its standalone
protection. See two examples below (a similar version of example (a) is
added to test_memcontrol in a later patch).
(a) A simple example with proactive reclaim is as follows. Consider the
following hierarchy:
ROOT
|
A
|
B (memory.min = 10M)
Consider the following scenario:
- B has memory.current = 10M.
- The system undergoes global reclaim (or memcg reclaim in A).
- In shrink_node_memcgs():
- mem_cgroup_calculate_protection() calculates the effective min (emin)
of B as 10M.
- mem_cgroup_below_min() returns true for B, we do not reclaim from B.
- Now if we want to reclaim 5M from B using proactive reclaim
(memory.reclaim), we should be able to, as the protection of the
target memcg should be ignored.
- In shrink_node_memcgs():
- mem_cgroup_calculate_protection() immediately returns for B without
doing anything, as B is the target memcg, relying on
mem_cgroup_protection() to ignore B's stale effective min (still 10M).
- mem_cgroup_below_min() reads the stale effective min for B and we
skip it instead of ignoring its protection as intended, as we never
reach mem_cgroup_protection().
(b) An more complex example with recursive protection is as follows.
Consider the following hierarchy with memory_recursiveprot:
ROOT
|
A (memory.min = 50M)
|
B (memory.min = 10M, memory.high = 40M)
Consider the following scenario:
- B has memory.current = 35M.
- The system undergoes global reclaim (target memcg is NULL).
- B will have an effective min of 50M (all of A's unclaimed protection).
- B will not be reclaimed from.
- Now allocate 10M more memory in B, pushing it above it's high limit.
- The system undergoes memcg reclaim from B (target memcg is B).
- Like example (a), we do nothing in mem_cgroup_calculate_protection(),
then call mem_cgroup_below_min(), which will read the stale effective
min for B (50M) and skip it. In this case, it's even worse because we
are not just considering B's standalone protection (10M), but we are
reading a much higher stale protection (50M) which will cause us to not
reclaim from B at all.
This is an artifact of commit 45c7f7e1ef17 ("mm, memcg: decouple
e{low,min} state mutations from protection checks") which made
mem_cgroup_calculate_protection() only change the state without returning
any value. Before that commit, we used to return MEMCG_PROT_NONE for the
target memcg, which would cause us to skip the
mem_cgroup_below_{min/low}() checks. After that commit we do not return
anything and we end up checking the min & low effective protections for
the target memcg, which are stale.
Update mem_cgroup_supports_protection() to also check if we are reclaiming
from the target, and rename it to mem_cgroup_unprotected() (now returns
true if we should not protect the memcg, much simpler logic).
Link: https://lkml.kernel.org/r/20221202031512.1365483-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20221202031512.1365483-2-yosryahmed@google.com
Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Chris Down <chris@chrisdown.name>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasily Averin <vasily.averin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Implement unshare in fsdax mode: copy data from srcmap to iomap.
Link: https://lkml.kernel.org/r/1669908753-169-1-git-send-email-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "fsdax,xfs: fix warning messages", v2.
Many testcases failed in dax+reflink mode with warning message in dmesg.
Such as generic/051,075,127. The warning message is like this:
[ 775.509337] ------------[ cut here ]------------
[ 775.509636] WARNING: CPU: 1 PID: 16815 at fs/dax.c:386 dax_insert_entry.cold+0x2e/0x69
[ 775.510151] Modules linked in: auth_rpcgss oid_registry nfsv4 algif_hash af_alg af_packet nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter ip_tables x_tables dax_pmem nd_pmem nd_btt sch_fq_codel configfs xfs libcrc32c fuse
[ 775.524288] CPU: 1 PID: 16815 Comm: fsx Kdump: loaded Tainted: G W 6.1.0-rc4+ #164 eb34e4ee4200c7cbbb47de2b1892c5a3e027fd6d
[ 775.524904] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.0-3-3 04/01/2014
[ 775.525460] RIP: 0010:dax_insert_entry.cold+0x2e/0x69
[ 775.525797] Code: c7 c7 18 eb e0 81 48 89 4c 24 20 48 89 54 24 10 e8 73 6d ff ff 48 83 7d 18 00 48 8b 54 24 10 48 8b 4c 24 20 0f 84 e3 e9 b9 ff <0f> 0b e9 dc e9 b9 ff 48 c7 c6 a0 20 c3 81 48 c7 c7 f0 ea e0 81 48
[ 775.526708] RSP: 0000:ffffc90001d57b30 EFLAGS: 00010082
[ 775.527042] RAX: 000000000000002a RBX: 0000000000000000 RCX: 0000000000000042
[ 775.527396] RDX: ffffea000a0f6c80 RSI: ffffffff81dfab1b RDI: 00000000ffffffff
[ 775.527819] RBP: ffffea000a0f6c40 R08: 0000000000000000 R09: ffffffff820625e0
[ 775.528241] R10: ffffc90001d579d8 R11: ffffffff820d2628 R12: ffff88815fc98320
[ 775.528598] R13: ffffc90001d57c18 R14: 0000000000000000 R15: 0000000000000001
[ 775.528997] FS: 00007f39fc75d740(0000) GS:ffff88817bc80000(0000) knlGS:0000000000000000
[ 775.529474] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 775.529800] CR2: 00007f39fc772040 CR3: 0000000107eb6001 CR4: 00000000003706e0
[ 775.530214] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 775.530592] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 775.531002] Call Trace:
[ 775.531230] <TASK>
[ 775.531444] dax_fault_iter+0x267/0x6c0
[ 775.531719] dax_iomap_pte_fault+0x198/0x3d0
[ 775.532002] __xfs_filemap_fault+0x24a/0x2d0 [xfs aa8d25411432b306d9554da38096f4ebb86bdfe7]
[ 775.532603] __do_fault+0x30/0x1e0
[ 775.532903] do_fault+0x314/0x6c0
[ 775.533166] __handle_mm_fault+0x646/0x1250
[ 775.533480] handle_mm_fault+0xc1/0x230
[ 775.533810] do_user_addr_fault+0x1ac/0x610
[ 775.534110] exc_page_fault+0x63/0x140
[ 775.534389] asm_exc_page_fault+0x22/0x30
[ 775.534678] RIP: 0033:0x7f39fc55820a
[ 775.534950] Code: 00 01 00 00 00 74 99 83 f9 c0 0f 87 7b fe ff ff c5 fe 6f 4e 20 48 29 fe 48 83 c7 3f 49 8d 0c 10 48 83 e7 c0 48 01 fe 48 29 f9 <f3> a4 c4 c1 7e 7f 00 c4 c1 7e 7f 48 20 c5 f8 77 c3 0f 1f 44 00 00
[ 775.535839] RSP: 002b:00007ffc66a08118 EFLAGS: 00010202
[ 775.536157] RAX: 00007f39fc772001 RBX: 0000000000042001 RCX: 00000000000063c1
[ 775.536537] RDX: 0000000000006400 RSI: 00007f39fac42050 RDI: 00007f39fc772040
[ 775.536919] RBP: 0000000000006400 R08: 00007f39fc772001 R09: 0000000000042000
[ 775.537304] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
[ 775.537694] R13: 00007f39fc772000 R14: 0000000000006401 R15: 0000000000000003
[ 775.538086] </TASK>
[ 775.538333] ---[ end trace 0000000000000000 ]---
This also affects dax+noreflink mode if we run the test after a
dax+reflink test. So, the most urgent thing is solving the warning
messages.
With these fixes, most warning messages in dax_associate_entry() are gone.
But honestly, generic/388 will randomly failed with the warning. The
case shutdown the xfs when fsstress is running, and do it for many times.
I think the reason is that dax pages in use are not able to be invalidated
in time when fs is shutdown. The next time dax page to be associated, it
still remains the mapping value set last time. I'll keep on solving it.
The warning message in dax_writeback_one() can also be fixed because of
the dax unshare.
This patch (of 8):
fsdax page is used not only when CoW, but also mapread. To make the it
easily understood, use 'share' to indicate that the dax page is shared by
more than one extent. And add helper functions to use it.
Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED.
[ruansy.fnst@fujitsu.com: rename several functions]
Link: https://lkml.kernel.org/r/1669972991-246-1-git-send-email-ruansy.fnst@fujitsu.com
[ruansy.fnst@fujitsu.com: v2.2]
Link: https://lkml.kernel.org/r/1670381359-53-1-git-send-email-ruansy.fnst@fujitsu.com
Link: https://lkml.kernel.org/r/1669908538-55-1-git-send-email-ruansy.fnst@fujitsu.com
Link: https://lkml.kernel.org/r/1669908538-55-2-git-send-email-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "convert core hugetlb functions to folios", v5.
============== OVERVIEW ===========================
Now that many hugetlb helper functions that deal with hugetlb specific
flags[1] and hugetlb cgroups[2] are converted to folios, higher level
allocation, prep, and freeing functions within hugetlb can also be
converted to operate in folios.
Patch 1 of this series implements the wrapper functions around setting the
compound destructor and compound order for a folio. Besides the user
added in patch 1, patch 2 and patch 9 also use these helper functions.
Patches 2-10 convert the higher level hugetlb functions to folios.
============== TESTING ===========================
LTP:
Ran 10 back to back rounds of the LTP hugetlb test suite.
Gigantic Huge Pages:
Test allocation and freeing via hugeadm commands:
hugeadm --pool-pages-min 1GB:10
hugeadm --pool-pages-min 1GB:0
Demote:
Demote 1 1GB hugepages to 512 2MB hugepages
echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# 512
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# 0
[1] https://lore.kernel.org/lkml/20220922154207.1575343-1-sidhartha.kumar@oracle.com/
[2] https://lore.kernel.org/linux-mm/20221101223059.460937-1-sidhartha.kumar@oracle.com/
This patch (of 10):
Add folio equivalents for set_compound_order() and
set_compound_page_dtor().
Also remove extra new-lines introduced by mm/hugetlb: convert
move_hugetlb_state() to folios and mm/hugetlb_cgroup: convert
hugetlb_cgroup_uncharge_page() to folios.
[sidhartha.kumar@oracle.com: clarify folio_set_compound_order() zero support]
Link: https://lkml.kernel.org/r/20221207223731.32784-1-sidhartha.kumar@oracle.com
Link: https://lkml.kernel.org/r/20221129225039.82257-1-sidhartha.kumar@oracle.com
Link: https://lkml.kernel.org/r/20221129225039.82257-2-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Tarun Sahu <tsahu@linux.ibm.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Wei Chen <harperchen1110@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There are no longer any callers of lru_cache_add(), so remove it. This
saves 79 bytes of kernel text. Also cleanup some comments such that
they reference the new folio_add_lru() instead.
Link: https://lkml.kernel.org/r/20221101175326.13265-6-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Removing the lru_cache_add() wrapper".
This patchset replaces all calls of lru_cache_add() with the folio
equivalent: folio_add_lru(). This is allows us to get rid of the wrapper
The series passes xfstests and the userfaultfd selftests.
This patch (of 5):
Eliminates 7 calls to compound_head().
Link: https://lkml.kernel.org/r/20221101175326.13265-1-vishal.moola@gmail.com
Link: https://lkml.kernel.org/r/20221101175326.13265-2-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Generalise vmemmap_populate_hugepages() so ARM64 & X86 & LoongArch can
share its implementation.
Link: https://lkml.kernel.org/r/20221027125253.3458989-4-chenhuacai@loongson.cn
Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Min Zhou <zhoumin@loongson.cn>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Xuefeng Li <lixuefeng@loongson.cn>
Cc: Xuerui Wang <kernel@xen0n.name>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add sparse memory vmemmap support for LoongArch. SPARSEMEM_VMEMMAP uses a
virtually mapped memmap to optimise pfn_to_page and page_to_pfn
operations. This is the most efficient option when sufficient kernel
resources are available.
Link: https://lkml.kernel.org/r/20221027125253.3458989-3-chenhuacai@loongson.cn
Signed-off-by: Min Zhou <zhoumin@loongson.cn>
Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xuefeng Li <lixuefeng@loongson.cn>
Cc: Xuerui Wang <kernel@xen0n.name>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Return value from ptep_get_and_clear_full() directly instead of taking
this in another redundant variable.
Link: https://lkml.kernel.org/r/202211282107437343474@zte.com.cn
Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
"mm_khugepaged_collapse_file" for capturing is_shmem.
Currently, is_shmem is not being captured. Capturing is_shmem is useful
as it can indicate if tmpfs is being used as a backing store instead of
persistent storage. Add the tracepoint in collapse_file() named
"mm_khugepaged_collapse_file" for capturing is_shmem.
[gautammenghani201@gmail.com: swap is_shmem and addr to save space, per Steven Rostedt]
Link: https://lkml.kernel.org/r/20221202201807.182829-1-gautammenghani201@gmail.com
Link: https://lkml.kernel.org/r/20221026052218.148234-1-gautammenghani201@gmail.com
Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> [tracing]
Cc: David Hildenbrand <david@redhat.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Fortunately, the last user (KSM) is gone, so let's just remove this rather
special code from generic GUP handling -- especially because KSM never
required the PMD handling as KSM only deals with individual base pages.
[akpm@linux-foundation.org: fix merge snafu]Link: https://lkml.kernel.org/r/20221021101141.84170-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's add walk_page_range_vma(), which is similar to walk_page_vma(),
however, is only interested in a subset of the VMA range.
To be used in KSM code to stop using follow_page() next.
Link: https://lkml.kernel.org/r/20221021101141.84170-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All users -- GUP and KSM -- are gone, let's just remove it.
Link: https://lkml.kernel.org/r/20221021101141.84170-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
As Peter points out, the caller passes a single VMA and can just do that
check itself.
And in fact, no existing users rely on test_walk() getting called. So
let's just remove it and make the implementation slightly more efficient.
Link: https://lkml.kernel.org/r/20221021101141.84170-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Nine hotfixes.
Six for MM, three for other areas. Four of these patches address
post-6.0 issues"
* tag 'mm-hotfixes-stable-2022-12-10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
memcg: fix possible use-after-free in memcg_write_event_control()
MAINTAINERS: update Muchun Song's email
mm/gup: fix gup_pud_range() for dax
mmap: fix do_brk_flags() modifying obviously incorrect VMAs
mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit
tmpfs: fix data loss from failed fallocate
kselftests: cgroup: update kmem test precision tolerance
mm: do not BUG_ON missing brk mapping, because userspace can unmap it
mailmap: update Matti Vaittinen's email address
|
|
Change the run_estimation flag to start/stop the kthread tasks.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Allow the kthreads for stats to be configured for
specific cpulist (isolation) and niceness (scheduling
priority).
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Estimating all entries in single list in timer context
by single CPU causes large latency with multiple IPVS rules
as reported in [1], [2], [3].
Spread the estimator structures in multiple chains and
use kthread(s) for the estimation. The chains are processed
in multiple (50) timer ticks to ensure the 2-second interval
between estimations with some accuracy. Every chain is
processed under RCU lock.
Every kthread works over its own data structure and all
such contexts are attached to array. The contexts can be
preserved while the kthread tasks are stopped or restarted.
When estimators are removed, unused kthread contexts are
released and the slots in array are left empty.
First kthread determines parameters to use, eg. maximum
number of estimators to process per kthread based on
chain's length (chain_max), allowing sub-100us cond_resched
rate and estimation taking up to 1/8 of the CPU capacity
to avoid any problems if chain_max is not correctly
calculated.
chain_max is calculated taking into account factors
such as CPU speed and memory/cache speed where the
cache_factor (4) is selected from real tests with
current generation of CPU/NUMA configurations to
correct the difference in CPU usage between
cached (during calc phase) and non-cached (working) state
of the estimated per-cpu data.
First kthread also plays the role of distributor of
added estimators to all kthreads, keeping low the
time to add estimators. The optimization is based on
the fact that newly added estimator should be estimated
after 2 seconds, so we have the time to offload the
adding to chain from controlling process to kthread 0.
The allocated kthread context may grow from 1 to 50
allocated structures for timer ticks which saves memory for
setups with small number of estimators.
We also add delayed work est_reload_work that will
make sure the kthread tasks are properly started/stopped.
ip_vs_start_estimator() is changed to report errors
which allows to safely store the estimators in
allocated structures.
Many thanks to Jiri Wiesner for his valuable comments
and for spending a lot of time reviewing and testing
the changes on different platforms with 48-256 CPUs and
1-8 NUMA nodes under different cpufreq governors.
[1] Report from Yunhong Jiang:
https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/
[2]
https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2
[3] Report from Dust:
https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Tested-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Use the provided u64_stats_t type to avoid
load/store tearing.
Fixes: 316580b69d0a ("u64_stats: provide u64_stats_t type")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Tested-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Move alloc_percpu/free_percpu logic in new functions
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
In preparation to using RCU locking for the list
with estimators, make sure the struct ip_vs_stats
are released after RCU grace period by using RCU
callbacks. This affects ipvs->tot_stats where we
can not use RCU callbacks for ipvs, so we use
allocated struct ip_vs_stats_rcu. For services
and dests we force RCU callbacks for all cases.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
verifier.c:states_equal() must maintain register ID mapping across all
function frames. Otherwise the following example might be erroneously
marked as safe:
main:
fp[-24] = map_lookup_elem(...) ; frame[0].fp[-24].id == 1
fp[-32] = map_lookup_elem(...) ; frame[0].fp[-32].id == 2
r1 = &fp[-24]
r2 = &fp[-32]
call foo()
r0 = 0
exit
foo:
0: r9 = r1
1: r8 = r2
2: r7 = ktime_get_ns()
3: r6 = ktime_get_ns()
4: if (r6 > r7) goto skip_assign
5: r9 = r8
skip_assign: ; <--- checkpoint
6: r9 = *r9 ; (a) frame[1].r9.id == 2
; (b) frame[1].r9.id == 1
7: if r9 == 0 goto exit: ; mark_ptr_or_null_regs() transfers != 0 info
; for all regs sharing ID:
; (a) r9 != 0 => &frame[0].fp[-32] != 0
; (b) r9 != 0 => &frame[0].fp[-24] != 0
8: r8 = *r8 ; (a) r8 == &frame[0].fp[-32]
; (b) r8 == &frame[0].fp[-32]
9: r0 = *r8 ; (a) safe
; (b) unsafe
exit:
10: exit
While processing call to foo() verifier considers the following
execution paths:
(a) 0-10
(b) 0-4,6-10
(There is also path 0-7,10 but it is not interesting for the issue at
hand. (a) is verified first.)
Suppose that checkpoint is created at (6) when path (a) is verified,
next path (b) is verified and (6) is reached.
If states_equal() maintains separate 'idmap' for each frame the
mapping at (6) for frame[1] would be empty and
regsafe(r9)::check_ids() would add a pair 2->1 and return true,
which is an error.
If states_equal() maintains single 'idmap' for all frames the mapping
at (6) would be { 1->1, 2->2 } and regsafe(r9)::check_ids() would
return false when trying to add a pair 2->1.
This issue was suggested in the following discussion:
https://lore.kernel.org/bpf/CAEf4BzbFB5g4oUfyxk9rHy-PJSLQ3h8q9mV=rVoXfr_JVm8+1Q@mail.gmail.com/
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20221209135733.28851-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Clean up: Simplify the tracepoint's only call site.
Also, I noticed that when svc_authenticate() returns SVC_COMPLETE,
it leaves rq_auth_stat set to an error value. That doesn't need to
be recorded in the trace log.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
|
|
Add tracepoints to trace start and end of CB_RECALL_ANY operation.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
[ cel: added show_rca_mask() macro ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The delegation reaper is called by nfsd memory shrinker's on
the 'count' callback. It scans the client list and sends the
courtesy CB_RECALL_ANY to the clients that hold delegations.
To avoid flooding the clients with CB_RECALL_ANY requests, the
delegation reaper sends only one CB_RECALL_ANY request to each
client per 5 seconds.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
[ cel: moved definition of RCA4_TYPE_MASK_RDATA_DLG ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Steven Rostedt says:
> The include/trace/events/ directory should only hold files that
> are to create events, not headers that hold helper functions.
>
> Can you please move them out of include/trace/events/ as that
> directory is "special" in the creation of events.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
The svc_ungetu32 function is not used, you could remove it.
Signed-off-by: Li zeming <zeming@nfschina.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
ipsec-next 2022-12-09
1) Add xfrm packet offload core API.
From Leon Romanovsky.
2) Add xfrm packet offload support for mlx5.
From Leon Romanovsky and Raed Salem.
3) Fix a typto in a error message.
From Colin Ian King.
* tag 'ipsec-next-2022-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: (38 commits)
xfrm: Fix spelling mistake "oflload" -> "offload"
net/mlx5e: Open mlx5 driver to accept IPsec packet offload
net/mlx5e: Handle ESN update events
net/mlx5e: Handle hardware IPsec limits events
net/mlx5e: Update IPsec soft and hard limits
net/mlx5e: Store all XFRM SAs in Xarray
net/mlx5e: Provide intermediate pointer to access IPsec struct
net/mlx5e: Skip IPsec encryption for TX path without matching policy
net/mlx5e: Add statistics for Rx/Tx IPsec offloaded flows
net/mlx5e: Improve IPsec flow steering autogroup
net/mlx5e: Configure IPsec packet offload flow steering
net/mlx5e: Use same coding pattern for Rx and Tx flows
net/mlx5e: Add XFRM policy offload logic
net/mlx5e: Create IPsec policy offload tables
net/mlx5e: Generalize creation of default IPsec miss group and rule
net/mlx5e: Group IPsec miss handles into separate struct
net/mlx5e: Make clear what IPsec rx_err does
net/mlx5e: Flatten the IPsec RX add rule path
net/mlx5e: Refactor FTE setup code to be more clear
net/mlx5e: Move IPsec flow table creation to separate function
...
====================
Link: https://lore.kernel.org/r/20221209093310.4018731-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzkaller reported:
BUG: KASAN: slab-out-of-bounds in __build_skb_around+0x235/0x340 net/core/skbuff.c:294
Write of size 32 at addr ffff88802aa172c0 by task syz-executor413/5295
For bpf_prog_test_run_skb(), which uses a kmalloc()ed buffer passed to
build_skb().
When build_skb() is passed a frag_size of 0, it means the buffer came
from kmalloc. In these cases, ksize() is used to find its actual size,
but since the allocation may not have been made to that size, actually
perform the krealloc() call so that all the associated buffer size
checking will be correctly notified (and use the "new" pointer so that
compiler hinting works correctly). Split this logic out into a new
interface, slab_build_skb(), but leave the original 0 checking for now
to catch any stragglers.
Reported-by: syzbot+fda18eaa8c12534ccb3b@syzkaller.appspotmail.com
Link: https://groups.google.com/g/syzkaller-bugs/c/UnIKxTtU5-0/m/-wbXinkgAQAJ
Fixes: 38931d8989b5 ("mm: Make ksize() a reporting-only function")
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: pepsipu <soopthegoop@gmail.com>
Cc: syzbot+fda18eaa8c12534ccb3b@syzkaller.appspotmail.com
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: kasan-dev <kasan-dev@googlegroups.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: ast@kernel.org
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: jolsa@kernel.org
Cc: KP Singh <kpsingh@kernel.org>
Cc: martin.lau@linux.dev
Cc: Stanislav Fomichev <sdf@google.com>
Cc: song@kernel.org
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221208060256.give.994-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2022-12-08
1) Support range match action in SW steering
Yevgeny Kliteynik says:
=======================
The following patch series adds support for a range match action in
SW Steering.
SW steering is able to match only on the exact values of the packet fields,
as requested by the user: the user provides mask for the fields that are of
interest, and the exact values to be matched on when the traffic is handled.
The following patch series add new type of action - Range Match, where the
user provides a field to be matched on and a range of values (min to max)
that will be considered as hit.
There are several new notions that were implemented in order to support
Range Match:
- MATCH_RANGES Steering Table Entry (STE): the new STE type that allows
matching the packets' fields on the range of values instead of a specific
value.
- Match Definer: this is a general FW object that defines which fields
in the packet will be referenced by the mask and tag of each STE.
Match definer ID is part of STE fields, and it defines how the HW needs
to interpret the STE's mask/tag values.
Till now SW steering used the definers that were managed by FW and
implemented the STE layout as described by the HW spec.
Now that we're adding a new type of STE, SW steering needs to also be
able to define this new STE's layout, and this is do
=======================
2) From OZ add support for meter mtu offload
2.1: Refactor the code to allow both metering and range post actions as a
pre-step for adding police mtu offload support.
2.2: Instantiate mtu green/red flow tables with a single match-all rule.
Add the green/red actions to the hit/miss table accordingly
2.3: Initialize the meter object with the TC police mtu parameter.
Use the hardware range match action feature.
3) From MaorD, support routes with more than 2 nexthops in multipath
4) Michael and Or, improve and extend vport representor counters.
* tag 'mlx5-updates-2022-12-08' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Expose steering dropped packets counter
net/mlx5: Refactor and expand rep vport stat group
net/mlx5e: multipath, support routes with more than 2 nexthops
net/mlx5e: TC, add support for meter mtu offload
net/mlx5e: meter, add mtu post meter tables
net/mlx5e: meter, refactor to allow multiple post meter tables
net/mlx5: DR, Add support for range match action
net/mlx5: DR, Add function that tells if STE miss addr has been initialized
net/mlx5: DR, Some refactoring of miss address handling
net/mlx5: DR, Manage definers with refcounts
net/mlx5: DR, Handle FT action in a separate function
net/mlx5: DR, Rework is_fw_table function
net/mlx5: DR, Add functions to create/destroy MATCH_DEFINER general object
net/mlx5: fs, add match on ranges API
net/mlx5: mlx5_ifc updates for MATCH_DEFINER general object
====================
Link: https://lore.kernel.org/r/20221209001420.142794-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
memcg_write_event_control() accesses the dentry->d_name of the specified
control fd to route the write call. As a cgroup interface file can't be
renamed, it's safe to access d_name as long as the specified file is a
regular cgroup file. Also, as these cgroup interface files can't be
removed before the directory, it's safe to access the parent too.
Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a
call to __file_cft() which verified that the specified file is a regular
cgroupfs file before further accesses. The cftype pointer returned from
__file_cft() was no longer necessary and the commit inadvertently dropped
the file type check with it allowing any file to slip through. With the
invarients broken, the d_name and parent accesses can now race against
renames and removals of arbitrary files and cause use-after-free's.
Fix the bug by resurrecting the file type check in __file_cft(). Now that
cgroupfs is implemented through kernfs, checking the file operations needs
to go through a layer of indirection. Instead, let's check the superblock
and dentry type.
Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org
Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org> [3.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|