|
We will add color mgmt properties to DRM planes in the next patches and
we want to track when one of these properties changes to define atomic
commit behaviors. Using an approach similar to the CRTC color props, we
set a color_mgmt_changed boolean whenever a plane color prop changes.
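A minimal sketch of the approach, mirroring the existing CRTC flag
(field placement assumed, not quoted from the patch):

/* Mirrors drm_crtc_state->color_mgmt_changed: set by the property
 * write paths whenever a plane color management property changes,
 * checked during atomic commit. */
struct drm_plane_state {
	/* ... existing members ... */
	bool color_mgmt_changed : 1;
};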
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Melissa Wen <mwen@igalia.com>
Acked-by: Maxime Ripard <mripard@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Place it in drm_property where drm_property_replace_blob and
drm_property_lookup_blob live. Then we can use the DRM helper for
driver-specific KMS properties too.
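The helper in question is presumably drm_property_replace_blob_from_id();
a hedged sketch of its declaration next to the other blob helpers
(signature assumed):

int drm_property_replace_blob_from_id(struct drm_device *dev,
				      struct drm_property_blob **blob,
				      uint64_t blob_id,
				      ssize_t expected_size,
				      ssize_t expected_elem_size,
				      bool *replaced);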
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Signed-off-by: Melissa Wen <mwen@igalia.com>
Acked-by: Maxime Ripard <mripard@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
DRM_OBJECT_MAX_PROPERTY limits the number of properties that can be
attached, and we are increasing that value every time we add a new
property (generic or driver-specific).
In this series, we are adding 13 new KMS driver-specific properties for
AMD color management:
- CRTC Gamma enumerated Transfer Function
- Plane: Degamma LUT+size+TF, HDR multiplier, shaper LUT+size+TF, 3D
  LUT+size, blend LUT+size+TF (12)
Therefore, just increase DRM_OBJECT_MAX_PROPERTY to a number (64) that
accommodates these new properties and gives some room for others,
avoiding changes to this number every time we add a new KMS property.
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Reviewed-by: Simon Ser <contact@emersion.fr>
Signed-off-by: Melissa Wen <mwen@igalia.com>
Reviewed-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
Acked-by: Maxime Ripard <mripard@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
'fixes.2023.12.13a', 'rcu-tasks.2023.12.12b' and 'srcu.2023.12.13a' into rcu-merge.2023.12.13a
|
|
The brief summary in the docstring for function list_next_or_null_rcu()
states that the function is supposed to provide the "first" member of a
list, whereas in truth it returns the next member.
Change the docstring so it describes what the function actually does.
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
|
|
It is claimed that srcu_read_lock_nmisafe() is NMI-safe. However, it
triggers a lockdep complaint if used from NMI, because lockdep expects a
deadlock since nothing disables NMIs while the lock is acquired.
This is because commit f0f44752f5f61 ("rcu: Annotate SRCU's update-side
lockdep dependencies") annotates synchronize_srcu() as a write lock
usage. This helps to detect deadlocks such as:
srcu_read_lock();
synchronize_srcu();
srcu_read_unlock();
The side effect is that the lock srcu_struct now has a USED usage in normal
contexts, so it conflicts with a USED_READ usage in NMI. But this shouldn't
cause a real deadlock because the write lock usage from synchronize_srcu()
is a fake one, used only for read/write deadlock detection.
Use a try-lock annotation for srcu_read_lock_nmisafe() to avoid lockdep
complaints when it is used from NMI.
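A hedged sketch of the annotation change (helper name assumed): a trylock
acquisition is not considered for deadlock detection by lockdep, which
silences the false positive without losing the annotations elsewhere.

/* Existing annotation: a normal (non-try) acquisition. */
static inline void rcu_lock_acquire(struct lockdep_map *map)
{
	lock_acquire(map, 0, 0, 2, 0, NULL, _THIS_IP_);
}

/* Sketch of the fix: same, but with the trylock argument set, so
 * lockdep records the usage without expecting a deadlock. */
static inline void rcu_try_lock_acquire(struct lockdep_map *map)
{
	lock_acquire(map, 0, 1, 2, 0, NULL, _THIS_IP_);
}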
Fixes: f0f44752f5f6 ("rcu: Annotate SRCU's update-side lockdep dependencies")
Link: https://lore.kernel.org/r/20230927160231.XRCDDSK4@linutronix.de
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
|
|
Rename min_buffers_needed into min_queued_buffers and update
the documentation about it.
Signed-off-by: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
[hverkuil: Drop the change where min_queued_buffers + 1 buffers would be]
[hverkuil: allocated. Now this patch only renames this field instead of making]
[hverkuil: a functional change as well.]
[hverkuil: Renamed 3 remaining min_buffers_needed occurrences.]
|
|
There are a few quirks around using lazy wake for poll unconditionally,
and one of them is related to EPOLLEXCLUSIVE. Those may trigger
exclusive wakeups, which wake a limited number of entries in the wait
queue. If that wake number is less than the number of entries someone is
waiting for (and that someone is also using DEFER_TASKRUN), then we can
get stuck waiting for more entries while we should be processing the ones
we already got.
If we're doing exclusive poll waits, flag the request as not being
compatible with lazy wakeups.
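A hedged sketch of what such flagging could look like at poll arming time
(flag name hypothetical):

/* If the wait queue entry is exclusive, lazy wakeups could leave the
 * waiter short of entries; mark the request so the task_work wakeup
 * path never uses IOU_F_TWQ_LAZY_WAKE for it. */
if (poll->events & EPOLLEXCLUSIVE)
	req->flags |= REQ_F_POLL_NO_LAZY;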
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: 6ce4a93dbb5b ("io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This function was added with commit a8df7b1ab70b ("leds: add
led_trigger_rename function") 11 years ago, but it has no users. So
remove it.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/d90f30be-f661-4db7-b0b5-d09d07a78a68@gmail.com
Signed-off-by: Lee Jones <lee@kernel.org>
|
|
Subdev states store all standard pad configuration data, except for
frame intervals. Fix it by adding interval fields in the
v4l2_subdev_pad_config and v4l2_subdev_stream_config structures, with
corresponding accessor functions and a helper function to implement the
.get_frame_interval() operation.
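A hedged sketch of the accessor implied above (name and signature
assumed):

/* Returns a pointer to the frame interval stored in the subdev state
 * for the given pad, so drivers can rely on state-based helpers
 * instead of hand-rolled bookkeeping. */
struct v4l2_fract *
v4l2_subdev_state_get_interval(struct v4l2_subdev_state *state,
			       unsigned int pad);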
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
|
|
Due to a historical mishap, the v4l2_subdev_frame_interval structure
is the only part of the V4L2 subdev userspace API that doesn't contain a
'which' field. This prevents trying frame intervals using the subdev
'TRY' state mechanism.
Adding a 'which' field is simple, as the structure has 8 reserved fields.
This would however break userspace, as the field is currently set to 0,
corresponding to V4L2_SUBDEV_FORMAT_TRY, while the corresponding ioctls
currently operate on the 'ACTIVE' state. We thus need to add a new
subdev client cap, V4L2_SUBDEV_CLIENT_CAP_INTERVAL_USES_WHICH, to
indicate that userspace is aware of this new field.
All drivers that implement the subdev .get_frame_interval() and
.set_frame_interval() operations are updated to return -EINVAL when
operating on the TRY state, preserving the current behaviour.
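As a sketch, the per-driver update described above amounts to the
following guard (driver name hypothetical):

static int foo_get_frame_interval(struct v4l2_subdev *sd,
				  struct v4l2_subdev_state *state,
				  struct v4l2_subdev_frame_interval *fi)
{
	/* Preserve current behaviour: only the ACTIVE state is
	 * supported until the driver gains real TRY support. */
	if (fi->which != V4L2_SUBDEV_FORMAT_ACTIVE)
		return -EINVAL;

	/* ... fill fi->interval from the active configuration ... */
	return 0;
}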
While at it, fix a bad copy&paste in the documentation of the struct
v4l2_subdev_frame_interval_enum 'which' field.
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de> # for imx-media
Reviewed-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Reviewed-by: Luca Ceresoli <luca.ceresoli@bootlin.com> # for tegra-video
Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
|
|
The subdev .[gs]_frame_interval operations are video operations, but they
operate on pads (and even on streams). Not only is this confusing, it causes
practical issues for drivers as the operations don't receive a subdev
state pointer, requiring manual state handling.
To improve the situation, turn the operations into pad operations, and
extend them to receive a state pointer like other pad operations.
While at it, rename the operations to .[gs]et_frame_interval to match
the naming scheme of other pad operations. This isn't
strictly necessary, but given that all drivers using those operations
need to be modified, handling the rename separately would generate more
churn for very little gain (if at all).
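A hedged sketch of the converted operations as pad ops (prototypes
assumed from the description):

struct v4l2_subdev_pad_ops {
	/* ... */
	int (*get_frame_interval)(struct v4l2_subdev *sd,
				  struct v4l2_subdev_state *state,
				  struct v4l2_subdev_frame_interval *fi);
	int (*set_frame_interval)(struct v4l2_subdev *sd,
				  struct v4l2_subdev_state *state,
				  struct v4l2_subdev_frame_interval *fi);
};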
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de> # for imx-media
Reviewed-by: Luca Ceresoli <luca.ceresoli@bootlin.com> # for tegra-video
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
|
|
CONFIG_MEDIA_CONTROLLER
The subdev state locking macros and macros to get the active state are
currently behind CONFIG_MEDIA_CONTROLLER. This makes sense, as there can
be no subdev state without MC.
However, we have code paths common to MC and non-MC cases which call
subdev operations that have subdev state as a parameter. In the non-MC
case the state parameter would always be NULL.
Thus it makes sense to allow, e.g.:
v4l2_subdev_call_state_active(sd, pad, get_fmt, fmt)
which for non-MC case would call the subdev passing NULL as the state.
This currently fails:
https://lore.kernel.org/oe-kbuild-all/202312061101.PLrz5NnJ-lkp@intel.com/
Fix the issue by moving the related macros to be outside
CONFIG_MEDIA_CONTROLLER. The v4l2_subdev_lock_state() and
v4l2_subdev_unlock_state() macros will crash if given NULL as the state,
but the other macros behave correctly even when there's no active state,
and they will only call the lock/unlock macros if there is a state.
An alternative fix would be to make another version of
v4l2_subdev_call_state_try() with ifdefs, which would not use any state
macros and would always pass NULL as the state. But having two versions
of a macro/function is always more confusing than having just one, so I
went this way.
So, this fixes the v4l2_subdev_call_state_active() macro. But we also
have v4l2_subdev_call_state_try(). It would be possible to fix that
macro by additionally creating "no-op" variants of the state alloc and
free functions. However, v4l2_subdev_call_state_try() is only used by a
single driver (stm32-dcmi), which selects MC, and the macro is supposed
to be removed as soon as the users have been converted away from the
macro. Thus I have not touched the state alloc/free functions, and I
think it makes sense to keep alloc/free functions available only if
there's actually something that can be allocated or freed.
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Link: https://lore.kernel.org/r/20231208-v4l2-state-mc-fix-v1-1-a0c8162557c6@ideasonboard.com
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
|
|
Correct spellos reported by codespell.
Fix some grammar (add 's' to a few words).
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Maxime Ripard <mripard@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20231213044107.29214-1-rdunlap@infradead.org
|
|
Fix spelling problems as identified by codespell.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Maxime Ripard <mripard@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20231213043226.10046-1-rdunlap@infradead.org
|
|
Correct spelling mistakes that were identified by codespell.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Maxime Ripard <mripard@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20231213043925.13852-1-rdunlap@infradead.org
|
|
There are 2 HDMI, 2 DP and 2 eDP ports on rk3588, so add
corresponding endpoint definitions for them.
Signed-off-by: Andy Yan <andy.yan@rock-chips.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20231211115907.1785377-1-andyshrk@163.com
|
|
The clear and set pattern is commonly used for accessing PCI config space,
so move the helper pci_clear_and_set_dword() from aspm.c into the PCI
header. In addition, rename it to pci_clear_and_set_config_dword() to
retain the "config" information and match the other accessors.
No functional change intended.
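A minimal sketch of the pattern the helper encapsulates (implementation
assumed from the description: a read-modify-write of a config dword):

void pci_clear_and_set_config_dword(const struct pci_dev *dev, int pos,
				    u32 clear, u32 set)
{
	u32 val;

	pci_read_config_dword(dev, pos, &val);
	val &= ~clear;		/* clear the requested bits... */
	val |= set;		/* ...then set the new ones */
	pci_write_config_dword(dev, pos, val);
}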
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20231208025652.87192-4-xueshuai@linux.alibaba.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The Alibaba Vendor ID (0x1ded) is now used by Alibaba elasticRDMA ("erdma")
and will be shared with the upcoming PCIe PMU ("dwc_pcie_pmu"). Move the
Vendor ID to linux/pci_ids.h so that it can be shared by several drivers
later.
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci_ids.h
Tested-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20231208025652.87192-3-xueshuai@linux.alibaba.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Some sensors, e.g. Sony IMX290, are using little-endian registers. Add
support for those by encoding the endianness into Bit 20 of the register
address.
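A hedged sketch of the encoding (macro names assumed): the register width
already lives in bits 19:16 of the encoded address, and bit 20 now flags
little-endian registers.

#define CCI_REG_ADDR_MASK	GENMASK(15, 0)
#define CCI_REG_WIDTH_SHIFT	16
#define CCI_REG_LE		BIT(20)

/* e.g. a 16-bit little-endian register at address 'x': */
#define CCI_REG16_LE(x)	(CCI_REG_LE | (2U << CCI_REG_WIDTH_SHIFT) | (x))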
Fixes: af73323b9770 ("media: imx290: Convert to new CCI register access helpers")
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
[Sakari Ailus: Fixed commit message.]
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
|
|
The _store callbacks of the trip point temperature and hysteresis sysfs
attributes invoke thermal_notify_tz_trip_change() to send a notification
regarding the trip point change, but when trip points are updated by the
platform firmware, trip point change notifications are not sent.
To make the behavior after a trip point change more consistent,
modify all of the 3 places where trip point temperature is updated
to use a new function called thermal_zone_set_trip_temp() for this
purpose and make that function call thermal_notify_tz_trip_change().
Note that trip point hysteresis can only be updated via sysfs and
trip_point_hyst_store() calls thermal_notify_tz_trip_change() already,
so this code path need not be changed.
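A minimal sketch of the consolidated update path (signature and
notification details assumed):

void thermal_zone_set_trip_temp(struct thermal_zone_device *tz,
				struct thermal_trip *trip, int temp)
{
	lockdep_assert_held(&tz->lock);

	if (trip->temperature == temp)
		return;

	trip->temperature = temp;
	thermal_notify_tz_trip_change(tz, trip);
}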
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
There is no in-kernel function to get the status register of a tty device
like the TIOCMGET ioctl returns to userspace. Create a new function,
tty_get_tiocm(), to obtain the status register that other portions of the
kernel can call if they need this information, and move the existing
internal tty_tiocmget() function to use this interface.
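A hedged sketch of the split (bodies assumed from the description): the
exported helper returns the status register, and the ioctl path becomes a
thin wrapper around it.

int tty_get_tiocm(struct tty_struct *tty)
{
	int retval = -ENOTTY;

	if (tty->ops->tiocmget)
		retval = tty->ops->tiocmget(tty);

	return retval;
}

static int tty_tiocmget(struct tty_struct *tty, int __user *p)
{
	int retval = tty_get_tiocm(tty);

	if (retval >= 0)
		retval = put_user(retval, p);

	return retval;
}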
Signed-off-by: Florian Eckert <fe@dev.tdt.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20231127110311.3583957-2-fe@dev.tdt.de
Signed-off-by: Lee Jones <lee@kernel.org>
|
|
Add 2.5G, 5G and 10G as available speeds to the netdev LED trigger.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/99e7d3304c6bba7f4863a4a80764a869855f2085.1701143925.git.daniel@makrotopia.org
Signed-off-by: Lee Jones <lee@kernel.org>
|
|
The supported mode is currently reported to the user exactly the same as
the current mode, because mode changing is not implemented.
Remove the leftover mode_supported() op and use mode_get() to fill up
the supported mode exposed to the user.
If mode changing is ever introduced, this could be very easily brought
back. In the meantime, prevent drivers from implementing this in a wrong
way (as, for example, a recent netdevsim implementation attempt intended
to do).
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since commit 7c41cdcd3bbe ("OPP: Simplify the over-designed pstate <->
level dance"), there is no longer any users of the
pm_genpd_opp_to_performance_state() API. Let's therefore drop it and its
corresponding ->opp_to_performance_state() callback, which also no longer
has any users.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231127151931.47055-1-ulf.hansson@linaro.org
|
|
|
|
git://anongit.freedesktop.org/drm/drm-intel into drm-next
UAPI Changes:
- drm/i915: Implement fdinfo memory stats printing
Use the newly added drm_print_memory_stats helper to show memory
utilisation of our objects in drm/driver specific fdinfo output.
To collect the stats we walk the per memory regions object lists
and accumulate object size into the respective drm_memory_stats
categories.
Cross-subsystem Changes:
- Backmerge of drm-next (to bring drm-intel-next for PXP changes)
Driver Changes:
- Wa_18028616096 now applies to all DG2 (Matt R)
- Drop Wa_22014600077 on all DG2 (Matt R)
- Add new ATS-M device ID (Haridhar)
- More Meteorlake (MTL) workarounds (Matt R, Dnyaneshwar, Jonathan,
Gustavo, Radhakrishna)
- PMU WARN_ON cleanup on driver unbind (Umesh)
- Limit GGTT WC flushing workaround to pre BXT/ICL platforms
- Complement implementation for Wa_16018031267 / Wa_16018063123
(Andrzej, Jonathan, Nirmoy, Chris)
- Properly print internal GSC engine in trace logs (Tvrtko)
- Track gt pm wakerefs (Andrzej)
- Fix null deref bugs on perf code when perf is disabled (Harshit,
Tvrtko)
- Fix __i915_request_create memory leak on driver unbind (Andrzej)
- Remove spurious unsupported HuC message on MTL (Daniele)
- Read a shadowed mmio register for ggtt flush (Vinay)
- Add missing new-line to GT_TRACE (Andrzej)
- Add drm_dbgs for critical PXP events (Alan)
- Skip pxp init if gt is wedged (Zhanjun)
- Replace custom intel runtime_pm tracker with ref_tracker library
(Andrzej)
- Compiler warning/static checker/coding style cleanups (Arnd, Nirmoy,
Soumya, Gilbert, Dorcas, Kunwu, Sam, Tvrtko)
- Code structure and helper cleanups (Jani, Tvrtko, Andi)
- Selftest improvements (John, Tvrtko, Andrzej)
Signed-off-by: Dave Airlie <airlied@redhat.com>
# Conflicts:
# drivers/gpu/drm/i915/gt/intel_gt_mcr.c
From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/ZXNBcsSwJEVsq9On@jlahtine-mobl.ger.corp.intel.com
|
|
In the effort to reduce zombie memcgs [1], it was discovered that the
memcg LRU doesn't apply enough pressure on offlined memcgs. Specifically,
instead of rotating them to the tail of the current generation
(MEMCG_LRU_TAIL) for a second attempt, it moves them to the next
generation (MEMCG_LRU_YOUNG) after the first attempt.
Not applying enough pressure on offlined memcgs can cause them to build
up, and this can be particularly harmful to memory-constrained systems.
On Pixel 8 Pro, launching apps for 50 cycles:
              Before  After  Change
Zombie memcgs     45     35    -22%
[1] https://lore.kernel.org/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20231208061407.2125867-4-yuzhao@google.com
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: T.J. Mercier <tjmercier@google.com>
Tested-by: T.J. Mercier <tjmercier@google.com>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
While investigating kswapd "consuming 100% CPU" [1] (also see "mm/mglru:
try to stop at high watermarks"), it was discovered that the memcg LRU can
breach the thrashing protection imposed by min_ttl_ms.
Before the memcg LRU:
kswapd()
  shrink_node_memcgs()
    mem_cgroup_iter()
      inc_max_seq()  // always hit a different memcg
  lru_gen_age_node()
    mem_cgroup_iter()
      check the timestamp of the oldest generation
After the memcg LRU:
kswapd()
  shrink_many()
    restart:
      iterate the memcg LRU:
        inc_max_seq()  // occasionally hit the same memcg
        if raced with lru_gen_rotate_memcg():
          goto restart
  lru_gen_age_node()
    mem_cgroup_iter()
      check the timestamp of the oldest generation
Specifically, when the restart happens in shrink_many(), it needs to stick
with the (memcg LRU) generation it began with. In other words, it should
neither re-read memcg_lru->seq nor age an lruvec of a different
generation. Otherwise it can hit the same memcg multiple times without
giving lru_gen_age_node() a chance to check the timestamp of that memcg's
oldest generation (against min_ttl_ms).
[1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/
Link: https://lkml.kernel.org/r/20231208061407.2125867-3-yuzhao@google.com
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: T.J. Mercier <tjmercier@google.com>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Unmapped folios accessed through file descriptors can be underprotected.
Those folios are added to the oldest generation based on:
1. The fact that they are less costly to reclaim (no need to walk the
rmap and flush the TLB) and have less impact on performance (don't
cause major PFs and can be non-blocking if needed again).
2. The observation that they are likely to be single-use. E.g., for
client use cases like Android, its apps parse configuration files
and store the data in heap (anon); for server use cases like MySQL,
it reads from InnoDB files and holds the cached data for tables in
buffer pools (anon).
However, the oldest generation can be very short lived, and if so, it
doesn't provide the PID controller with enough time to respond to a surge
of refaults. (Note that the PID controller uses weighted refaults and
those from evicted generations only take a half of the whole weight.) In
other words, for a short lived generation, the moving average smooths out
the spike quickly.
To fix the problem:
1. For folios that are already on LRU, if they can be beyond the
tracking range of tiers, i.e., five accesses through file
descriptors, move them to the second oldest generation to give them
more time to age. (Note that tiers are used by the PID controller
to statistically determine whether folios accessed multiple times
through file descriptors are worth protecting.)
2. When adding unmapped folios to LRU, adjust the placement of them so
that they are not too close to the tail. The effect of this is
similar to the above.
On Android, launching 55 apps sequentially:
                            Before      After  Change
workingset_refault_anon   25641024   25598972      0%
workingset_refault_file  115016834  106178438     -8%
Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The cleanup tasks of kdamond threads, including the reset of the
corresponding DAMON context's ->kdamond field and the decrease of the
global nr_running_ctxs counter, are supposed to be executed by
kdamond_fn(). However, commit 0f91d13366a4 ("mm/damon: simplify stop
mechanism") made neither damon_start() nor damon_stop() ensure that the
corresponding kdamond has started the execution of kdamond_fn().
As a result, the cleanup can be skipped if damon_stop() is called fast
enough after the previous damon_start(). In particular, the skipped
reset of ->kdamond could cause a use-after-free.
Fix it by waiting for the start of kdamond_fn() execution from
damon_start().
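A hedged sketch of the fix (field name assumed): a completion signalled at
the top of kdamond_fn() lets damon_start() wait until the thread has
really begun, so the cleanup in kdamond_fn() can no longer be skipped.

/* In struct damon_ctx (placement assumed): */
struct completion kdamond_started;

static int kdamond_fn(void *data)
{
	struct damon_ctx *ctx = data;

	complete(&ctx->kdamond_started);
	/* ... monitoring loop, then the cleanup that must not be
	 * skipped: reset ctx->kdamond, decrease nr_running_ctxs ... */
	return 0;
}

/* In damon_start(), after kthread_run(): */
wait_for_completion(&ctx->kdamond_started);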
Link: https://lkml.kernel.org/r/20231208175018.63880-1-sj@kernel.org
Fixes: 0f91d13366a4 ("mm/damon: simplify stop mechanism")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Jakub Acs <acsjakub@amazon.de>
Cc: Changbin Du <changbin.du@intel.com>
Cc: Jakub Acs <acsjakub@amazon.de>
Cc: <stable@vger.kernel.org> # 5.15.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
After converting selinux to the VMA heap check helper, the gcl triggers an
execheap SELinux denial, which is caused by a changed logic check.
Previously selinux only checked that the VMA range was within the VMA heap
range, whereas the helper checks the intersection between the two ranges,
but the corner case (vm_end == start_brk, brk == vm_start) isn't handled
correctly.
Since commit 11250fd12eb8 ("mm: factor out VMA stack and heap checks") was
only a function extraction, it seems that the issue was introduced by
commit 0db0c01b53a1 ("procfs: fix /proc/<pid>/maps heap check"). Let's
fix the above corner cases and, meanwhile, correct the wrong indentation
of the stack and heap check helpers.
Fixes: 11250fd12eb8 ("mm: factor out VMA stack and heap checks")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Closes: https://lore.kernel.org/selinux/CAFqZXNv0SVT0fkOK6neP9AXbj3nxJ61JAY4+zJzvxqJaeuhbFw@mail.gmail.com/
Tested-by: Ondrej Mosnacek <omosnace@redhat.com>
Link: https://lkml.kernel.org/r/20231207152525.2607420-1-wangkefeng.wang@huawei.com
Cc: David Hildenbrand <david@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Linus Walleij says:
====================
Immutable tag for the PEF2256 framer
* tag 'pef2256-framer' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
MAINTAINERS: Add the Lantiq PEF2256 driver entry
pinctrl: Add support for the Lantiq PEF2256 pinmux
net: wan: framer: Add support for the Lantiq PEF2256 framer
dt-bindings: net: Add the Lantiq PEF2256 E1/T1/J1 framer
net: wan: Add framer framework support
====================
Link: https://lore.kernel.org/all/CACRpkdYT1J7noFUhObFgfA60XQAfL4rb=knEmWS__TKKtCMh7Q@mail.gmail.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Immutable tag for the PEF2256 framer
|
|
The Lantiq PEF2256 is a framer and line interface component designed to
fulfill all required interfacing between an analog E1/T1/J1 line and the
digital PCM system highway/H.100 bus.
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20231128132534.258459-4-herve.codina@bootlin.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
A framer is a component in charge of an E1/T1 line interface.
Connected usually to a TDM bus, it converts TDM frames to/from E1/T1
frames. It also provides information related to the E1/T1 line.
The framer framework provides a set of APIs for the framer drivers
(framer provider) to create/destroy a framer and APIs for the framer
users (framer consumer) to obtain a reference to the framer, and
use the framer.
This basic implementation provides a framer abstraction for:
- power on/off the framer
- get the framer status (line state)
- be notified on framer status changes
- get/set the framer configuration
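A hedged sketch of the consumer-facing API implied by the abstraction
above (names assumed):

struct framer *framer_get(struct device *dev, const char *con_id);
void framer_put(struct device *dev, struct framer *framer);

int framer_power_on(struct framer *framer);
int framer_power_off(struct framer *framer);

int framer_get_status(struct framer *framer, struct framer_status *status);
int framer_get_config(struct framer *framer, struct framer_config *config);
int framer_set_config(struct framer *framer,
		      const struct framer_config *config);

int framer_notifier_register(struct framer *framer,
			     struct notifier_block *nb);
int framer_notifier_unregister(struct framer *framer,
			       struct notifier_block *nb);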
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20231128132534.258459-2-herve.codina@bootlin.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix various bugs / regressions for ext4, including a soft lockup, a
WARN_ON, and a BUG"
* tag 'ext4_for_linus-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: fix soft lockup in journal_finish_inode_data_buffers()
ext4: fix warning in ext4_dio_write_end_io()
jbd2: increase the journal IO's priority
jbd2: correct the printing of write_flags in jbd2_write_superblock()
ext4: prevent the normalized size from exceeding EXT_MAX_BLOCKS
|
|
Clock ID 166 was skipped by mistake, and two clocks:
* CLK_MOUT_CMU_HSI0_USBDPDGB
* CLK_GOUT_HSI0_USBDPDGB
have an incorrect DGB ending instead of DBG.
This is an ABI break, but as the patch was only applied yesterday and this
header has never been in an actual release, it seems better to fix this
early than ignore it.
Fixes: 0a910f160638 ("dt-bindings: clock: Add Google gs101 clock management unit bindings")
Signed-off-by: Peter Griffin <peter.griffin@linaro.org>
Reviewed-by: Chanwoo Choi <cw00.choi@samsung.com>
Reviewed-by: Sam Protsenko <semen.protsenko@linaro.org>
Reviewed-by: André Draszik <andre.draszik@linaro.org>
Link: https://lore.kernel.org/r/20231211162331.435900-7-peter.griffin@linaro.org
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
- Fix a couple of potential crashes, one introduced in 6.6 and one
in 5.10
- Fix misbehavior of virtiofs submounts on memory pressure
- Clarify naming in the uAPI for a recent feature
* tag 'fuse-fixes-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: disable FOPEN_PARALLEL_DIRECT_WRITES with FUSE_DIRECT_IO_ALLOW_MMAP
fuse: dax: set fc->dax to NULL in fuse_dax_conn_free()
fuse: share lookup state between submount and its parent
docs/fuse-io: Document the usage of DIRECT_IO_ALLOW_MMAP
fuse: Rename DIRECT_IO_RELAX to DIRECT_IO_ALLOW_MMAP
|
|
Xe is a new driver for Intel GPUs that supports both integrated and
discrete platforms starting with Tiger Lake (first Intel Xe Architecture).
The code is at a stage where it is already functional and has experimental
support for multiple platforms starting from Tiger Lake, with initial
support implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan
drivers), as well as in NEO (for OpenCL and Level0).
The new Xe driver leverages a lot from i915.
As for display, the intent is to share the display code with the i915
driver so that there is maximum reuse there. But it is not added
in this patch.
This initial work is a collaboration of many people and unfortunately
the big squashed patch won't fully honor the proper credits. But let's
get some git quick stats so we can at least try to preserve some of the
credits:
Co-developed-by: Matthew Brost <matthew.brost@intel.com>
Co-developed-by: Matthew Auld <matthew.auld@intel.com>
Co-developed-by: Matt Roper <matthew.d.roper@intel.com>
Co-developed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Co-developed-by: Francois Dugast <francois.dugast@intel.com>
Co-developed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Co-developed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Co-developed-by: Philippe Lecluse <philippe.lecluse@intel.com>
Co-developed-by: Nirmoy Das <nirmoy.das@intel.com>
Co-developed-by: Jani Nikula <jani.nikula@intel.com>
Co-developed-by: José Roberto de Souza <jose.souza@intel.com>
Co-developed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Co-developed-by: Dave Airlie <airlied@redhat.com>
Co-developed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Co-developed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Co-developed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
|
|
Running my yearly branch profiler to see where likely/unlikely annotation
may be added or removed, I discovered this:
 correct incorrect   %  Function                File    Line
 ------- ---------  --- --------                ----    ----
       0    457918  100 page_try_dup_anon_rmap  rmap.h   264
[..]
  458021         0    0 page_try_dup_anon_rmap  rmap.h   265
I thought it was interesting that line 264 of rmap.h had a 100% incorrect
annotation, but the line directly below it was 100% correct. Looking at the
code:
if (likely(!is_device_private_page(page) &&
	   unlikely(page_needs_cow_for_dma(vma, page))))
It didn't make sense. The "likely()" was around the entire if statement
(not just the "!is_device_private_page(page)"), which also included the
"unlikely()" portion of that if condition.
If the unlikely portion is unlikely to be true, that would make the entire
if condition unlikely to be true, so it made no sense at all to say the
entire if condition is likely to be true.
What is more likely to be likely is just the first part of the if statement
before the && operation. It's likely to be a misplaced parenthesis. And
after breaking the if condition into a likely() && unlikely() pair, both
annotations now appear to be correct!
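That is, the fix is just moving the closing parenthesis so each annotation
covers only its own clause:

if (likely(!is_device_private_page(page)) &&
    unlikely(page_needs_cow_for_dma(vma, page)))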
Link: https://lkml.kernel.org/r/20231201145936.5ddfdb50@gandalf.local.home
Fixes: fb3d824d1a46c ("mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: cleanup and use more folio in page fault", v3.
Rename page_copy_prealloc() to folio_prealloc(), which is used by more
functions, and also do more folio conversion in the page fault path.
This patch (of 5):
Since KSM only supports normal pages, and there is no swap-out/in of KSM
large folios either, add a large folio check in ksm_might_need_to_copy(),
and convert page->index to folio->index as page->index is going away.
Then convert ksm_might_need_to_copy() to use more of the folio API to
save nine compound_head() calls, and shorten 'address' to reduce the
maximum line length.
Link: https://lkml.kernel.org/r/20231118023232.1409103-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20231118023232.1409103-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: let users feed and tame/auto-tune DAMOS".
Introduce Aim-oriented Feedback-driven DAMOS Aggressiveness Auto-tuning.
It makes DAMOS self-tuned with periodic simple user feedback.
Background: DAMOS Control Difficulty
====================================
DAMOS helps users easily implement access pattern aware system operations.
However, controlling DAMOS in the wild is not that easy.
The basic way for DAMOS control is specifying the target access pattern.
In this approach, the user is assumed to well understand the access
pattern and the characteristics of the system and the workloads. Though
there are useful tools for that, it takes time and effort depending on the
complexity and the dynamicity of the system and the workloads. After all,
the access pattern consists of three ranges, namely the size, the access
rate, and the age of the regions. It means users need to tune six
parameters, which is anyway not a simple task.
One of the worst cases would be DAMOS being too aggressive like a
berserker, and therefore consuming too much system resource and making
unwanted radical system operations. To let users avoid such cases, DAMOS
allows users to set the upper-limit of the schemes' aggressiveness, namely
DAMOS quota. DAMOS further provides its best-effort under the limit by
prioritizing regions based on the access pattern of the regions. For
example, users can ask DAMOS to page out up to 100 MiB of memory regions
per second. Then DAMOS pages out regions that are not accessed for a
longer time (colder) first under the limit. This allows users to set the
target access pattern a bit naive with wider ranges, and focus on tuning
only one parameter, the quota. In other words, the number of parameters
to tune can be reduced from six to one.
Still, however, the optimum value for the quota depends on the system and
the workloads' characteristics, so not that simple. The number of
parameters to tune can also increase again if the user needs to run
multiple schemes.
Aim-oriented Feedback-driven DAMOS Aggressiveness Auto Tuning
=============================================================
Users would use DAMOS since they want to achieve something with it. They
will likely have measurable metrics representing the achievement and the
target number of the metric like SLO, and continuously measure that
anyway. While the additional cost of getting the information is nearly
zero, it could be useful for DAMOS to understand how appropriate its
current aggressiveness is set, and adjust it on its own to make the metric
value more close to the target.
Based on this idea, we introduce a new way of tuning DAMOS with nearly
zero additional effort, namely Aim-oriented Feedback-driven DAMOS
Aggressiveness Auto Tuning. It asks users to provide feedback
representing how well DAMOS is doing relative to the users' aim. Then
DAMOS adjusts its aggressiveness, specifically the quota that provides
the best effort result under the limit, based on the current level of
the aggressiveness and the users' feedback.
Implementation
==============
The implementation asks users to represent the feedback with score
numbers. The scores could be anything including user-space specific
metrics including latency and throughput of special user-space workloads,
and system metrics including free memory ratio, memory pressure stall time
(PSI), and active to inactive LRU lists size ratio. The feedback scores
and the aggressiveness of the given DAMOS scheme are assumed to be
positively proportional, though. Selecting metrics of the assumption is
the users' responsibility.
The core logic uses the below simple feedback loop algorithm to calculate
the next aggressiveness level of the scheme from the current
aggressiveness level and the current feedback (target_score and
current_score). It calculates the compensation for next aggressiveness as
a proportion of current aggressiveness and distance to the target score.
As a result, it arrives at the near-goal state in a short time using big
steps when it's far from the goal, but avoids making unnecessarily radical
changes that could turn out to be bad decisions, by using small steps when
it's near the goal.
f(n) = max(1, f(n - 1) * ((target_score - current_score) / target_score + 1))
Note that the compensation value becomes negative when it's
over-achieving the goal. That's why the feedback metric and the
aggressiveness of the scheme should be positively proportional. The
distance-adaptive speed manipulation is simply applied.
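A hedged sketch of that loop in C (names hypothetical), transcribing the
formula directly:

/* f(n) = max(1, f(n-1) * ((target - current) / target + 1)),
 * i.e. compensate proportionally to the distance from the goal. */
static unsigned long damos_next_quota_sketch(unsigned long last_quota,
					     unsigned long target_score,
					     unsigned long current_score)
{
	long compensation = (long)last_quota *
		((long)target_score - (long)current_score) /
		(long)target_score;

	return max(1L, (long)last_quota + compensation);
}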
Example Use Cases
=================
If users want to reduce the memory footprint of the system as much as
possible as long as the time spent for handling the resulting memory
pressure is within a threshold, they could use DAMOS scheme that reclaims
cold memory regions aiming for a little level of memory pressure stall
time.
If users want the active/inactive LRU lists well balanced to reduce the
performance impact due to possible future memory pressure, they could use
two schemes. The first one would be set to locate hot pages in the active
LRU list, aiming for a specific active-to-inactive LRU list size ratio,
say, 70%. The second one would be to locate cold pages in the inactive
LRU list, aiming for a specific inactive-to-active LRU list size ratio,
say, 30%. Then, DAMOS will balance the two schemes based on the goal and
feedback.
This aim-oriented auto tuning could also be useful for general
balancing-required access aware system operations such as system memory
auto scaling[3] and tiered memory management[4]. These two example usages
are not what current DAMOS implementation is already supporting, but
require additional DAMOS action developments, though.
Evaluation: subtle memory pressure aiming proactive reclamation
===============================================================
To show if the implementation works as expected, we prepare four different
system configurations on AWS i3.metal instances. The first setup
(original) runs the workload without any DAMOS scheme. The second setup
(not-tuned) runs the workload with a virtual address space-based proactive
reclamation scheme that pages out memory regions that are not accessed for
five seconds or more. The third setup (offline-tuned) runs the same
proactive reclamation DAMOS scheme, but after making it tuned for each
workload offline, using our previous user-space driven automatic tuning
approach, namely DAMOOS[1]. The fourth and final setup (AFDAA) runs the
scheme that is the same as that of 'not-tuned' setup, but aims to keep
0.5% of 'some' memory pressure stall time (PSI) for the last 10 seconds
using the aiming-oriented auto tuning.
For each setup, we run realistic workloads from PARSEC3 and SPLASH-2X
benchmark suites. For each run, we measure RSS and runtime of the
workload, and 'some' memory pressure stall time (PSI) of the system. We
repeat the runs five times and use averaged measurements.
For simple comparison of the results, we normalize the measurements to
those of 'original'. In the case of the PSI, though, the measurement for
'original' was zero, so we normalize the value to that of 'not-tuned'
scheme's result. The normalized results are shown below.
         Not-tuned          Offline-tuned      AFDAA
RSS      0.622688178226118  0.787950678944904  0.740093483278979
runtime  1.11767826657912   1.0564674983585    1.0910833880499
PSI      1                  0.727521443794069  0.308498846350299
The 'not-tuned' scheme achieves about 38.7% memory saving but incurs about
11.7% runtime slowdown. The 'offline-tuned' scheme achieves about 22.2%
memory saving with about 5.5% runtime slowdown. It also achieves about
28.2% memory pressure stall time saving. AFDAA achieves about 26% memory
saving with about 9.1% runtime slowdown. It also achieves about 69.1%
memory pressure stall time saving. We repeat this test multiple times,
and get consistent results. AFDAA is now integrated in our daily DAMON
performance test setup.
Apparently the aggressiveness of 'AFDAA' setup is somewhere between those
of 'not-tuned' and 'offline-tuned' setup, since its memory saving and
runtime overhead are between those of the other two setups. Actually we
set the memory pressure stall time goal aiming for this middle
aggressiveness. The difference in the two metrics is not significant,
though. However, it shows significant saving of the memory pressure stall
time, which was the goal of the auto-tuning, over the two variants.
Hence, we conclude the automatic tuning is working as expected.
Please note that the AFDAA setup is only for the evaluation, and is
therefore intentionally set to be a bit aggressive. It might not be
appropriate for production environments.
The test code is also available[2], so you could reproduce it on your
system and workloads.
Patches Sequence
================
The first four patches implement the core logic and user interfaces for
the auto tuning. The first patch implements the core logic for the auto
tuning, and the API for DAMOS users in the kernel space. The second
patch implements basic file operations of DAMON sysfs directories and
files that will be used for setting the goals and providing the
feedback. The third patch connects the quota goals files inputs to the
DAMOS core logic. Finally the fourth patch implements a dedicated DAMOS
sysfs command for efficiently committing the quota goals feedback.
Two patches for simple tests of the logic and interfaces follow. The
fifth patch implements the core logic unit test. The sixth patch
implements a selftest for the DAMON Sysfs interface for the goals.
Finally, three patches for documentation follow. The seventh patch
documents the design of the feature. The eighth patch updates the API
doc for the new sysfs files. The final ninth patch updates the usage
document for the features.
References
==========
[1] DAOS paper:
https://www.amazon.science/publications/daos-data-access-aware-operating-system
[2] Evaluation code:
https://github.com/damonitor/damon-tests/commit/3f884e61193f0166b8724554b6d06b0c449a712d
[3] Memory auto scaling RFC idea:
https://lore.kernel.org/damon/20231112195114.61474-1-sj@kernel.org/
[4] DAMON-based tiered memory management RFC idea:
https://lore.kernel.org/damon/20231112195602.61525-1-sj@kernel.org/
This patch (of 9):
Users can effectively control the upper-limit aggressiveness of DAMOS
schemes using the quota feature. The quota provides the best result under the
limit by prioritizing regions based on the access pattern. That said,
finding the best value, which could depend on dynamic characteristics of
the system and the workloads, is still challenging.
Implement a simple feedback-driven tuning mechanism and use it for
automatic tuning of DAMOS quota. The implementation allows users to
provide the feedback by setting a feedback score returning callback
function. Then DAMOS periodically calls the function back and adjusts the
quota based on the return value of the callback and current quota value.
Note that the absolute-value based time/size quotas still work as the
maximum hard limits of the scheme's aggressiveness. The feedback-driven
auto-tuned quota is applied only if it is not exceeding the manually set
maximum limits. The same holds for the scheme-target access pattern and
filters, like other features.
[sj@kernel.org: document get_score_arg field of struct damos_quota]
Link: https://lkml.kernel.org/r/20231204170106.60992-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20231130023652.50284-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20231130023652.50284-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Gow <davidgow@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e., evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
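As a hedged sketch of the second heuristic (names hypothetical): with
nr_backing compressed backing pages serving nr_stored stored pages,
freeable objects are scaled by roughly nr_backing / nr_stored, so a
well-compressed pool looks less attractive to the shrinker.

static unsigned long scale_by_savings(unsigned long nr_freeable,
				      unsigned long nr_backing,
				      unsigned long nr_stored)
{
	if (!nr_stored)
		return 0;

	/* The better the compression (nr_backing << nr_stored), the
	 * less memory each writeback frees, so shrink less. */
	return nr_freeable * nr_backing / nr_stored;
}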
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improve the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since zswap now writes back pages from memcg-specific LRUs, we now need a
new stat to show writebacks count for each memcg.
[nphamcs@gmail.com: rename ZSWP_WB to ZSWPWB]
Link: https://lkml.kernel.org/r/20231205193307.2432803-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-5-nphamcs@gmail.com
Suggested-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, we only have a single global LRU for zswap. This makes it
impossible to perform workload-specific shrinking - a memcg cannot
determine which pages in the pool it owns, and often ends up writing pages
from other memcgs. This issue has been previously observed in practice
and mitigated by simply disabling memcg-initiated shrinking:
https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
This patch fully resolves the issue by replacing the global zswap LRU
with memcg- and NUMA-specific LRUs, and modifying the reclaim logic:
a) When a store attempt hits a memcg limit, it now triggers a
synchronous reclaim attempt that, if successful, allows the new
hotter page to be accepted by zswap.
b) If the store attempt instead hits the global zswap limit, it will
trigger an asynchronous reclaim attempt, in which a memcg is
selected for reclaim in a round-robin-like fashion.
[nphamcs@gmail.com: use correct function for the onlineness check, use mem_cgroup_iter_break()]
Link: https://lkml.kernel.org/r/20231205195419.2563217-1-nphamcs@gmail.com
[nphamcs@gmail.com: drop the pool's reference at the end of the writeback step]
Link: https://lkml.kernel.org/r/20231206030627.4155634-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-4-nphamcs@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Co-developed-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This patch implements a helper function that tries to get a reference to a
memcg's css, as well as checking if it is online. This new function is
almost exactly the same as the existing mem_cgroup_tryget(), except for
the onlineness check. In the !CONFIG_MEMCG case, it always returns true,
analogous to mem_cgroup_tryget(). This is useful, e.g., for the new zswap
writeback scheme, where we need to select the next online memcg as a
candidate for the global limit reclaim.
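A hedged sketch of the helper (name assumed from its description as an
online-checking mem_cgroup_tryget()):

#ifdef CONFIG_MEMCG
static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg)
{
	/* Like mem_cgroup_tryget(), but fail for offlined memcgs. */
	return memcg && css_tryget_online(&memcg->css);
}
#else
static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg)
{
	return true;
}
#endif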
Link: https://lkml.kernel.org/r/20231130194023.4102148-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "workload-specific and memory pressure-driven zswap
writeback", v8.
There are currently several issues with zswap writeback:
1. There is only a single global LRU for zswap, making it impossible to
perform workload-specific shrinking - a memcg under memory pressure
cannot determine which pages in the pool it owns, and often ends up
writing pages from other memcgs. This issue has been previously
observed in practice and mitigated by simply disabling
memcg-initiated shrinking:
https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
But this solution leaves a lot to be desired, as we still do not
have an avenue for a memcg to free up its own memory locked up in
the zswap pool.
2. We only shrink the zswap pool when the user-defined limit is hit.
This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed
ahead of time, as this depends on the workload (specifically, on
factors such as memory access patterns and compressibility of the
memory pages).
This patch series solves these issues by separating the global zswap LRU
into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e
memcg- and NUMA-aware) zswap writeback under memory pressure. The new
shrinker does not have any parameter that must be tuned by the user, and
can be opted in or out on a per-memcg basis.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improve the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
This patch (of 6):
The interface of list_lru is based on the assumption that the list node
and the data it represents belong to the same object, allocated on the
correct node/memcg. While this assumption is valid for existing slab
object LRUs such as dentries and inodes, it is undocumented, and rather
inflexible for
certain potential list_lru users (such as the upcoming zswap shrinker and
the THP shrinker). It has caused us a lot of issues during our
development.
This patch changes list_lru interface so that the caller must explicitly
specify numa node and memcg when adding and removing objects. The old
list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
list_lru_del_obj(), respectively.
It also extends the list_lru API with a new function, list_lru_putback,
which undoes a previous list_lru_isolate call. Unlike list_lru_add, it
does not increment the LRU node count (as list_lru_isolate does not
decrement the node count). list_lru_putback also allows for explicit
memcg and NUMA node selection.
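A hedged sketch of the resulting API (prototypes assumed from the
description):

/* Explicit node/memcg variants: */
bool list_lru_add(struct list_lru *lru, struct list_head *item,
		  int nid, struct mem_cgroup *memcg);
bool list_lru_del(struct list_lru *lru, struct list_head *item,
		  int nid, struct mem_cgroup *memcg);

/* Renamed wrappers keeping the old derive-from-object behaviour,
 * for slab objects such as dentries and inodes: */
bool list_lru_add_obj(struct list_lru *lru, struct list_head *item);
bool list_lru_del_obj(struct list_lru *lru, struct list_head *item);

/* Undo a previous list_lru_isolate() without touching the node
 * count, with explicit node/memcg selection: */
void list_lru_putback(struct list_lru *lru, struct list_head *item,
		      int nid, struct mem_cgroup *memcg);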
Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
ma_wr_state was previously tracking the end of the node for writing.
Since the implementation of the ma_state end tracking, this is duplicated
work. This patch removes the maple write state tracking of the end of the
node and uses the maple state end instead.
Link: https://lkml.kernel.org/r/20231101171629.3612299-11-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The maple tree node is overloaded to keep status as well as the active
node. This, unfortunately, results in a re-walk on underflow or overflow.
Since the maple state has room, the status can be placed in its own enum
in the structure. Once an underflow/overflow is detected, certain modes
can restore the status to active and others may need to re-walk just that
one node to see the entry.
The status being an enum has the benefit of detecting unhandled status in
switch statements.
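A hedged sketch of the separation (enum values illustrative): the status
gets its own enum and field in struct ma_state instead of being encoded
in the node pointer.

enum maple_status {
	ma_active,
	ma_start,
	ma_root,
	ma_none,
	ma_pause,
	ma_overflow,
	ma_underflow,
	ma_error,
};

struct ma_state {
	/* ... */
	struct maple_enode *node;	/* node only, no status bits */
	enum maple_status status;
};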
[Liam.Howlett@oracle.com: fix comments about MAS_*]
Link: https://lkml.kernel.org/r/20231106154124.614247-1-Liam.Howlett@oracle.com
[Liam.Howlett@oracle.com: update forking to separate maple state and node]
Link: https://lkml.kernel.org/r/20231106154551.615042-1-Liam.Howlett@oracle.com
[Liam.Howlett@oracle.com: fix mas_prev() state separation code]
Link: https://lkml.kernel.org/r/20231207193319.4025462-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20231101171629.3612299-9-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|