Age | Commit message (Collapse) | Author |
|
Instead of always falling back to memcpy_fromio() for any size, prefer
using read{b,w,l}(). When reading struct members it's common to read
individual integer variables individually. Going through memcpy_fromio()
for each of them poses a high penalty.
Employ a similar trick as __seqprop() by using _Generic() to generate
only the specific call based on a type-compatible variable.
For a pariticular i915 workload producing GPU context switches,
__get_engine_usage_record() is particularly hot since the engine usage
is read from device local memory with dgfx, possibly multiple times
since it's racy. Test execution time for this test shows a ~12.5%
improvement with DG2:
Before:
nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07;
median = 9.52548e+06; var = 526149;
After:
nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06;
median = 8.33955e+06; var = 333113;
Other things attempted that didn't prove very useful:
1) Change the _Generic() on x86 to just dereference the memory address
2) Change __get_engine_usage_record() to do just 1 read per loop,
comparing with the previous value read
3) Change __get_engine_usage_record() to access the fields directly as it
was before the conversion to iosys-map
(3) did gave a small improvement (~3%), but doesn't seem to scale well
to other similar cases in the driver.
Additional test by Chris Wilson using gem_create from igt with some
changes to track object creation time. This happens to accidentally
stress this code path:
Pre iosys_map conversion of engine busyness:
lmem0: Creating 262144 4KiB objects took 59274.2ms
Unpatched:
lmem0: Creating 262144 4KiB objects took 108830.2ms
With readl (this patch):
lmem0: Creating 262144 4KiB objects took 61348.6ms
s/readl/READ_ONCE/
lmem0: Creating 262144 4KiB objects took 61333.2ms
So we do take a little bit more time than before the conversion, but
that is due to other factors: bringing the READ_ONCE back would be as
good as just doing this conversion.
v2:
- Remove default from _Generic() - callers wanting to read more
than u64 should use iosys_map_memcpy_from()
- Add READ_ONCE() cases dereferencing the pointer when using system
memory
v3:
- Fix precedence issue when casting inside READ_ONCE(). By not using ()
around vaddr__ the offset was not part of the cast, but rather added
to it, producing a wrong address
- Remove compiletime_assert() as READ_ONCE() already contains it
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com> # v1
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20220628191016.3899428-1-lucas.demarchi@intel.com
|
|
Context tracking is going to be used not only to track user transitions
but also idle/IRQs/NMIs. The user tracking part will then become a
separate feature. Prepare Kconfig for that.
[ frederic: Apply Max Filippov feedback. ]
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
|
|
context_tracking_cpu_set() is called in order to tell a CPU to track
user/kernel transitions. Since context tracking is going to expand in
to also track transitions from/to idle/IRQ/NMIs, the scope
of this function name becomes too broad and needs to be made more
specific. Also shorten the prefix to align with the new namespace.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
|
|
context_tracking_enter() and context_tracking_exit() have confusing
names that don't explain the fact they are referring to user/guest state.
Use more self-explanatory names and shrink to the new context tracking
prefix instead.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
|
|
user_enter/exit_callable()
context_tracking_user_enter() and context_tracking_user_exit() are
ASM callable versions of user_enter() and user_exit() for architectures
that didn't manage to check the context tracking static key from ASM.
Change those function names to better reflect their purpose.
[ frederic: Apply Max Filippov feedback. ]
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
|
|
I don't see how to make it nice without introducing btf id lists
for the hooks where these helpers are allowed. Some LSM hooks
work on the locked sockets, some are triggering early and
don't grab any locks, so have two lists for now:
1. LSM hooks which trigger under socket lock - minority of the hooks,
but ideal case for us, we can expose existing BTF-based helpers
2. LSM hooks which trigger without socket lock, but they trigger
early in the socket creation path where it should be safe to
do setsockopt without any locks
3. The rest are prohibited. I'm thinking that this use-case might
be a good gateway to sleeping lsm cgroup hooks in the future.
We can either expose lock/unlock operations (and add tracking
to the verifier) or have another set of bpf_setsockopt
wrapper that grab the locks and might sleep.
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-7-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Previous patch adds 1:1 mapping between all 211 LSM hooks
and bpf_cgroup program array. Instead of reserving a slot per
possible hook, reserve 10 slots per cgroup for lsm programs.
Those slots are dynamically allocated on demand and reclaimed.
struct cgroup_bpf {
struct bpf_prog_array * effective[33]; /* 0 264 */
/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
struct hlist_head progs[33]; /* 264 264 */
/* --- cacheline 8 boundary (512 bytes) was 16 bytes ago --- */
u8 flags[33]; /* 528 33 */
/* XXX 7 bytes hole, try to pack */
struct list_head storages; /* 568 16 */
/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
struct bpf_prog_array * inactive; /* 584 8 */
struct percpu_ref refcnt; /* 592 16 */
struct work_struct release_work; /* 608 72 */
/* size: 680, cachelines: 11, members: 7 */
/* sum members: 673, holes: 1, sum holes: 7 */
/* last cacheline: 40 bytes */
};
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-5-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Allow attaching to lsm hooks in the cgroup context.
Attaching to per-cgroup LSM works exactly like attaching
to other per-cgroup hooks. New BPF_LSM_CGROUP is added
to trigger new mode; the actual lsm hook we attach to is
signaled via existing attach_btf_id.
For the hooks that have 'struct socket' or 'struct sock' as its first
argument, we use the cgroup associated with that socket. For the rest,
we use 'current' cgroup (this is all on default hierarchy == v2 only).
Note that for some hooks that work on 'struct sock' we still
take the cgroup from 'current' because some of them work on the socket
that hasn't been properly initialized yet.
Behind the scenes, we allocate a shim program that is attached
to the trampoline and runs cgroup effective BPF programs array.
This shim has some rudimentary ref counting and can be shared
between several programs attaching to the same lsm hook from
different cgroups.
Note that this patch bloats cgroup size because we add 211
cgroup_bpf_attach_type(s) for simplicity sake. This will be
addressed in the subsequent patch.
Also note that we only add non-sleepable flavor for now. To enable
sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
shim programs have to be freed via trace rcu, cgroup_bpf.effective
should be also trace-rcu-managed + maybe some other changes that
I'm not aware of.
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This lets us reclaim some space to be used by new cgroup lsm slots.
Before:
struct cgroup_bpf {
struct bpf_prog_array * effective[23]; /* 0 184 */
/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
struct list_head progs[23]; /* 184 368 */
/* --- cacheline 8 boundary (512 bytes) was 40 bytes ago --- */
u32 flags[23]; /* 552 92 */
/* XXX 4 bytes hole, try to pack */
/* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
struct list_head storages; /* 648 16 */
struct bpf_prog_array * inactive; /* 664 8 */
struct percpu_ref refcnt; /* 672 16 */
struct work_struct release_work; /* 688 32 */
/* size: 720, cachelines: 12, members: 7 */
/* sum members: 716, holes: 1, sum holes: 4 */
/* last cacheline: 16 bytes */
};
After:
struct cgroup_bpf {
struct bpf_prog_array * effective[23]; /* 0 184 */
/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
struct hlist_head progs[23]; /* 184 184 */
/* --- cacheline 5 boundary (320 bytes) was 48 bytes ago --- */
u8 flags[23]; /* 368 23 */
/* XXX 1 byte hole, try to pack */
/* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
struct list_head storages; /* 392 16 */
struct bpf_prog_array * inactive; /* 408 8 */
struct percpu_ref refcnt; /* 416 16 */
struct work_struct release_work; /* 432 72 */
/* size: 504, cachelines: 8, members: 7 */
/* sum members: 503, holes: 1, sum holes: 1 */
/* last cacheline: 56 bytes */
};
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-3-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
I'll be adding lsm cgroup specific helpers that grab
trampoline mutex.
No functional changes.
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-2-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
On a devfreq PROBE_DEFER, the freq_table in the driver profile struct,
is never reset and may be leaved in an undefined state.
This comes from the fact that we store the freq_table in the driver
profile struct that is commonly defined as static and not reset on
PROBE_DEFER.
We currently skip the reinit of the freq_table if we found
it's already defined since a driver may declare his own freq_table.
This logic is flawed in the case devfreq core generate a freq_table, set
it in the profile struct and then PROBE_DEFER, freeing the freq_table.
In this case devfreq will found a NOT NULL freq_table that has been
freed, skip the freq_table generation and probe the driver based on the
wrong table.
To fix this and correctly handle PROBE_DEFER, use a local freq_table and
max_state in the devfreq struct and never modify the freq_table present
in the profile struct if it does provide it.
Fixes: 0ec09ac2cebe ("PM / devfreq: Set the freq_table of devfreq device")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
|
|
Without that, the parameter is not properly parsed:
include/linux/usb/typec_altmode.h:132: warning: Function parameter or member 'altmode' not described in 'typec_altmode_get_orientation'
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://lore.kernel.org/r/70dc4c5d744cf1fe9a0efe6b85deaa0489628282.1656409369.git.mchehab@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This flag is a bit of a hack and the same thing can be accomplished
using a custom ->get_irq_reg() callback. Add a warning to catch any
use of the flag.
Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com>
Link: https://lore.kernel.org/r/20220623211420.918875-13-aidanmacdonald.0x0@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Replace the internal sub_irq_reg() function with a public callback
that drivers can use when they have more complex register layouts.
The default implementation is regmap_irq_get_irq_reg_linear(), used
if the chip doesn't provide its own callback.
Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com>
Link: https://lore.kernel.org/r/20220623211420.918875-12-aidanmacdonald.0x0@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
To me "unmask" suggests that we write 1s to the register when
an interrupt is enabled. This also makes sense because it's the
opposite of what the "mask" register does (write 1s to disable
an interrupt).
But regmap-irq does the opposite: for a disabled interrupt, it
writes 1s to "unmask" and 0s to "mask". This is surprising and
deviates from the usual way mask registers are handled.
Additionally, mask_invert didn't interact with unmask registers
properly -- it caused them to be ignored entirely.
Fix this by making mask and unmask registers orthogonal, using
the following behavior:
* Mask registers are written with 1s for disabled interrupts.
* Unmask registers are written with 1s for enabled interrupts.
This behavior supports both normal or inverted mask registers
and separate set/clear registers via different combinations of
mask_base/unmask_base.
The old unmask register behavior is deprecated. Drivers need to
opt-in to the new behavior by setting mask_unmask_non_inverted.
Warnings are issued if the driver relies on deprecated behavior.
Chips that only set one of mask_base/unmask_base don't have to
use the mask_unmask_non_inverted flag because that use case was
previously not supported.
The mask_invert flag is also deprecated in favor of describing
inverted mask registers as unmask registers.
Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com>
Link: https://lore.kernel.org/r/20220623211420.918875-11-aidanmacdonald.0x0@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Config registers can be used to replace both type and virtual
registers, so mark both features are deprecated and issue a
warning if they're used.
Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com>
Link: https://lore.kernel.org/r/20220623211420.918875-10-aidanmacdonald.0x0@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Config registers provide a more uniform approach to handling irq type
registers. They are essentially an extension of the virtual registers
used by the qcom-pm8008 driver.
Config registers can be represented as a 2D array:
config_base[0] reg0,0 reg0,1 reg0,2 reg0,3
config_base[1] reg1,0 reg1,1 reg1,2 reg1,3
config_base[2] reg2,0 reg2,1 reg2,2 reg2,3
There are 'num_config_bases' base registers, each of which is used to
address 'num_config_regs' registers. The addresses are calculated in
the same way as for other bases. It is assumed that an irq's type is
controlled by one column of registers; that column is identified by
the irq's 'type_reg_offset'.
The set_type_config() callback is responsible for updating the config
register contents. It receives an array of buffers (each represents a
row of registers) and the index of the column to update, along with
the 'struct regmap_irq' description and requested irq type.
Buffered values are written to registers in regmap_irq_sync_unlock().
Note that the entire register contents are overwritten, which is a
minor change in behavior from type registers via 'type_base'.
Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com>
Link: https://lore.kernel.org/r/20220623211420.918875-9-aidanmacdonald.0x0@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Commit a71411dbf6c8 ("regmap: irq: add chip option mask_writeonly")
introduced the mask_writeonly option, but it isn't used now and it
appears it's never been used by any in-tree drivers. The motivation
for the option is mentioned in the commit message,
Some irq controllers have writeonly/multipurpose register
layouts. In those cases we read invalid data back. [...]
The option causes mask register updates to use regmap_write_bits()
instead of regmap_update_bits().
However, regmap_write_bits() doesn't solve the reading invalid data
problem. It's still a read-modify-write op like regmap_update_bits().
The difference is that 'update bits' will only write the new value
if it is different from the current value, while 'write bits' will
write the new value unconditionally, even if it's the same as the
current value.
This seems like a bit of a specialized use case and probably isn't
that useful for regmap-irq, so let's just remove the option and go
back to using an 'update bits' op for the mask registers. We can
always add the option back if some driver ends up needing it in the
future.
Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com>
Link: https://lore.kernel.org/r/20220623211420.918875-7-aidanmacdonald.0x0@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Check types_supported instead of checking type_rising/falling_val
when using type_in_mask interrupts. This makes the intent clearer
and allows a type_in_mask irq to support level or edge triggers,
rather than only edge triggers.
Update the documentation and comments to reflect the new behavior.
This shouldn't affect existing drivers, because if they didn't
set types_supported properly the type buffer wouldn't be updated.
Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com>
Link: https://lore.kernel.org/r/20220623211420.918875-5-aidanmacdonald.0x0@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
It appears that no chip ever required a nonzero type_reg_stride
and commit 1066cfbdfa3f ("regmap-irq: Extend sub-irq to support
non-fixed reg strides") broke support. Just remove the field.
Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com>
Link: https://lore.kernel.org/r/20220623211420.918875-3-aidanmacdonald.0x0@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Use 'unsigned int' for bitfields for consistency with most other
kernel code.
Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com>
Link: https://lore.kernel.org/r/20220623211420.918875-2-aidanmacdonald.0x0@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Independent access ranges only matter for file system I/O and are only
valid with a registered gendisk, so move them there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220629062013.1331068-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The kernel documentation added for the new helpers had a few tiny
ordering issues. Fix them.
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
Most likely due to copy-paste mistake the device managed version of the
denoted reset control getter has been implemented with invalid semantic,
which can be immediately spotted by having "WARN_ON(shared && acquired)"
warning in the system log as soon as the method is called. Anyway let's
fix it by altering the boolean arguments passed to the
__devm_reset_control_bulk_get() method from
- shared = true, optional = false, acquired = true
to
+ shared = false, optional = true, acquired = true
That's what they were supposed to be in the first place (see the non-devm
version of the same method: reset_control_bulk_get_optional_exclusive()).
Fixes: 48d71395896d ("reset: Add reset_control_bulk API")
Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru>
Reviewed-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Link: https://lore.kernel.org/r/20220624141853.7417-2-Sergey.Semin@baikalelectronics.ru
|
|
The 'extern' keyword is not necessary and removing it lets us shorten
some lines.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
All callers of find_get_pages_range(), pagevec_lookup_range() and
pagevec_lookup() have now been removed.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
Change the guts of check_move_unevictable_pages() over to use folios
and add check_move_unevictable_pages() as a wrapper.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
This is the equivalent of find_get_pages() but fills a folio_batch
instead of an array of pages.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
These functions have no more users, so delete them.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
|
|
This wrapper is no longer used. Remove it and all references to it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
SGI interrupt register and reset is performed by EEMI ioctl
IOCTL_REGISTER_SGI. However, this is not correct use of EEMI call.
SGI registration functionality does not qualify as energy management
activity and so shouldn't be mapped to EEMI call.
This new call will replace IOCTL_REGISTER_SGI and will be handled by TF-A
specific handler in TF-A. To maintain backward compatibility for a while
firmware driver will still use IOCTL_REGISTER_SGI as fallback strategy if
new call fails or is not supported by TF-A.
This new design also helps to make TF-A as pass through layer for EEMI
calls. So we don't have to maintain PM_IOCTL as EEMI API ID in TF-A.
Signed-off-by: Tanmay Shah <tanmay.shah@xilinx.com>
Acked-by: Michal Simek <michal.simek@amd.com>
Link: https://lore.kernel.org/r/20220607224253.54919-1-tanmay.shah@xilinx.com
Signed-off-by: Michal Simek <michal.simek@amd.com>
|
|
The old non-type safe helpers will soon be removed.
Cc: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
The HAS_UNMAPPED_ID() helper is fully self contained so we can port it
to vfs{g,u}id_t without much effort.
Cc: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
This can be used by subsystems to unregister a platform device registered
by sysfb and also to disable future platform device registration in sysfb.
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20220607182338.344270-3-javierm@redhat.com
|
|
This function just returned 0 on success or an errno code on error, but it
could be useful for sysfb_init() callers to have a pointer to the device.
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20220607182338.344270-2-javierm@redhat.com
|
|
Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can
result in a short copy. In that case we need to trim the unused
buffers, as well as the length of partially filled one - it's not
enough to set ->head, ->iov_offset and ->count to reflect how
much had we copied. Not hard to fix, fortunately...
I'd put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h,
rather than iov_iter.c - it has nothing to do with iov_iter and
having it will allow us to avoid an ugly kludge in fs/splice.c.
We could put it into lib/iov_iter.c for now and move it later,
but I don't see the point going that way...
Cc: stable@kernel.org # 4.19+
Fixes: ca146f6f091e "lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()"
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Currently, the arch_efi_call_virt() assumes all users of it will have
defined a type 'efi_##f##_t' to make use of it.
Simplify the arch_efi_call_virt() macro by eliminating the explicit
need for efi_##f##_t type for every user of this macro.
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
[ardb: apply Sudeep's ARM fix to i686, Loongarch and RISC-V too]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
Pass a gendisk to the sysfs register/unregister functions and give
them descriptive names. Also move the unregistration helper next
to the one doing the registration.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220628171850.1313069-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add the trace attributes to the default gendisk attributes, just like
we already do for partitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220628171850.1313069-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Fix a divide 0 error in rdma_dim_stats_compare() when prev->cpe_ratio ==
0.
CallTrace:
Hardware name: H3C R4900 G3/RS33M2C9S, BIOS 2.00.37P21 03/12/2020
task: ffff880194b78000 task.stack: ffffc90006714000
RIP: 0010:backport_rdma_dim+0x10e/0x240 [mlx_compat]
RSP: 0018:ffff880c10e83ec0 EFLAGS: 00010202
RAX: 0000000000002710 RBX: ffff88096cd7f780 RCX: 0000000000000064
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000001
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 000000001d7c6c09
R13: ffff88096cd7f780 R14: ffff880b174fe800 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff880c10e80000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000a0965b00 CR3: 000000000200a003 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<IRQ>
ib_poll_handler+0x43/0x80 [ib_core]
irq_poll_softirq+0xae/0x110
__do_softirq+0xd1/0x28c
irq_exit+0xde/0xf0
do_IRQ+0x54/0xe0
common_interrupt+0x8f/0x8f
</IRQ>
? cpuidle_enter_state+0xd9/0x2a0
? cpuidle_enter_state+0xc7/0x2a0
? do_idle+0x170/0x1d0
? cpu_startup_entry+0x6f/0x80
? start_secondary+0x1b9/0x210
? secondary_startup_64+0xa5/0xb0
Code: 0f 87 e1 00 00 00 8b 4c 24 14 44 8b 43 14 89 c8 4d 63 c8 44 29 c0 99 31 d0 29 d0 31 d2 48 98 48 8d 04 80 48 8d 04 80 48 c1 e0 02 <49> f7 f1 48 83 f8 0a 0f 86 c1 00 00 00 44 39 c1 7f 10 48 89 df
RIP: backport_rdma_dim+0x10e/0x240 [mlx_compat] RSP: ffff880c10e83ec0
Fixes: f4915455dcf0 ("linux/dim: Implement RDMA adaptive moderation (DIM)")
Link: https://lore.kernel.org/r/20220627140004.3099-1-thomas.liu@ucloud.cn
Signed-off-by: Tao Liu <thomas.liu@ucloud.cn>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
blk_cleanup_disk is nothing but a trivial wrapper for put_disk now,
so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for
all disks that do not have separately allocated queues, and thus remove
the need to call blk_cleanup_queue for them.
Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that
this function is intended only for separately allocated blk-mq queues.
This saves an extra queue freeze for devices without a separately
allocated queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Disallow setting the blk-mq state on any queue that is already dying as
setting the state even then is a bad idea, and remove the now unused
QUEUE_FLAG_DEAD flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
document n_ramp_values field at struct regulator_desc, in order
to solve this warning:
include/linux/regulator/driver.h:434: warning: Function parameter or member 'n_ramp_values' not described in 'regulator_desc'
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Link: https://lore.kernel.org/r/15efc16e878aa327aa2769023bcdf959a795f41d.1656409369.git.mchehab@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Commit ceaf69f8eadc ("fanotify: do not allow setting dirent events in
mask of non-dir") added restrictions about setting dirent events in the
mask of a non-dir inode mark, which does not make any sense.
For backward compatibility, these restictions were added only to new
(v5.17+) APIs.
It also does not make any sense to set the flags FAN_EVENT_ON_CHILD or
FAN_ONDIR in the mask of a non-dir inode. Add these flags to the
dir-only restriction of the new APIs as well.
Move the check of the dir-only flags for new APIs into the helper
fanotify_events_supported(), which is only called for FAN_MARK_ADD,
because there is no need to error on an attempt to remove the dir-only
flags from non-dir inode.
Fixes: ceaf69f8eadc ("fanotify: do not allow setting dirent events in mask of non-dir")
Link: https://lore.kernel.org/linux-fsdevel/20220627113224.kr2725conevh53u4@quack3.lan/
Link: https://lore.kernel.org/r/20220627174719.2838175-1-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Add configuration values(enable/disable) for tri-state parameter.
Signed-off-by: Sai Krishna Potthuri <lakshmi.sai.krishna.potthuri@xilinx.com>
Link: https://lore.kernel.org/r/1655462819-28801-2-git-send-email-lakshmi.sai.krishna.potthuri@xilinx.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
effective_cpu_util() already has a `int cpu' parameter which allows to
retrieve the CPU capacity scale factor (or maximum CPU capacity) inside
this function via an arch_scale_cpu_capacity(cpu).
A lot of code calling effective_cpu_util() (or the shim
sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
arch_scale_cpu_capacity() already.
But not having to pass it into effective_cpu_util() will make the EAS
wake-up code easier, especially when the maximum CPU capacity reduced
by the thermal pressure is passed through the EAS wake-up functions.
Due to the asymmetric CPU capacity support of arm/arm64 architectures,
arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via
per_cpu(cpu_scale, cpu) on such a system.
On all other architectures it is a a compile-time constant
(SCHED_CAPACITY_SCALE).
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com
|
|
Sometimes we want to know an accurate number of samples even if it's
lost. Currenlty PERF_RECORD_LOST is generated for a ring-buffer which
might be shared with other events. So it's hard to know per-event
lost count.
Add event->lost_samples field and PERF_FORMAT_LOST to retrieve it from
userspace.
Original-patch-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220616180623.1358843-1-namhyung@kernel.org
|
|
[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU,
when the system is overloaded.
The following histogram is the time spent in select_idle_cpu(),
when running 224 instances of netperf on a system with 112 CPUs
per LLC domain:
@usecs:
[0] 533 | |
[1] 5495 | |
[2, 4) 12008 | |
[4, 8) 239252 | |
[8, 16) 4041924 |@@@@@@@@@@@@@@ |
[16, 32) 12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128) 13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[128, 256) 8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 4507667 |@@@@@@@@@@@@@@@ |
[512, 1K) 2600472 |@@@@@@@@@ |
[1K, 2K) 927912 |@@@ |
[2K, 4K) 218720 | |
[4K, 8K) 98161 | |
[8K, 16K) 37722 | |
[16K, 32K) 6715 | |
[32K, 64K) 477 | |
[64K, 128K) 7 | |
netperf latency usecs:
=======
case load Lat_99th std%
TCP_RR thread-224 257.39 ( 0.21)
The time spent in select_idle_cpu() is visible to netperf and might have a negative
impact.
[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency
of select_idle_sibling. Copy the indicators here:
SIS Search Efficiency(se_eff%):
A ratio expressed as a percentage of runqueues scanned versus
idle CPUs found. A 100% efficiency indicates that the target,
prev or recent CPU of a task was idle at wakeup. The lower the
efficiency, the more runqueues were scanned before an idle CPU
was found.
SIS Domain Search Efficiency(dom_eff%):
Similar, except only for the slower SIS
patch.
SIS Fast Success Rate(fast_rate%):
Percentage of SIS that used target, prev or
recent CPUs.
SIS Success rate(success_rate%):
Percentage of scans that found an idle CPU.
The test is based on Aubrey's schedtests tool, including netperf, hackbench,
schbench and tbench.
Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case load se_eff% dom_eff% fast_rate% success_rate%
TCP_RR 28 threads 99.978 18.535 99.995 100.000
TCP_RR 56 threads 99.397 5.671 99.964 100.000
TCP_RR 84 threads 21.721 6.818 73.632 100.000
TCP_RR 112 threads 12.500 5.533 59.000 100.000
TCP_RR 140 threads 8.524 4.535 49.020 100.000
TCP_RR 168 threads 6.438 3.945 40.309 99.999
TCP_RR 196 threads 5.397 3.718 32.320 99.982
TCP_RR 224 threads 4.874 3.661 25.775 99.767
UDP_RR 28 threads 99.988 17.704 99.997 100.000
UDP_RR 56 threads 99.528 5.977 99.970 100.000
UDP_RR 84 threads 24.219 6.992 76.479 100.000
UDP_RR 112 threads 13.907 5.706 62.538 100.000
UDP_RR 140 threads 9.408 4.699 52.519 100.000
UDP_RR 168 threads 7.095 4.077 44.352 100.000
UDP_RR 196 threads 5.757 3.775 35.764 99.991
UDP_RR 224 threads 5.124 3.704 28.748 99.860
schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case load se_eff% dom_eff% fast_rate% success_rate%
normal 1 mthread 99.152 6.400 99.941 100.000
normal 2 mthreads 97.844 4.003 99.908 100.000
normal 3 mthreads 96.395 2.118 99.917 99.998
normal 4 mthreads 55.288 1.451 98.615 99.804
normal 5 mthreads 7.004 1.870 45.597 61.036
normal 6 mthreads 3.354 1.346 20.777 34.230
normal 7 mthreads 2.183 1.028 11.257 21.055
normal 8 mthreads 1.653 0.825 7.849 15.549
schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case load se_eff% dom_eff% fast_rate% success_rate%
process-pipe 1 group 99.991 7.692 99.999 100.000
process-pipe 2 groups 99.934 4.615 99.997 100.000
process-pipe 3 groups 99.597 3.198 99.987 100.000
process-pipe 4 groups 98.378 2.464 99.958 100.000
process-pipe 5 groups 27.474 3.653 89.811 99.800
process-pipe 6 groups 20.201 4.098 82.763 99.570
process-pipe 7 groups 16.423 4.156 77.398 99.316
process-pipe 8 groups 13.165 3.920 72.232 98.828
process-sockets 1 group 99.977 5.882 99.999 100.000
process-sockets 2 groups 99.927 5.505 99.996 100.000
process-sockets 3 groups 99.397 3.250 99.980 100.000
process-sockets 4 groups 79.680 4.258 98.864 99.998
process-sockets 5 groups 7.673 2.503 63.659 92.115
process-sockets 6 groups 4.642 1.584 58.946 88.048
process-sockets 7 groups 3.493 1.379 49.816 81.164
process-sockets 8 groups 3.015 1.407 40.845 75.500
threads-pipe 1 group 99.997 0.000 100.000 100.000
threads-pipe 2 groups 99.894 2.932 99.997 100.000
threads-pipe 3 groups 99.611 4.117 99.983 100.000
threads-pipe 4 groups 97.703 2.624 99.937 100.000
threads-pipe 5 groups 22.919 3.623 87.150 99.764
threads-pipe 6 groups 18.016 4.038 80.491 99.557
threads-pipe 7 groups 14.663 3.991 75.239 99.247
threads-pipe 8 groups 12.242 3.808 70.651 98.644
threads-sockets 1 group 99.990 6.667 99.999 100.000
threads-sockets 2 groups 99.940 5.114 99.997 100.000
threads-sockets 3 groups 99.469 4.115 99.977 100.000
threads-sockets 4 groups 87.528 4.038 99.400 100.000
threads-sockets 5 groups 6.942 2.398 59.244 88.337
threads-sockets 6 groups 4.359 1.954 49.448 87.860
threads-sockets 7 groups 2.845 1.345 41.198 77.102
threads-sockets 8 groups 2.871 1.404 38.512 74.312
schedstat_parse.py -f tbench_vanilla.log
case load se_eff% dom_eff% fast_rate% success_rate%
loopback 28 threads 99.976 18.369 99.995 100.000
loopback 56 threads 99.222 7.799 99.934 100.000
loopback 84 threads 19.723 6.819 70.215 100.000
loopback 112 threads 11.283 5.371 55.371 99.999
loopback 140 threads 0.000 0.000 0.000 0.000
loopback 168 threads 0.000 0.000 0.000 0.000
loopback 196 threads 0.000 0.000 0.000 0.000
loopback 224 threads 0.000 0.000 0.000 0.000
According to the test above, if the system becomes busy, the
SIS Search Efficiency(se_eff%) drops significantly. Although some
benchmarks would finally find an idle CPU(success_rate% = 100%), it is
doubtful whether it is worth it to search the whole LLC domain.
[Proposal]
It would be ideal to have a crystal ball to answer this question:
How many CPUs must a wakeup path walk down, before it can find an idle
CPU? Many potential metrics could be used to predict the number.
One candidate is the sum of util_avg in this LLC domain. The benefit
of choosing util_avg is that it is a metric of accumulated historic
activity, which seems to be smoother than instantaneous metrics
(such as rq->nr_running). Besides, choosing the sum of util_avg
would help predict the load of the LLC domain more precisely, because
SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
time.
In summary, the lower the util_avg is, the more select_idle_cpu()
should scan for idle CPU, and vice versa. When the sum of util_avg
in this LLC domain hits 85% or above, the scan stops. The reason to
choose 85% as the threshold is that this is the imbalance_pct(117)
when a LLC sched group is overloaded.
Introduce the quadratic function:
y = SCHED_CAPACITY_SCALE - p * x^2
and y'= y / SCHED_CAPACITY_SCALE
x is the ratio of sum_util compared to the CPU capacity:
x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
y' is the ratio of CPUs to be scanned in the LLC domain,
and the number of CPUs to scan is calculated by:
nr_scan = llc_weight * y'
Choosing quadratic function is because:
[1] Compared to the linear function, it scans more aggressively when the
sum_util is low.
[2] Compared to the exponential function, it is easier to calculate.
[3] It seems that there is no accurate mapping between the sum of util_avg
and the number of CPUs to be scanned. Use heuristic scan for now.
For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util% 0 5 15 25 35 45 55 65 75 85 86 ...
scan_nr 112 111 108 102 93 81 65 47 25 1 0 ...
For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
sum_util% 0 5 15 25 35 45 55 65 75 85 86 ...
scan_nr 16 15 15 14 13 11 9 6 3 0 0 ...
Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the
sum_util calculated by periodic load balance after 112 ms would
decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay
in reflecting the latest utilization. But it is a trade-off.
Checking the util_avg in newidle load balance would be more frequent,
but it brings overhead - multiple CPUs write/read the per-LLC shared
variable and introduces cache contention. Tim also mentioned that,
it is allowed to be non-optimal in terms of scheduling for the
short-term variations, but if there is a long-term trend in the load
behavior, the scheduler can adjust for that.
When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan
calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
Mel suggested, SIS_UTIL should be enabled by default.
This patch is based on the util_avg, which is very sensitive to the
CPU frequency invariance. There is an issue that, when the max frequency
has been clamp, the util_avg would decay insanely fast when
the CPU is idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo
in frequency invariance") could be used to mitigate this symptom, by adjusting
the arch_max_freq_ratio when turbo is disabled. But this issue is still
not thoroughly fixed, because the current code is unaware of the user-specified
max CPU frequency.
[Test result]
netperf and tbench were launched with 25% 50% 75% 100% 125% 150%
175% 200% of CPU number respectively. Hackbench and schbench were launched
by 1, 2 ,4, 8 groups. Each test lasts for 100 seconds and repeats 3 times.
The following is the benchmark result comparison between
baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
indicates better performance.
Each netperf test is a:
netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
netperf.throughput
=======
case load baseline(std%) compare%( std%)
TCP_RR 28 threads 1.00 ( 0.34) -0.16 ( 0.40)
TCP_RR 56 threads 1.00 ( 0.19) -0.02 ( 0.20)
TCP_RR 84 threads 1.00 ( 0.39) -0.47 ( 0.40)
TCP_RR 112 threads 1.00 ( 0.21) -0.66 ( 0.22)
TCP_RR 140 threads 1.00 ( 0.19) -0.69 ( 0.19)
TCP_RR 168 threads 1.00 ( 0.18) -0.48 ( 0.18)
TCP_RR 196 threads 1.00 ( 0.16) +194.70 ( 16.43)
TCP_RR 224 threads 1.00 ( 0.16) +197.30 ( 7.85)
UDP_RR 28 threads 1.00 ( 0.37) +0.35 ( 0.33)
UDP_RR 56 threads 1.00 ( 11.18) -0.32 ( 0.21)
UDP_RR 84 threads 1.00 ( 1.46) -0.98 ( 0.32)
UDP_RR 112 threads 1.00 ( 28.85) -2.48 ( 19.61)
UDP_RR 140 threads 1.00 ( 0.70) -0.71 ( 14.04)
UDP_RR 168 threads 1.00 ( 14.33) -0.26 ( 11.16)
UDP_RR 196 threads 1.00 ( 12.92) +186.92 ( 20.93)
UDP_RR 224 threads 1.00 ( 11.74) +196.79 ( 18.62)
Take the 224 threads as an example, the SIS search metrics changes are
illustrated below:
vanilla patched
4544492 +237.5% 15338634 sched_debug.cpu.sis_domain_search.avg
38539 +39686.8% 15333634 sched_debug.cpu.sis_failed.avg
128300000 -87.9% 15551326 sched_debug.cpu.sis_scanned.avg
5842896 +162.7% 15347978 sched_debug.cpu.sis_search.avg
There is -87.9% less CPU scans after patched, which indicates lower overhead.
Besides, with this patch applied, there is -13% less rq lock contention
in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This might help explain the performance improvement - Because this patch allows
the waking task to remain on the previous CPU, rather than grabbing other CPUs'
lock.
Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
hackbench.throughput
=========
case load baseline(std%) compare%( std%)
process-pipe 1 group 1.00 ( 1.29) +0.57 ( 0.47)
process-pipe 2 groups 1.00 ( 0.27) +0.77 ( 0.81)
process-pipe 4 groups 1.00 ( 0.26) +1.17 ( 0.02)
process-pipe 8 groups 1.00 ( 0.15) -4.79 ( 0.02)
process-sockets 1 group 1.00 ( 0.63) -0.92 ( 0.13)
process-sockets 2 groups 1.00 ( 0.03) -0.83 ( 0.14)
process-sockets 4 groups 1.00 ( 0.40) +5.20 ( 0.26)
process-sockets 8 groups 1.00 ( 0.04) +3.52 ( 0.03)
threads-pipe 1 group 1.00 ( 1.28) +0.07 ( 0.14)
threads-pipe 2 groups 1.00 ( 0.22) -0.49 ( 0.74)
threads-pipe 4 groups 1.00 ( 0.05) +1.88 ( 0.13)
threads-pipe 8 groups 1.00 ( 0.09) -4.90 ( 0.06)
threads-sockets 1 group 1.00 ( 0.25) -0.70 ( 0.53)
threads-sockets 2 groups 1.00 ( 0.10) -0.63 ( 0.26)
threads-sockets 4 groups 1.00 ( 0.19) +11.92 ( 0.24)
threads-sockets 8 groups 1.00 ( 0.08) +4.31 ( 0.11)
Each tbench test is a:
tbench -t 100 $job 127.0.0.1
tbench.throughput
======
case load baseline(std%) compare%( std%)
loopback 28 threads 1.00 ( 0.06) -0.14 ( 0.09)
loopback 56 threads 1.00 ( 0.03) -0.04 ( 0.17)
loopback 84 threads 1.00 ( 0.05) +0.36 ( 0.13)
loopback 112 threads 1.00 ( 0.03) +0.51 ( 0.03)
loopback 140 threads 1.00 ( 0.02) -1.67 ( 0.19)
loopback 168 threads 1.00 ( 0.38) +1.27 ( 0.27)
loopback 196 threads 1.00 ( 0.11) +1.34 ( 0.17)
loopback 224 threads 1.00 ( 0.11) +1.67 ( 0.22)
Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000
schbench.latency_90%_us
========
case load baseline(std%) compare%( std%)
normal 1 mthread 1.00 ( 31.22) -7.36 ( 20.25)*
normal 2 mthreads 1.00 ( 2.45) -0.48 ( 1.79)
normal 4 mthreads 1.00 ( 1.69) +0.45 ( 0.64)
normal 8 mthreads 1.00 ( 5.47) +9.81 ( 14.28)
*Consider the Standard Deviation, this -7.36% regression might not be valid.
Also, a OLTP workload with a commercial RDBMS has been tested, and there
is no significant change.
There were concerns that unbalanced tasks among CPUs would cause problems.
For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
bound to CPU0~CPU6, while CPU7 is idle:
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
util_avg 1024 1024 1024 1024 1024 1024 1024 0
Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%,
select_idle_cpu() will not scan, thus CPU7 is undetected during scan.
But according to Mel, it is unlikely the CPU7 will be idle all the time
because CPU7 could pull some tasks via CPU_NEWLY_IDLE.
lkp(kernel test robot) has reported a regression on stress-ng.sock on a
very busy system. According to the sched_debug statistics, it might be caused
by SIS_UTIL terminates the scan and chooses a previous CPU earlier, and this
might introduce more context switch, especially involuntary preemption, which
impacts a busy stress-ng. This regression has shown that, not all benchmarks
in every scenario benefit from idle CPU scan limit, and it needs further
investigation.
Besides, there is slight regression in hackbench's 16 groups case when the
LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively
in an LLC domain with 16 CPUs. Because the cost to search for an idle one
among 16 CPUs is negligible. The current patch aims to propose a generic
solution and only considers the util_avg. Something like the below could
be applied on top of the current patch to fulfill the requirement:
if (llc_weight <= 16)
nr_scan = nr_scan * 32 / llc_weight;
For LLC domain with 16 CPUs, the nr_scan will be expanded to 2 times large.
The smaller the CPU number this LLC domain has, the larger nr_scan will be
expanded. This needs further investigation.
There is also ongoing work[2] from Abel to filter out the busy CPUs during
wakeup, to further speed up the idle CPU scan. And it could be a following-up
optimization on top of this change.
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
|
|
so it will be consistent with code mm directory and with
Documentation/admin-guide/mm and won't be confused with virtual machines.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Jonathan Corbet <corbet@lwn.net>
Acked-by: Wu XiangCheng <bobwxc@email.cn>
|