Age | Commit message (Collapse) | Author |
|
- Use i_size instead of i_size_read() due to immutable fses;
- Get rid of an unneeded goto since erofs_fill_dentries() also works;
- Remove unnecessary lines.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240801112622.2164029-1-hongzhen@linux.alibaba.com
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Replace the deprecated one-element arrays with flexible-array members
in the structs filesystem_attribute_info and filesystem_device_info.
There are no binary differences after this conversion.
Link: https://github.com/KSPP/linux/issues/79
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
There are a couple of spelling mistakes in the documentation. This patch
fixes them.
Signed-off-by: Victor Timofei <victor@vtimothy.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
operations()
If there is ->PreviousSessionId field in the session setup request,
The session of the previous connection should be destroyed.
During this, if the smb2 operation requests in the previous session are
being processed, a racy issue could happen with ksmbd_destroy_file_table().
This patch sets conn->status to KSMBD_SESS_NEED_RECONNECT to block
incoming operations and waits until on-going operations are complete
(i.e. idle) before desctorying the previous session.
Fixes: c8efcc786146 ("ksmbd: add support for durable handles v1/v2")
Cc: stable@vger.kernel.org # v6.6+
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-25040
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
rsp buffer is allocated larger than spnego_blob from
smb2_allocate_rsp_buf().
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are two driver fixes for regressions from 6.11-rc1 due to the
driver core change making a structure in a driver core callback const.
These were missed by all testing EXCEPT for what Bart happened to be
running, so I appreciate the fixes provided here for some
odd/not-often-used driver subsystems that nothing else happened to
catch.
Both of these fixes have been in linux-next all week with no reported
issues"
* tag 'driver-core-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
mips: sgi-ip22: Fix the build
ARM: riscpc: ecard: Fix the build
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char / misc fixes from Greg KH:
"Here are some small char/misc fixes for 6.11-rc4 to resolve reported
problems. Included in here are:
- fastrpc revert of a change that broke userspace
- xillybus fixes for reported issues
Half of these have been in linux-next this week with no reported
problems, I don't know if the last bit of xillybus driver changes made
it in, but they are 'obviously correct' so will be safe :)"
* tag 'char-misc-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
char: xillybus: Check USB endpoints when probing device
char: xillybus: Refine workqueue handling
Revert "misc: fastrpc: Restrict untrusted app to attach to privileged PD"
char: xillybus: Don't destroy workqueue from work item running on it
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty / serial fixes from Greg KH:
"Here are some small tty and serial driver fixes for 6.11-rc4 to
resolve some reported problems. Included in here are:
- conmakehash.c userspace build issues
- fsl_lpuart driver fix
- 8250_omap revert for reported regression
- atmel_serial rts flag fix
All of these have been in linux-next this week with no reported
issues"
* tag 'tty-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
Revert "serial: 8250_omap: Set the console genpd always on if no console suspend"
tty: atmel_serial: use the correct RTS flag.
tty: vt: conmakehash: remove non-portable code printing comment header
tty: serial: fsl_lpuart: mark last busy before uart_add_one_port
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB / Thunderbolt driver fixes from Greg KH:
"Here are some small USB and Thunderbolt driver fixes for 6.11-rc4 to
resolve some reported issues. Included in here are:
- thunderbolt driver fixes for reported problems
- typec driver fixes
- xhci fixes
- new device id for ljca usb driver
All of these have been in linux-next this week with no reported
issues"
* tag 'usb-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
xhci: Fix Panther point NULL pointer deref at full-speed re-enumeration
usb: misc: ljca: Add Lunar Lake ljca GPIO HID to ljca_gpio_hids[]
Revert "usb: typec: tcpm: clear pd_event queue in PORT_RESET"
usb: typec: ucsi: Fix the return value of ucsi_run_command()
usb: xhci: fix duplicate stall handling in handle_tx_event()
usb: xhci: Check for xhci->interrupters being allocated in xhci_mem_clearup()
thunderbolt: Mark XDomain as unplugged when router is removed
thunderbolt: Fix memory leaks in {port|retimer}_sb_regs_write()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs fixes from David Sterba:
"A more fixes. We got reports that shrinker added in 6.10 still causes
latency spikes and the fixes don't handle all corner cases. Due to
summer holidays we're taking a shortcut to disable it for release
builds and will fix it in the near future.
- only enable extent map shrinker for DEBUG builds, temporary quick
fix to avoid latency spikes for regular builds
- update target inode's ctime on unlink, mandated by POSIX
- properly take lock to read/update block group's zoned variables
- add counted_by() annotations"
* tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: only enable extent map shrinker for DEBUG builds
btrfs: zoned: properly take lock to read/update block group's zoned variables
btrfs: tree-checker: add dev extent item checks
btrfs: update target inode's ctime on unlink
btrfs: send: annotate struct name_cache_entry with __counted_by()
|
|
fuse_notify_store(), unlike fuse_do_readpage(), does not enable page
zeroing (because it can be used to change partial page contents).
So fuse_notify_store() must be more careful to fully initialize page
contents (including parts of the page that are beyond end-of-file)
before marking the page uptodate.
The current code can leave beyond-EOF page contents uninitialized, which
makes these uninitialized page contents visible to userspace via mmap().
This is an information leak, but only affects systems which do not
enable init-on-alloc (via CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y or the
corresponding kernel command line parameter).
Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2574
Cc: stable@kernel.org
Fixes: a1d75f258230 ("fuse: add store request")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"16 hotfixes. All except one are for MM. 10 of these are cc:stable and
the others pertain to post-6.10 issues.
As usual with these merges, singletons and doubletons all over the
place, no identifiable-by-me theme. Please see the lovingly curated
changelogs to get the skinny"
* tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/migrate: fix deadlock in migrate_pages_batch() on large folios
alloc_tag: mark pages reserved during CMA activation as not tagged
alloc_tag: introduce clear_page_tag_ref() helper function
crash: fix riscv64 crash memory reserve dead loop
selftests: memfd_secret: don't build memfd_secret test on unsupported arches
mm: fix endless reclaim on machines with unaccepted memory
selftests/mm: compaction_test: fix off by one in check_compaction()
mm/numa: no task_numa_fault() call if PMD is changed
mm/numa: no task_numa_fault() call if PTE is changed
mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0
mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu
mm: don't account memmap per-node
mm: add system wide stats items category
mm: don't account memmap on failure
mm/hugetlb: fix hugetlb vs. core-mm PT locking
mseal: fix is_madv_discard()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix crashes on 85xx with some configs since the recent hugepd rework.
- Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL on some
platforms.
- Don't enable offline cores when changing SMT modes, to match existing
userspace behaviour.
Thanks to Christophe Leroy, Dr. David Alan Gilbert, Guenter Roeck, Nysal
Jan K.A, Shrikanth Hegde, Thomas Gleixner, and Tyrel Datwyler.
* tag 'powerpc-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/topology: Check if a core is online
cpu/SMT: Enable SMT only if a core is online
powerpc/mm: Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL
powerpc/mm: Fix size of allocated PGDIR
soc: fsl: qbman: remove unused struct 'cgr_comp'
|
|
Pull smb client fixes from Steve French:
- fix for clang warning - additional null check
- fix for cached write with posix locks
- flexible structure fix
* tag 'v6.11-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: smb2pdu.h: Use static_assert() to check struct sizes
smb3: fix lock breakage for cached writes
smb/client: avoid possible NULL dereference in cifs_free_subrequest()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"I2C core fix replacing IS_ENABLED() with IS_REACHABLE()
For host drivers, there are two fixes:
- Tegra I2C Controller: Addresses a potential double-locking issue
during probe. ACPI devices are not IRQ-safe when invoking runtime
suspend and resume functions, so the irq_safe flag should not be
set.
- Qualcomm GENI I2C Controller: Fixes an oversight in the exit path
of the runtime_resume() function, which was missed in the previous
release"
* tag 'i2c-for-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: tegra: Do not mark ACPI devices as irq safe
i2c: Use IS_REACHABLE() for substituting empty ACPI functions
i2c: qcom-geni: Add missing geni_icc_disable in geni_i2c_runtime_resume
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two small fixes to the mpi3mr driver. One to avoid oversize
allocations in tracing and the other to fix an uninitialized spinlock
in the user to driver feature request code (used to trigger dumps and
the like)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: mpi3mr: Avoid MAX_PAGE_ORDER WARNING for buffer allocations
scsi: mpi3mr: Add missing spin_lock_init() for mrioc->trigger_lock
|
|
Pull xfs fixes from Chandan Babu:
- Check for presence of only 'attr' feature before scrubbing an inode's
attribute fork.
- Restore the behaviour of setting AIL thread to TASK_INTERRUPTIBLE for
long (i.e. 50ms) sleep durations to prevent high load averages.
- Do not allow users to change the realtime flag of a file unless the
datadev and rtdev both support fsdax access modes.
* tag 'xfs-6.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: conditionally allow FS_XFLAG_REALTIME changes if S_DAX is set
xfs: revert AIL TASK_KILLABLE threshold
xfs: attr forks require attr, not attr2
|
|
Pull bcachefs fixes from Kent OverstreetL
- New on disk format version, bcachefs_metadata_version_disk_accounting_inum
This adds one more disk accounting counter, which counts disk usage
and number of extents per inode number. This lets us track
fragmentation, for implementing defragmentation later, and it also
counts disk usage per inode in all snapshots, which will be a useful
thing to expose to users.
- One performance issue we've observed is threads spinning when they
should be waiting for dirty keys in the key cache to be flushed by
journal reclaim, so we now have hysteresis for the waiting thread, as
well as improving the tracepoint and a new time_stat, for tracking
time blocked waiting on key cache flushing.
... and various assorted smaller fixes.
* tag 'bcachefs-2024-08-16' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix locking in __bch2_trans_mark_dev_sb()
bcachefs: fix incorrect i_state usage
bcachefs: avoid overflowing LRU_TIME_BITS for cached data lru
bcachefs: Fix forgetting to pass trans to fsck_err()
bcachefs: Increase size of cuckoo hash table on too many rehashes
bcachefs: bcachefs_metadata_version_disk_accounting_inum
bcachefs: Kill __bch2_accounting_mem_mod()
bcachefs: Make bkey_fsck_err() a wrapper around fsck_err()
bcachefs: Fix warning in __bch2_fsck_err() for trans not passed in
bcachefs: Add a time_stat for blocked on key cache flush
bcachefs: Improve trans_blocked_journal_reclaim tracepoint
bcachefs: Add hysteresis to waiting on btree key cache flush
lib/generic-radix-tree.c: Fix rare race in __genradix_ptr_alloc()
bcachefs: Convert for_each_btree_node() to lockrestart_do()
bcachefs: Add missing downgrade table entry
bcachefs: disk accounting: ignore unknown types
bcachefs: bch2_accounting_invalid() fixup
bcachefs: Fix bch2_trigger_alloc when upgrading from old versions
bcachefs: delete faulty fastpath in bch2_btree_path_traverse_cached()
|
|
In the absence of an explicit cgroup slice configureation, make mixed
slice length work with cgroups by propagating the min_slice up the
hierarchy.
This ensures the cgroup entity gets timely service to service its
entities that have this timing constraint set on them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.948188417@infradead.org
|
|
Allow applications to directly set a suggested request/slice length using
sched_attr::sched_runtime.
The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100.
Applications should strive to use their periodic runtime at a high
confidence interval (95%+) as the target slice. Using a smaller slice
will introduce undue preemptions, while using a larger value will
increase latency.
For all the following examples assume a scheduling quantum of 8, and for
consistency all examples have W=4:
{A,B,C,D}(w=1,r=8):
ABCD...
+---+---+---+---
t=0, V=1.5 t=1, V=3.5
A |------< A |------<
B |------< B |------<
C |------< C |------<
D |------< D |------<
---+*------+-------+--- ---+--*----+-------+---
t=2, V=5.5 t=3, V=7.5
A |------< A |------<
B |------< B |------<
C |------< C |------<
D |------< D |------<
---+----*--+-------+--- ---+------*+-------+---
Note: 4 identical tasks in FIFO order
~~~
{A,B}(w=1,r=16) C(w=2,r=16)
AACCBBCC...
+---+---+---+---
t=0, V=1.25 t=2, V=5.25
A |--------------< A |--------------<
B |--------------< B |--------------<
C |------< C |------<
---+*------+-------+--- ---+----*--+-------+---
t=4, V=8.25 t=6, V=12.25
A |--------------< A |--------------<
B |--------------< B |--------------<
C |------< C |------<
---+-------*-------+--- ---+-------+---*---+---
Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2
task doesn't go below q.
Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length.
Note: the period of the heavy task is half the full period at:
W*(r_i/w_i) = 4*(2q/2) = 4q
~~~
{A,C,D}(w=1,r=16) B(w=1,r=8):
BAACCBDD...
+---+---+---+---
t=0, V=1.5 t=1, V=3.5
A |--------------< A |---------------<
B |------< B |------<
C |--------------< C |--------------<
D |--------------< D |--------------<
---+*------+-------+--- ---+--*----+-------+---
t=3, V=7.5 t=5, V=11.5
A |---------------< A |---------------<
B |------< B |------<
C |--------------< C |--------------<
D |--------------< D |--------------<
---+------*+-------+--- ---+-------+--*----+---
t=6, V=13.5
A |---------------<
B |------<
C |--------------<
D |--------------<
---+-------+----*--+---
Note: 1 short task -- again double r so that the deadline of the short task
won't be below q. Made B short because its not the leftmost task, but is
eligible with the 0,1,2,3 spread.
Note: like with the heavy task, the period of the short task observes:
W*(r_i/w_i) = 4*(1q/1) = 4q
~~~
A(w=1,r=16) B(w=1,r=8) C(w=2,r=16)
BCCAABCC...
+---+---+---+---
t=0, V=1.25 t=1, V=3.25
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+*------+-------+--- ---+--*----+-------+---
t=3, V=7.25 t=5, V=11.25
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+------*+-------+--- ---+-------+--*----+---
t=6, V=13.25
A |--------------<
B |------<
C |------<
---+-------+----*--+---
Note: 1 heavy and 1 short task -- combine them all.
Note: both the short and heavy task end up with a period of 4q
~~~
A(w=1,r=16) B(w=2,r=16) C(w=1,r=8)
BBCAABBC...
+---+---+---+---
t=0, V=1 t=2, V=5
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+*------+-------+--- ---+----*--+-------+---
t=3, V=7 t=5, V=11
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+------*+-------+--- ---+-------+--*----+---
t=7, V=15
A |--------------<
B |------<
C |------<
---+-------+------*+---
Note: as before but permuted
~~~
From all this it can be deduced that, for the steady state:
- the total period (P) of a schedule is: W*max(r_i/w_i)
- the average period of a task is: W*(r_i/w_i)
- each task obtains the fair share: w_i/W of each full period P
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.842834421@infradead.org
|
|
Part of the reason to have shorter slices is to improve
responsiveness. Allow shorter slices to preempt longer slices on
wakeup.
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT
1 massive_intr:(5) | 846018.956 ms | 779188 | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms |
2 massive_intr:(5) | 853450.693 ms | 792269 | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms |
3 massive_intr:(5) | 843888.920 ms | 771456 | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms |
1 chromium-browse:(8) | 53015.889 ms | 131766 | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms |
2 chromium-browse:(8) | 53864.088 ms | 136962 | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms |
3 chromium-browse:(9) | 53637.904 ms | 132637 | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms |
1 cyclictest:(5) | 12615.604 ms | 639689 | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms |
2 cyclictest:(5) | 12511.583 ms | 642578 | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms |
3 cyclictest:(5) | 12545.867 ms | 635953 | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms |
100ms massive_intr 500us cyclictest PREEMPT_SHORT
1 massive_intr:(5) | 839843.919 ms | 837384 | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms |
2 massive_intr:(5) | 852449.913 ms | 845086 | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms |
3 massive_intr:(5) | 839180.725 ms | 836883 | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms |
1 chromium-browse:(11) | 54591.481 ms | 138388 | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms |
2 chromium-browse:(8) | 52034.541 ms | 132276 | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms |
3 chromium-browse:(8) | 55231.771 ms | 141892 | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms |
1 cyclictest:(5) | 13156.391 ms | 667412 | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms |
2 cyclictest:(5) | 12688.939 ms | 665144 | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms |
3 cyclictest:(5) | 13475.623 ms | 669110 | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms |
As per the numbers the, this makes cyclictest (short slice) it's
max-delay more consistent and consistency drops the sum-delay. The
trade-off is that the massive_intr (long slice) gets more context
switches and a slight increase in sum-delay.
Chunxin contributed did_preempt_short() where a task that lost slice
protection from PREEMPT_SHORT gets rescheduled once it becomes
in-eligible.
[mike: numbers]
Co-Developed-by: Chunxin Zang <zangchunxin@lixiang.com>
Signed-off-by: Chunxin Zang <zangchunxin@lixiang.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Link: https://lkml.kernel.org/r/20240727105030.735459544@infradead.org
|
|
During OSPM24 Youssef noted that migrations are re-setting the virtual
deadline. Notably everything that does a dequeue-enqueue, like setting
nice, changing preferred numa-node, and a myriad of other random crap,
will cause this to happen.
This shouldn't be. Preserve the relative virtual deadline across such
dequeue/enqueue cycles.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.625119246@infradead.org
|
|
Note that tasks that are kept on the runqueue to burn off negative
lag, are not in fact runnable anymore, they'll get dequeued the moment
they get picked.
As such, don't count this time towards runnable.
Thanks to Valentin for spotting I had this backwards initially.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.514088302@infradead.org
|
|
'Extend' DELAY_DEQUEUE by noting that since we wanted to dequeued them
at the 0-lag point, truncate lag (eg. don't let them earn positive
lag).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.403750550@infradead.org
|
|
Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
noting that lag is fundamentally a temporal measure. It should not be
carried around indefinitely.
OTOH it should also not be instantly discarded, doing so will allow a
task to game the system by purposefully (micro) sleeping at the end of
its time quantum.
Since lag is intimately tied to the virtual time base, a wall-time
based decay is also insufficient, notably competition is required for
any of this to make sense.
Instead, delay the dequeue and keep the 'tasks' on the runqueue,
competing until they are eligible.
Strictly speaking, we only care about keeping them until the 0-lag
point, but that is a difficult proposition, instead carry them around
until they get picked again, and dequeue them at that point.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.226163742@infradead.org
|
|
Since special task states must not suffer spurious wakeups, and the
proposed delayed dequeue can cause exactly these (under some boundary
conditions), propagate this knowledge into dequeue_task() such that it
can do the right thing.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.110439521@infradead.org
|
|
The special task states are those that do not suffer spurious wakeups,
TASK_FROZEN is very much one of those, mark it as such.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.998329901@infradead.org
|
|
Doing a wakeup on a delayed dequeue task is about as simple as it
sounds -- remove the delayed mark and enjoy the fact it was actually
still on the runqueue.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.888107381@infradead.org
|
|
Delayed dequeue's natural end is when it gets picked again. Ensure
pick_next_task() knows what to do with delayed tasks.
Note, this relies on the earlier patch that made pick_next_task()
state invariant -- it will restart the pick on dequeue, because
obviously the just dequeued task is no longer eligible.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.747330118@infradead.org
|
|
When dequeue_task() is delayed it becomes possible to exit a task (or
cgroup) that is still enqueued. Ensure things are dequeued before
freeing.
Thanks to Valentin for asking the obvious questions and making
switched_from_fair() less weird.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.631948434@infradead.org
|
|
Just a little sanity test..
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.486423066@infradead.org
|
|
Delayed dequeue has tasks sit around on the runqueue that are not
actually runnable -- specifically, they will be dequeued the moment
they get picked.
One side-effect is that such a task can get migrated, which leads to a
'nested' dequeue_task() scenario that messes up uclamp if we don't
take care.
Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on
the runqueue. This however will have removed the task from uclamp --
per uclamp_rq_dec() in dequeue_task(). So far so good.
However, if at that point the task gets migrated -- or nice adjusted
or any of a myriad of operations that does a dequeue-enqueue cycle --
we'll pass through dequeue_task()/enqueue_task() again. Without
modification this will lead to a double decrement for uclamp, which is
wrong.
Reported-by: Luis Machado <luis.machado@arm.com>
Reported-by: Hongyan Xia <hongyan.xia2@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.315205425@infradead.org
|
|
While most of the delayed dequeue code can be done inside the
sched_class itself, there is one location where we do not have an
appropriate hook, namely ttwu_runnable().
Add an ENQUEUE_DELAYED call to the on_rq path to deal with waking
delayed dequeue tasks.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.200000445@infradead.org
|
|
As a preparation for dequeue_task() failing, and a second code-path
needing to take care of the 'success' path, split out the DEQEUE_SLEEP
path from deactivate_task().
Much thanks to Libo for spotting and fixing a TASK_ON_RQ_MIGRATING
ordering fail.
Fixed-by: Libo Chen <libo.chen@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.086192709@infradead.org
|
|
Working towards delaying dequeue, notably also inside the hierachy,
rework dequeue_task_fair() such that it can 'resume' an interrupted
hierarchy walk.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.977256873@infradead.org
|
|
Change the function signature of sched_class::dequeue_task() to return
a boolean, allowing future patches to 'fail' dequeue.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.864630153@infradead.org
|
|
Implement pick_next_task_fair() in terms of pick_task_fair() to
de-duplicate the pick loop.
More importantly, this makes all the pick loops use the
state-invariant form, which is useful to introduce further re-try
conditions in later patches.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.725062368@infradead.org
|
|
With 4c456c9ad334 ("sched/fair: Remove unused 'curr' argument from
pick_next_entity()") curr is no longer being used, so no point in
clearing it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.614707623@infradead.org
|
|
Per 54d27365cae8 ("sched/fair: Prevent throttling in early
pick_next_task_fair()") the reason check_cfs_rq_runtime() is under the
'if (curr)' check is to ensure the (downward) traversal does not
result in an empty cfs_rq.
But then the pick_task_fair() 'copy' of all this made it restart the
traversal anyway, so that seems to solve the issue too.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.501679876@infradead.org
|
|
Since commit e8f331bcc270 ("sched/smp: Use lag to simplify
cross-runqueue placement") the min_vruntime_copy is no longer used.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.395297941@infradead.org
|
|
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.287790895@infradead.org
|
|
We run this in full RW mode now, so we have to guard against the
superblock buffer being reallocated.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Pull memcg-v1 fix from Al Viro:
"memcg_write_event_control() oops fix"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
memcg_write_event_control(): fix a user-triggerable oops
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Fix the arm64 __get_mem_asm() to use the _ASM_EXTABLE_##type##ACCESS()
macro instead of the *_ERR() one in order to avoid writing -EFAULT to
the value register in case of a fault
- Initialise all elements of the acpi_early_node_map[] to NUMA_NO_NODE.
Prior to this fix, only the first element was initialised
- Move the KASAN random tag seed initialisation after the per-CPU areas
have been initialised (prng_state is __percpu)
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Fix KASAN random tag seed initialization
arm64: ACPI: NUMA: initialize all values of acpi_early_node_map to NUMA_NO_NODE
arm64: uaccess: correct thinko in __get_mem_asm()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fix from Stephen Boyd:
"One fix for the new T-Head TH1520 clk driver that marks a bus clk
critical so that it isn't turned off during late init which breaks
emmc-sdio"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: thead: fix dependency on clk_ignore_unused
|
|
Pull block fixes from Jens Axboe:
- Fix corruption issues with s390/dasd (Eric, Stefan)
- Fix a misuse of non irq locking grab of a lock (Li)
- MD pull request with a single data corruption fix for raid1 (Yu)
* tag 'block-6.11-20240824' of git://git.kernel.dk/linux:
block: Fix lockdep warning in blk_mq_mark_tag_wait
md/raid1: Fix data corruption for degraded array with slow disk
s390/dasd: fix error recovery leading to data corruption on ESE devices
s390/dasd: Remove DMA alignment
|
|
Pull io_uring fixes from Jens Axboe:
- Fix a comment in the uapi header using the wrong member name (Caleb)
- Fix KCSAN warning for a debug check in sqpoll (me)
- Two more NAPI tweaks (Olivier)
* tag 'io_uring-6.11-20240824' of git://git.kernel.dk/linux:
io_uring: fix user_data field name in comment
io_uring/sqpoll: annotate debug task == current with data_race()
io_uring/napi: remove duplicate io_napi_entry timeout assignation
io_uring/napi: check napi_enabled in io_napi_add() before proceeding
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
- Fix a possible (but unlikely) out-of-bounds read in interrupts
parsing code
- Add AT25 EEPROM "fujitsu,mb85rs256" compatible
- Update Konrad Dybcio's email
* tag 'devicetree-fixes-for-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of/irq: Prevent device address out-of-bounds read in interrupt map walk
dt-bindings: eeprom: at25: add fujitsu,mb85rs256 compatible
dt-bindings: Batch-update Konrad Dybcio's email
|
|
Although there are several patches improving the extent map shrinker,
there are still reports of too frequent shrinker behavior, taking too
much CPU for the kswapd process.
So let's only enable extent shrinker for now, until we got more
comprehensive understanding and a better solution.
Link: https://lore.kernel.org/linux-btrfs/3df4acd616a07ef4d2dc6bad668701504b412ffc.camel@intelfx.name/
Link: https://lore.kernel.org/linux-btrfs/c30fd6b3-ca7a-4759-8a53-d42878bf84f7@gmail.com/
Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps")
CC: stable@vger.kernel.org # 6.10+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|