Age | Commit message (Collapse) | Author |
|
As we know we cannot send the datagram (state can be set to LLCP_CLOSED
by nfc_llcp_socket_release()), there is no need to proceed further.
Thus, bail out early from llcp_sock_sendmsg().
Signed-off-by: Siddh Raman Pant <code@siddh.me>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
llcp_sock_sendmsg() calls nfc_llcp_send_ui_frame() which in turn calls
nfc_alloc_send_skb(), which accesses the nfc_dev from the llcp_sock for
getting the headroom and tailroom needed for skb allocation.
Parallelly the nfc_dev can be freed, as the refcount is decreased via
nfc_free_device(), leading to a UAF reported by Syzkaller, which can
be summarized as follows:
(1) llcp_sock_sendmsg() -> nfc_llcp_send_ui_frame()
-> nfc_alloc_send_skb() -> Dereference *nfc_dev
(2) virtual_ncidev_close() -> nci_free_device() -> nfc_free_device()
-> put_device() -> nfc_release() -> Free *nfc_dev
When a reference to llcp_local is acquired, we do not acquire the same
for the nfc_dev. This leads to freeing even when the llcp_local is in
use, and this is the case with the UAF described above too.
Thus, when we acquire a reference to llcp_local, we should acquire a
reference to nfc_dev, and release the references appropriately later.
References for llcp_local is initialized in nfc_llcp_register_device()
(which is called by nfc_register_device()). Thus, we should acquire a
reference to nfc_dev there.
nfc_unregister_device() calls nfc_llcp_unregister_device() which in
turn calls nfc_llcp_local_put(). Thus, the reference to nfc_dev is
appropriately released later.
Reported-and-tested-by: syzbot+bbe84a4010eeea00982d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bbe84a4010eeea00982d
Fixes: c7aa12252f51 ("NFC: Take a reference on the LLCP local pointer when creating a socket")
Reviewed-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Siddh Raman Pant <code@siddh.me>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 2f3ce7a56c6e ("net: sfp: rework the RollBall PHY waiting code")
changed the long wait before accessing RollBall / FS modules into
probing for PHY every 1 second, and trying 25 times.
Wei Lei reports that this does not work correctly on FS modules: when
initializing, they may report values different from 0xffff in PHY ID
registers for some MMDs, causing get_phy_c45_ids() to find some bogus
MMD.
Fix this by adding the module_t_wait member back, and setting it to 4
seconds for FS modules.
Fixes: 2f3ce7a56c6e ("net: sfp: rework the RollBall PHY waiting code")
Reported-by: Wei Lei <quic_leiwei@quicinc.com>
Signed-off-by: Marek Behún <kabel@kernel.org>
Tested-by: Lei Wei <quic_leiwei@quicinc.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some ioctl commands do not require ioctl permission, but are routed to
other permissions such as FILE_GETATTR or FILE_SETATTR. This routing is
done by comparing the ioctl cmd to a set of 64-bit flags (FS_IOC_*).
However, if a 32-bit process is running on a 64-bit kernel, it emits
32-bit flags (FS_IOC32_*) for certain ioctl operations. These flags are
being checked erroneously, which leads to these ioctl operations being
routed to the ioctl permission, rather than the correct file
permissions.
This was also noted in a RED-PEN finding from a while back -
"/* RED-PEN how should LSM module know it's handling 32bit? */".
This patch introduces a new hook, security_file_ioctl_compat(), that is
called from the compat ioctl syscall. All current LSMs have been changed
to support this hook.
Reviewing the three places where we are currently using
security_file_ioctl(), it appears that only SELinux needs a dedicated
compat change; TOMOYO and SMACK appear to be functional without any
change.
Cc: stable@vger.kernel.org
Fixes: 0b24dcb7f2f7 ("Revert "selinux: simplify ioctl checking"")
Signed-off-by: Alfred Piccioni <alpic@google.com>
Reviewed-by: Stephen Smalley <stephen.smalley.work@gmail.com>
[PM: subject tweak, line length fixes, and alignment corrections]
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
Create /proc/net/rxrpc/bundles to display outstanding rxrpc client
connection bundles.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Fold the afs_addr_cursor struct into the afs_operation struct and the
afs_vl_cursor struct and fold its operations into their callers also.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Use the rxrpc_peer plus the service ID as the call address instead of
passing in a sockaddr_srx down to rxrpc. The peer record is obtained by
using rxrpc_kernel_get_peer(). This avoids the need to repeatedly look up
the peer and allows rxrpc to hold on to resources for it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Rename the ->index and ->untried fields of the afs_vl_cursor and
afs_operation struct to ->server_index and ->untried_servers to avoid
confusion with address iteration fields when those get folded in.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Add a tracepoint to track the lifetime of the afs_addr_list struct.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Simplify error handling a bit by moving it from the afs_addr_cursor struct
to the afs_operation and afs_vl_cursor structs and using the error
prioritisation function for accumulating errors from multiple sources (AFS
tries to rotate between multiple fileservers, some of which may be
inaccessible or in some state of offlinedness).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Don't put the afs_call struct in afs_wait_for_call_to_complete() but rather
have the caller do it. This will allow the caller to fish stuff out of the
afs_call struct rather than the afs_addr_cursor struct, thereby allowing a
subsequent patch to subsume it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Wrap most op->error accesses with inline funcs which will make it easier
for a subsequent patch to replace op->error with something else. Two
functions are added to this end:
(1) afs_op_error() - Get the error code.
(2) afs_op_set_error() - Set the error code.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Set op->nr_iterations to -1 to indicate that we need to begin fileserver
iteration rather than setting error to SHRT_MAX. This makes it easier to
eliminate the address cursor.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
When processing the result of a call, handle the VIO and UAEIO abort
specifically rather than leaving it to a default case. Rather than
erroring out unconditionally, see if there's another server if the volume
has more than one server available, otherwise return -EREMOTEIO.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Rename the failed member of struct addr_list to probe_failed as it's
specifically related to probe failures.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
In the rotation algorithms for iterating over volume location servers and
file servers, don't skip servers from which we got a valid response to a
probe (either a reply DATA packet or an ABORT) even if we didn't manage to
get an RTT reading.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Change rxrpc's API such that:
(1) A new function, rxrpc_kernel_lookup_peer(), is provided to look up an
rxrpc_peer record for a remote address and a corresponding function,
rxrpc_kernel_put_peer(), is provided to dispose of it again.
(2) When setting up a call, the rxrpc_peer object used during a call is
now passed in rather than being set up by rxrpc_connect_call(). For
afs, this meenat passing it to rxrpc_kernel_begin_call() rather than
the full address (the service ID then has to be passed in as a
separate parameter).
(3) A new function, rxrpc_kernel_remote_addr(), is added so that afs can
get a pointer to the transport address for display purposed, and
another, rxrpc_kernel_remote_srx(), to gain a pointer to the full
rxrpc address.
(4) The function to retrieve the RTT from a call, rxrpc_kernel_get_srtt(),
is then altered to take a peer. This now returns the RTT or -1 if
there are insufficient samples.
(5) Rename rxrpc_kernel_get_peer() to rxrpc_kernel_call_get_peer().
(6) Provide a new function, rxrpc_kernel_get_peer(), to get a ref on a
peer the caller already has.
This allows the afs filesystem to pin the rxrpc_peer records that it is
using, allowing faster lookups and pointer comparisons rather than
comparing sockaddr_rxrpc contents. It also makes it easier to get hold of
the RTT. The following changes are made to afs:
(1) The addr_list struct's addrs[] elements now hold a peer struct pointer
and a service ID rather than a sockaddr_rxrpc.
(2) When displaying the transport address, rxrpc_kernel_remote_addr() is
used.
(3) The port arg is removed from afs_alloc_addrlist() since it's always
overridden.
(4) afs_merge_fs_addr4() and afs_merge_fs_addr6() do peer lookup and may
now return an error that must be handled.
(5) afs_find_server() now takes a peer pointer to specify the address.
(6) afs_find_server(), afs_compare_fs_alists() and afs_merge_fs_addr[46]{}
now do peer pointer comparison rather than address comparison.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Turn the afs_addr_list address array into an array of structs, thereby
allowing per-address (such as RTT) info to be added.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Add some comments on AFS abort code handling in the rotation algorithm and
adjust the errors produced to match.
Reported-by: Jeffrey E Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
rxrpc_find_service_conn_rcu() should make the "seq" counter odd on the
second pass, otherwise read_seqbegin_or_lock() never takes the lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20231117164846.GA10410@redhat.com/
|
|
David Howells says:
(3) afs_check_validity().
(4) afs_getattr().
These are both pretty short, so your solution is probably good for them.
That said, afs_vnode_commit_status() can spend a long time under the
write lock - and pretty much every file RPC op returns a status update.
Change these functions to use read_seqbegin(). This simplifies the code
and doesn't change the current behaviour, the "seq" counter is always even
so read_seqbegin_or_lock() can never take the lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20231130115617.GA21584@redhat.com/
|
|
David Howells says:
(5) afs_find_server().
There could be a lot of servers in the list and each server can have
multiple addresses, so I think this would be better with an exclusive
second pass.
The server list isn't likely to change all that often, but when it does
change, there's a good chance several servers are going to be
added/removed one after the other. Further, this is only going to be
used for incoming cache management/callback requests from the server,
which hopefully aren't going to happen too often - but it is remotely
drivable.
(6) afs_find_server_by_uuid().
Similarly to (5), there could be a lot of servers to search through, but
they are in a tree not a flat list, so it should be faster to process.
Again, it's not likely to change that often and, again, when it does
change it's likely to involve multiple changes. This can be driven
remotely by an incoming cache management request but is mostly going to
be driven by setting up or reconfiguring a volume's server list -
something that also isn't likely to happen often.
Make the "seq" counter odd on the 2nd pass, otherwise read_seqbegin_or_lock()
never takes the lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20231130115614.GA21581@redhat.com/
|
|
David Howells says:
(2) afs_lookup_volume_rcu().
There can be a lot of volumes known by a system. A thousand would
require a 10-step walk and this is drivable by remote operation, so I
think this should probably take a lock on the second pass too.
Make the "seq" counter odd on the 2nd pass, otherwise read_seqbegin_or_lock()
never takes the lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20231130115606.GA21571@redhat.com/
|
|
Automatically generate trace tag enums from the symbol -> string mapping
tables rather than having the enums as well, thereby reducing duplicated
data.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
|
|
checkpatch objects to whitespace before ')', so remove most of it from the
afs trace header.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
|
|
If prev_badblocks() returns '-1', it means no valid badblocks record
before the checking range. It doesn't make sense to check whether
the input checking range is overlapped with the non-existed invalid
front range.
This patch checkes whether 'prev >= 0' is true before calling
overlap_front(), to void such invalid operations.
Fixes: 3ea3354cb9f0 ("badblocks: improve badblocks_check() for multiple ranges handling")
Reported-and-tested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/nvdimm/3035e75a-9be0-4bc3-8d4a-6e52c207f277@leemhuis.info/
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Geliang Tang <geliang.tang@suse.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/r/20231224002820.20234-1-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- Fix a secondary CPUs enumeration regression caused by creative MADT
APIC table entries on certain systems.
- Fix a race in the NOP-patcher that can spuriously trigger crashes on
bootup.
- Fix a bootup failure regression caused by the parallel bringup code,
caused by firmware inconsistency between the APIC initialization
states of the boot and secondary CPUs, on certain systems.
* tag 'x86-urgent-2023-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/acpi: Handle bogus MADT APIC tables gracefully
x86/alternatives: Disable interrupts and sync when optimizing NOPs in place
x86/alternatives: Sync core before enabling interrupts
x86/smpboot/64: Handle X2APIC BIOS inconsistency gracefully
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Four small fixes, three in drivers with the core one adding a batch
indicator (for drivers which use it) to the error handler"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Let the sq_lock protect sq_tail_slot access
scsi: ufs: qcom: Return ufs_qcom_clk_scale_*() errors in ufs_qcom_clk_scale_notify()
scsi: core: Always send batch on reset or error handling command
scsi: bnx2fc: Fix skb double free in bnx2fc_rcv()
|
|
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
Pull backing file updates from Amir Goldstein:
These patches essentially just lift some overlayfs code to common code.
The motivation is to reuse common stacking code for the FUSE passthrough
patches that I am shaping up for upstream. The FUSE passthrough work
will be coming in over the next cycles.
I have been testing those patches with my fuse-backing-fd development
branch for quite some time and I think both you and Miklos gave a
conceptual ACK to some version of this work.
* tag 'ovl-vfs-6.8' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
fs: factor out backing_file_mmap() helper
fs: factor out backing_file_splice_{read,write}() helpers
fs: factor out backing_file_{read,write}_iter() helpers
fs: prepare for stackable filesystems backing file helpers
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB / Thunderbolt fixes from Greg KH:
"Here are some small bugfixes and new device ids for USB and
Thunderbolt drivers for 6.7-rc7. Included in here are:
- new usb-serial device ids
- thunderbolt driver fixes
- typec driver fix
- usb-storage driver quirk added
- fotg210 driver fix
All of these have been in linux-next with no reported issues"
* tag 'usb-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: serial: option: add Quectel EG912Y module support
USB: serial: ftdi_sio: update Actisense PIDs constant names
usb: fotg210-hcd: delete an incorrect bounds test
usb-storage: Add quirk for incorrect WP on Kingston DT Ultimate 3.0 G3
usb: typec: ucsi: fix gpio-based orientation detection
net: usb: ax88179_178a: avoid failed operations when device is disconnected
USB: serial: option: add Quectel RM500Q R13 firmware support
USB: serial: option: add Foxconn T99W265 with new baseline
thunderbolt: Fix minimum allocated USB 3.x and PCIe bandwidth
thunderbolt: Fix memory leak in margining_port_remove()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char / misc driver fixes from Greg KH:
"Here are a small number of various driver fixes for 6.7-rc7 that
normally come through the char-misc tree, and one debugfs fix as well.
Included in here are:
- iio and hid sensor driver fixes for a number of small things
- interconnect driver fixes
- brcm_nvmem driver fixes
- debugfs fix for previous fix
- guard() definition in device.h so that many subsystems can start
using it for 6.8-rc1 (requested by Dan Williams to make future
merges easier)
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (21 commits)
debugfs: initialize cancellations earlier
Revert "iio: hid-sensor-als: Add light color temperature support"
Revert "iio: hid-sensor-als: Add light chromaticity support"
nvmem: brcm_nvram: store a copy of NVRAM content
dt-bindings: nvmem: mxs-ocotp: Document fsl,ocotp
driver core: Add a guard() definition for the device_lock()
interconnect: qcom: icc-rpm: Fix peak rate calculation
iio: adc: MCP3564: fix hardware identification logic
iio: adc: MCP3564: fix calib_bias and calib_scale range checks
iio: adc: meson: add separate config for axg SoC family
iio: adc: imx93: add four channels for imx93 adc
iio: adc: ti_am335x_adc: Fix return value check of tiadc_request_dma()
interconnect: qcom: sm8250: Enable sync_state
iio: triggered-buffer: prevent possible freeing of wrong buffer
iio: imu: inv_mpu6050: fix an error code problem in inv_mpu6050_read_raw
iio: imu: adis16475: use bit numbers in assign_bit()
iio: imu: adis16475: add spi_device_id table
iio: tmag5273: fix temperature offset
interconnect: Treat xlate() returning NULL node as an error
iio: common: ms_sensors: ms_sensors_i2c: fix humidity conversion time table
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
- a quirk to AT keyboard driver to skip issuing "GET ID" command when
8042 is in translated mode and the device is a laptop/portable,
because the "GET ID" command makes a bunch of recent laptops unhappy
- a quirk to i8042 to disable multiplexed mode on Acer P459-G2-M which
causes issues on resume
- psmouse will activate native RMI4 protocol support for touchpad on
ThinkPad L14 G1
- addition of Razer Wolverine V2 ID to xpad gamepad driver
- mapping for airplane mode button in soc_button_array driver for
TUXEDO laptops
- improved error handling in ipaq-micro-keys driver
- amimouse being prepared for platform remove callback returning void
* tag 'input-for-v6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: soc_button_array - add mapping for airplane mode button
Input: xpad - add Razer Wolverine V2 support
Input: ipaq-micro-keys - add error handling for devm_kmemdup
Input: amimouse - convert to platform remove callback returning void
Input: i8042 - add nomux quirk for Acer P459-G2-M
Input: atkbd - skip ATKBD_CMD_GETID in translated mode
Input: psmouse - enable Synaptics InterTouch for ThinkPad L14 G1
|
|
check_preempt_wakeup_fair()
This variable became unused in:
5e963f2bd465 ("sched/fair: Commit to EEVDF")
Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/202312141319+0800-wangjinchao@xfusion.com
|
|
Running N CPU-bound tasks on an N CPUs platform:
- with asymmetric CPU capacity
- not being a DynamIq system (i.e. having a PKG level sched domain
without the SD_SHARE_PKG_RESOURCES flag set)
.. might result in a task placement where two tasks run on a big CPU
and none on a little CPU. This placement could be more optimal by
using all CPUs.
Testing platform:
Juno-r2:
- 2 big CPUs (1-2), maximum capacity of 1024
- 4 little CPUs (0,3-5), maximum capacity of 383
Testing workload ([1]):
Spawn 6 CPU-bound tasks. During the first 100ms (step 1), each tasks
is affine to a CPU, except for:
- one little CPU which is left idle.
- one big CPU which has 2 tasks affine.
After the 100ms (step 2), remove the cpumask affinity.
Behavior before the patch:
During step 2, the load balancer running from the idle CPU tags sched
domains as:
- little CPUs: 'group_has_spare'. Cf. group_has_capacity() and
group_is_overloaded(), 3 CPU-bound tasks run on a 4 CPUs
sched-domain, and the idle CPU provides enough spare capacity
regarding the imbalance_pct
- big CPUs: 'group_overloaded'. Indeed, 3 tasks run on a 2 CPUs
sched-domain, so the following path is used:
group_is_overloaded()
\-if (sgs->sum_nr_running <= sgs->group_weight) return true;
The following path which would change the migration type to
'migrate_task' is not taken:
calculate_imbalance()
\-if (env->idle != CPU_NOT_IDLE && env->imbalance == 0)
as the local group has some spare capacity, so the imbalance
is not 0.
The migration type requested is 'migrate_util' and the busiest
runqueue is the big CPU's runqueue having 2 tasks (each having a
utilization of 512). The idle little CPU cannot pull one of these
task as its capacity is too small for the task. The following path
is used:
detach_tasks()
\-case migrate_util:
\-if (util > env->imbalance) goto next;
After the patch:
As the number of failed balancing attempts grows (with
'nr_balance_failed'), progressively make it easier to migrate
a big task to the idling little CPU. A similar mechanism is
used for the 'migrate_load' migration type.
Improvement:
Running the testing workload [1] with the step 2 representing
a ~10s load for a big CPU:
Before patch: ~19.3s
After patch: ~18s (-6.7%)
Similar issue reported at:
https://lore.kernel.org/lkml/20230716014125.139577-1-qyousef@layalina.io/
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Qais Yousef <qyousef@layalina.io>
Link: https://lore.kernel.org/r/20231206090043.634697-1-pierre.gondois@arm.com
|
|
With UTIL_EST_FASTUP now being permanent, we can take advantage of the
fact that the ewma jumps directly to a higher utilization at dequeue to
simplify util_est and remove the enqueued field.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/20231201161652.1241695-3-vincent.guittot@linaro.org
|
|
sched_feat(UTIL_EST_FASTUP) has been added to easily disable the feature
in order to check for possibly related regressions. After 3 years, it has
never been used and no regression has been reported. Let's remove it
and make fast increase a permanent behavior.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Yanteng Si <siyanteng@loongson.cn> [for the Chinese translation]
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/20231201161652.1241695-2-vincent.guittot@linaro.org
|
|
Use the new capacity_ref_freq() method to set the ratio that is used by AMU for
computing the arch_scale_freq_capacity().
This helps to keep everything aligned using the same reference for
computing CPUs capacity.
The default value of the ratio (stored in per_cpu(arch_max_freq_scale))
ensures that arch_scale_freq_capacity() returns max capacity until it is
set to its correct value with the cpu capacity and capacity_ref_freq().
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20231211104855.558096-8-vincent.guittot@linaro.org
|
|
Save the frequency associated to the performance that has been used when
initializing the capacity of CPUs.
Also, cppc cpufreq driver can register an artificial energy model. In such
case, it needs the frequency for this compute capacity.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/r/20231211104855.558096-7-vincent.guittot@linaro.org
|
|
Move and rename cppc_cpufreq_perf_to_khz() and cppc_cpufreq_khz_to_perf() to
use them outside cppc_cpufreq in topology_init_cpu_capacity_cppc().
Modify the interface to use struct cppc_perf_caps *caps instead of
struct cppc_cpudata *cpu_data as we only use the fields of cppc_perf_caps.
cppc_cpufreq was converting the lowest and nominal freq from MHz to kHz
before using them. We move this conversion inside cppc_perf_to_khz and
cppc_khz_to_perf to make them generic and usable outside cppc_cpufreq.
No functional change
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/r/20231211104855.558096-6-vincent.guittot@linaro.org
|
|
The last item of a performance domain is not always the performance point
that has been used to compute CPU's capacity. This can lead to different
target frequency compared with other part of the system like schedutil and
would result in wrong energy estimation.
A new arch_scale_freq_ref() is available to return a fixed and coherent
frequency reference that can be used when computing the CPU's frequency
for an level of utilization. Use this function to get this reference
frequency.
Energy model is never used without defining arch_scale_freq_ref() but
can be compiled. Define a default arch_scale_freq_ref() returning 0
in such case.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lore.kernel.org/r/20231211104855.558096-5-vincent.guittot@linaro.org
|
|
cpuinfo.max_freq can change at runtime because of boost as an example. This
implies that the value could be different than the one that has been
used when computing the capacity of a CPU.
The new arch_scale_freq_ref() returns a fixed and coherent reference
frequency that can be used when computing a frequency based on utilization.
Use this arch_scale_freq_ref() when available and fallback to
policy otherwise.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/r/20231211104855.558096-4-vincent.guittot@linaro.org
|
|
cpuinfo.max_freq can change at runtime because of boost as an example. This
implies that the value could be different from the frequency that has been
used to compute the capacity of a CPU.
The new arch_scale_freq_ref() returns a fixed and coherent frequency
that can be used to compute the capacity for a given frequency.
[ Also fix a arch_set_freq_scale() newline style wart in <linux/cpufreq.h>. ]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231211104855.558096-3-vincent.guittot@linaro.org
|
|
Create a new method to get a unique and fixed max frequency. Currently
cpuinfo.max_freq or the highest (or last) state of performance domain are
used as the max frequency when computing the frequency for a level of
utilization, but:
- cpuinfo_max_freq can change at runtime. boost is one example of
such change.
- cpuinfo.max_freq and last item of the PD can be different leading to
different results between cpufreq and energy model.
We need to save the reference frequency that has been used when computing
the CPUs capacity and use this fixed and coherent value to convert between
frequency and CPU's capacity.
In fact, we already save the frequency that has been used when computing
the capacity of each CPU. We extend the precision to save kHz instead of
MHz currently and we modify the type to be aligned with other variables
used when converting frequency to capacity and the other way.
[ mingo: Minor edits. ]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://lore.kernel.org/r/20231211104855.558096-2-vincent.guittot@linaro.org
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Assert that the file object is allocated in a backing_file container
so that file_user_path() could be used to display the user path and
not the backing file's path in /proc/<pid>/maps.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
There is not much in those helpers, but it makes sense to have them
logically next to the backing_file_{read,write}_iter() helpers as they
may grow more common logic in the future.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
Overlayfs submits files io to backing files on other filesystems.
Factor out some common helpers to perform io to backing files, into
fs/backing-file.c.
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/r/CAJfpeguhmZbjP3JLqtUy0AdWaHOkAPWeP827BBWwRFEAUgnUcQ@mail.gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
In preparation for factoring out some backing file io helpers from
overlayfs, move backing_file_open() into a new file fs/backing-file.c
and header.
Add a MAINTAINERS entry for stackable filesystems and add a Kconfig
FS_STACK which stackable filesystems need to select.
For now, the backing_file struct, the backing_file alloc/free functions
and the backing_file_real_path() accessor remain internal to file_table.c.
We may change that in the future.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
Commit 56769ba4b297 ("kbuild: unify vdso_install rules") accidentally
dropped the '.debug' suffix from the build ID symlinks.
Fixes: 56769ba4b297 ("kbuild: unify vdso_install rules")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|