Age | Commit message (Collapse) | Author |
|
Add a tracepoint to log calls to afs_make_call(), including the destination
server address.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
The current code assumes that offline and busy volume states apply to all
instances of a volume, not just the one on the server that returned
VOFFLINE or VBUSY and will emit a notice to dmesg suggesting that the
entire volume is unavailable.
Fix that by moving the flags recording this to the afs_server_entry struct
that is used to represent a particular instance of a volume on a specific
server. The notice is altered to include the server UUID also.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Fix the fileserver rotation so that it doesn't use RTT as the basis for
deciding which server and address to use as this doesn't necessarily give a
good indication of the best path. Instead, use the configurable preference
list in conjunction with whatever probes have succeeded at the time of
looking.
To this end, make the following changes:
(1) Keep an array of "server states" to track what addresses we've tried
on each server and move the waitqueue entries there that we'll need
for probing.
(2) Each afs_server_state struct is made to pin the corresponding server's
endpoint state rather than the afs_operation struct carrying a pin on
the server we're currently looking at.
(3) Drop the server list preference; we now always rescan the server list.
(4) afs_wait_for_probes() now uses the server state list to guide it in
what it waits for (and to provide the waitqueue entries) and returns
an indication of whether we'd got a response, run out of responsive
addresses or the endpoint state had been superseded and we need to
restart the iteration.
(5) Call afs_get_address_preferences*() occasionally to refresh the
preference values.
(6) When picking a server, scan the addresses of the servers for which we
have as-yet untested communications, looking for the highest priority
one and use that instead of trying all the addresses for a particular
server in ascending-RTT order.
(7) When a Busy or Offline state is seen across all available servers, do
a short sleep.
(8) If we detect that we accessed a future RO volume version whilst it is
undergoing replication, reissue the op against the older version until
at least half of the servers are replicated.
(9) Whilst RO replication is ongoing, increase the frequency of Volume
Location server checks for that volume to every ten minutes instead of
hourly.
Also add a tracepoint to track progress through the rotation algorithm.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Overhaul the third party-induced invalidation handling, making use of the
previously added volume-level event counters (cb_scrub and cb_ro_snapshot)
that are now being parsed out of the VolSync record returned by the
fileserver in many of its replies.
This allows better handling of RO (and Backup) volumes. Since these are
snapshot of a RW volume that are updated atomically simultantanously across
all servers that host them, they only require a single callback promise for
the entire volume. The currently upstream code assumes that RO volumes
operate in the same manner as RW volumes, and that each file has its own
individual callback - which means that it does a status fetch for *every*
file in a RO volume, whether or not the volume got "released" (volume
callback breaks can occur for other reasons too, such as the volumeserver
taking ownership of a volume from a fileserver).
To this end, make the following changes:
(1) Change the meaning of the volume's cb_v_break counter so that it is
now a hint that we need to issue a status fetch to work out the state
of a volume. cb_v_break is incremented by volume break callbacks and
by server initialisation callbacks.
(2) Add a second counter, cb_v_check, to the afs_volume struct such that
if this differs from cb_v_break, we need to do a check. When the
check is complete, cb_v_check is advanced to what cb_v_break was at
the start of the status fetch.
(3) Move the list of mmap'd vnodes to the volume and trigger removal of
PTEs that map to files on a volume break rather than on a server
break.
(4) When a server reinitialisation callback comes in, use the
server-to-volume reverse mapping added in a preceding patch to iterate
over all the volumes using that server and clear the volume callback
promises for that server and the general volume promise as a whole to
trigger reanalysis.
(5) Replace the AFS_VNODE_CB_PROMISED flag with an AFS_NO_CB_PROMISE
(TIME64_MIN) value in the cb_expires_at field, reducing the number of
checks we need to make.
(6) Change afs_check_validity() to quickly see if various event counters
have been incremented or if the vnode or volume callback promise is
due to expire/has expired without making any changes to the state.
That is now left to afs_validate() as this may get more complicated in
future as we may have to examine server records too.
(7) Overhaul afs_validate() so that it does a single status fetch if we
need to check the state of either the vnode or the volume - and do so
under appropriate locking. The function does the following steps:
(A) If the vnode/volume is no longer seen as valid, then we take the
vnode validation lock and, if the volume promise has expired, the
volume check lock also. The latter prevents redundant checks being
made to find out if a new version of the volume got released.
(B) If a previous RPC call found that the volsync changed unexpectedly
or that a RO volume was updated, then we unmap all PTEs pointing to
the file to stop mmap being used for access.
(C) If the vnode is still seen to be of uncertain validity, then we
perform an FS.FetchStatus RPC op to jointly update the volume status
and the vnode status. This assessment is done as part of parsing the
reply:
If the RO volume creation timestamp advances, cb_ro_snapshot is
incremented; if either the creation or update timestamps changes in
an unexpected way, the cb_scrub counter is incremented
If the Data Version returned doesn't match the copy we have
locally, then we ask for the pagecache to be zapped. This takes
care of handling RO update.
(D) If cb_scrub differs between volume and vnode, the vnode's
pagecache is zapped and the vnode's cb_scrub is updated unless the
file is marked as having been deleted.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
A number of fileserver RPC operations return a VolSync record as part of
their reply that gives some information about the state of the volume being
accessed, including:
(1) A volume Creation timestamp. For an RW volume, this is the time at
which the volume was created; if it changes, the RW volume was
presumably restored from a backup and all cached data should be
scrubbed as Data Version numbers could regress on the files in the
volume.
For an RO volume, this is the time it was last snapshotted from the RW
volume. It is expected to advance each time this happens; if it
regresses, cached data should be scrubbed.
(2) A volume Update timestamp (Auristor only). For an RW volume, this is
updated any time any change is made to a volume or its contents. If
it regresses, all cached data must be scrubbed.
For an RO volume, this is a copy of the RW volume's Update timestamp
at the point of snapshotting. It can be used as a version number when
checking to see if a callback on a RO volume was due to a snapshot.
If it regresses, all cached data must be scrubbed.
but this is currently not made use of by the in-kernel afs filesystem.
Make the afs filesystem use this by:
(1) Add an update time field to the afs_volsync struct and use a value of
TIME64_MIN in both that and the creation time to indicate that they
are unset.
(2) Add creation and update time fields to the afs_volume struct and use
this to track the two timestamps.
(3) Add a volsync_lock mutex to the afs_volume struct to control
modification access for when we detect a change in these values.
(3) Add a 'pre-op volsync' struct to the afs_operation struct to record
the state of the volume tracking before the op.
(4) Add a new counter, cb_scrub, to the afs_volume struct to count events
that require all data to be scrubbed. A copy is placed in the
afs_vnode struct (inode) and if they no longer match, a scrub takes
place.
(5) When the result of an operation is being parsed, parse the VolSync
data too, if it is provided. Note that the two timestamps are handled
separately, since they don't work in quite the same way.
- If the afs_volume tracking is unset, just set it and do nothing
else.
- If the result timestamps are the same as the ones in afs_volume, do
nothing.
- If the timestamps regress, increment cb_scrub if not already done
so.
- If the creation timestamp on a RW volume changes, increment cb_scrub
if not already done so.
- If the creation timestamp on a RO volume advances, update the server
list and see if the current server has been excluded, if so reissue
the op. Once over half of the replication sites have been updated,
increment cb_ro_snapshot to indicate updates may be required and
switch over to excluding unupdated replication sites.
- If the creation timestamp on a Backup volume advances, just
increment cb_ro_snapshot to trigger updates.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Don't leave servers that are marked VLSF_DONTUSE or VLSF_NEWREPSITE out of
the server list for a volume; rather, mark DONTUSE ones excluded and mark
either NEWREPSITE excluded if the number of updated servers is <50% of the
usable servers or mark !NEWREPSITE excluded otherwise.
Mark the server list as a whole with a 3-state flag to indicate whether we
think the RW volume is being replicated to the RO volume, and, if so,
whether we should switch to using updated replication sites
(VLSF_NEWREPSITE) or stick with the old for now.
This processing is pushed up from the VLDB RPC reply parser to the code
that generates the server list from that information.
Doing this allows the old list to be kept with just the exclusion flags
replaced and to keep the server records pinned and maintained.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Fix the comment in afs_do_lookup() that says that slot 0 is used for the
fid being looked up and slot 1 is used for the directory. It's actually
done the other way round.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Apply server breaks to mmap'd files that are being used from that server
from the call processor work function rather than punting it off to a
workqueue. The work item, afs_server_init_callback(), then bumps each
individual inode off to its own work item introducing a potentially lengthy
delay. This reduces that delay at the cost of extending the amount of time
we delay replying to the CB.InitCallBack3 notification RPC from the server.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Move the code that does validity checking of vnodes and volumes with
respect to third-party changes into its own file.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Defer volume record destruction to a workqueue so that afs_put_volume()
isn't going to run the destruction process in the callback workqueue whilst
the server is holding up other clients whilst waiting for us to reply to a
CB.CallBack notification RPC.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Make it possible to find the afs_volume structs that are using an
afs_server struct to aid in breaking volume callbacks.
The way this is done is that each afs_volume already has an array of
afs_server_entry records that point to the servers where that volume might
be found. An afs_volume backpointer and a list node is added to each entry
and each entry is then added to an RCU-traversable list on the afs_server
to which it points.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Combine the endpoint state bool-type members into a bitmask so that some of
them can be waited upon more easily.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Keep a record of the current fileserver endpoint state, including the probe
state, and replace it when a new probe is started rather than just
squelching the old state and overwriting it. Clearance of the old state
can cause a race if there's another thread also currently trying to
communicate with that server.
It appears that this race might be the culprit for some occasions where
kafs complains about invalid data in the RPC reply because the rotation
algorithm fell all the way through without actually issuing an RPC call and
the error return got filled in from the probe state (which has a zero error
recorded). Whatever happens to be in the caller's reply buffer is then
taken as the response.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
When probing all the addresses for a volume location server, dispatch them
in order of descending priority to try and get back highest priority one
first.
Also add a tracepoint to show the transmission and completion of the
probes.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
When probing all the addresses for a fileserver, dispatch them in order of
descending priority to try and get back highest priority one first.
Also add a tracepoint to show the transmission and completion of the
probes.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Add a field to each address in an address list (afs_addr_list struct) that
records the current priority for that address according to the address
preference table. We don't want to do this every time we use an address
list, so the version number of the address preference table is recorded in
the address list too and we only re-mark the list when we see the version
change.
These numbers are then displayed through /proc/net/afs/servers.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
AFS servers may have multiple addresses, but the client can't easily judge
between them as to which one is best. For instance, an address that has a
larger RTT might actually have a better bandwidth because it goes through a
switch rather than being directly connected - but we can't work this out
dynamically unless we push through sufficient data that we can measure it.
To allow the administrator to configure this, add a list of preference
weightings for server addresses by IPv4/IPv6 address or subnet and allow
this to be viewed through a procfile and altered by writing text commands
to that same file. Preference rules can be added/updated by:
echo "add <proto> <addr>[/<subnet>] <prior>" >/proc/fs/afs/addr_prefs
echo "add udp 1.2.3.4 1000" >/proc/fs/afs/addr_prefs
echo "add udp 192.168.0.0/16 3000" >/proc/fs/afs/addr_prefs
echo "add udp 1001:2002:0:6::/64 4000" >/proc/fs/afs/addr_prefs
and removed by:
echo "del <proto> <addr>[/<subnet>]" >/proc/fs/afs/addr_prefs
echo "del udp 1.2.3.4" >/proc/fs/afs/addr_prefs
where the priority is a number between 0 and 65535.
The list is split between IPv4 and IPv6 addresses and each sublist is kept
in numerical order, with rules that would otherwise match but have
different subnet masking being ordered with the most specific submatch
first.
A subsequent patch will apply these rules.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Remove afs_cmp_addr_list() as it was never implemented.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
In /proc/net/afs/servers, show the cell name and the last error for each
address in the server's list.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
I am stepping down as TJA11XX C45 maintainer.
Andrei Botila will take the responsibility to maintain and improve the
support for TJA11XX C45 PHYs.
Signed-off-by: Radu Pirea (NXP OSS) <radu-nicolae.pirea@oss.nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add support for client processors starting from Kaby Lake.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lore.kernel.org/r/20231222203957.1348043-1-srinivas.pandruvada@linux.intel.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
It seems traffic there is quite low and changes are often not related
to PDx86 anyhow. Besides that I have a lot of other stuff to do, I'm
rearly pay attention on these emails. Doesn't seem Daren to be active
either. With this in mind, remove (stale) section.
Note, it might be make sense to actually move that folder under PDx86
umbrella (in MAINTAINERS) if people find it suitable. That will reduce
burden on arch/x86 maintenance.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20231222144453.2888706-1-andriy.shevchenko@linux.intel.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
Some r8168 NICs stop working upon system resume:
[ 688.051096] r8169 0000:02:00.1 enp2s0f1: rtl_ep_ocp_read_cond == 0 (loop: 10, delay: 10000).
[ 688.175131] r8169 0000:02:00.1 enp2s0f1: Link is Down
...
[ 691.534611] r8169 0000:02:00.1 enp2s0f1: PCI error (cmd = 0x0407, status_errs = 0x0000)
Not sure if it's related, but those NICs have a BMC device at function
0:
02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Realtek RealManage BMC [10ec:816e] (rev 1a)
Trial and error shows that increase the loop wait on
rtl_ep_ocp_read_cond to 30 can eliminate the issue, so let
rtl8168ep_driver_start() to wait a bit longer.
Fixes: e6d6ca6e1204 ("r8169: Add support for another RTL8168FP")
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The freeing and re-allocation of algorithm are protected by cpool_mutex,
so it doesn't fix an actual use-after-free, but avoids a deserved
refcount_warn_saturate() warning.
A trivial fix for the racy behavior.
Fixes: 8c73b26315aa ("net/tcp: Prepare tcp_md5sig_pool for TCP-AO")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
m->data needs to be freed when em_text_destroy is called.
Fixes: d675c989ed2d ("[PKT_SCHED]: Packet classification based on textsearch (ematch)")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
When parsing emails from .yaml files in particular, stray punctuation
such as a leading '-' can end up in the name. For example, consider a
common YAML section such as:
maintainers:
- devicetree@vger.kernel.org
This would previously be processed by get_maintainer.pl as:
- <devicetree@vger.kernel.org>
Make the logic in clean_file_emails more robust by deleting any
sub-names which consist of common single punctuation marks before
proceeding to the best-effort name extraction logic. The output is then
correct:
devicetree@vger.kernel.org
Some additional comments are added to the function to make things
clearer to future readers.
Link: https://lore.kernel.org/all/0173e76a36b3a9b4e7f324dd3a36fd4a9757f302.camel@perches.com/
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
While the script correctly extracts UTF-8 encoded names from the
MAINTAINERS file, the regular expressions damage my name when parsing
from .yaml files. Fix this by replacing the Latin-1-compatible regular
expressions with the unicode property matcher \p{L}, which matches on
any letter according to the Unicode General Category of letters.
The proposed solution only works if the script uses proper string
encoding from the outset, so instruct Perl to unconditionally open all
files with UTF-8 encoding. This should be safe, as the entire source
tree is either UTF-8 or ASCII encoded anyway. See [1] for a detailed
analysis.
Furthermore, to prevent the \w expression from matching non-ASCII when
checking for whether a name should be escaped with quotes, add the /a
flag to the regular expression. The escaping logic was duplicated in
two places, so it has been factored out into its own function.
The original issue was also identified on the tools mailing list [2].
This should solve the observed side effects there as well.
Link: https://lore.kernel.org/all/dzn6uco4c45oaa3ia4u37uo5mlt33obecv7gghj2l756fr4hdh@mt3cprft3tmq/ [1]
Link: https://lore.kernel.org/tools/20230726-gush-slouching-a5cd41@meerkat/ [2]
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix readers that are blocked on the ring buffer when buffer_percent
is 100%. They are supposed to wake up when the buffer is full, but
because the sub-buffer that the writer is on is never considered
"dirty" in the calculation, dirty pages will never equal nr_pages.
Add +1 to the dirty count in order to count for the sub-buffer that
the writer is on.
- When a reader is blocked on the "snapshot_raw" file, it is to be
woken up when a snapshot is done and be able to read the snapshot
buffer. But because the snapshot swaps the buffers (the main one with
the snapshot one), and the snapshot reader is waiting on the old
snapshot buffer, it was not woken up (because it is now on the main
buffer after the swap). Worse yet, when it reads the buffer after a
snapshot, it's not reading the snapshot buffer, it's reading the live
active main buffer.
Fix this by forcing a wakeup of all readers on the snapshot buffer
when a new snapshot happens, and then update the buffer that the
reader is reading to be back on the snapshot buffer.
- Fix the modification of the direct_function hash. There was a race
when new functions were added to the direct_function hash as when it
moved function entries from the old hash to the new one, a direct
function trace could be hit and not see its entry.
This is fixed by allocating the new hash, copy all the old entries
onto it as well as the new entries, and then use rcu_assign_pointer()
to update the new direct_function hash with it.
This also fixes a memory leak in that code.
- Fix eventfs ownership
* tag 'trace-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Fix modification of direct_function hash while in use
tracing: Fix blocked reader of snapshot buffer
ring-buffer: Fix wake ups when buffer_percent is set to 100
eventfs: Fix file and directory uid and gid ownership
|
|
Directly return NULL or 'next' instead of breaking out of the loop.
Signed-off-by: David Laight <david.laight@aculab.com>
[ Split original patch into two independent parts - Linus ]
Link: https://lore.kernel.org/lkml/7c8828aec72e42eeb841ca0ee3397e9a@AcuMS.aculab.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
osq_wait_next() is passed 'prev' from osq_lock() and NULL from
osq_unlock() but only needs the 'cpu' value to write to lock->tail.
Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
Should have no effect on the generated code since gcc manages to assume
that 'prev != NULL' due to an earlier dereference.
Signed-off-by: David Laight <david.laight@aculab.com>
[ Changed 'old' to 'old_cpu' by request from Waiman Long - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
struct optimistic_spin_node is private to the implementation.
Move it into the C file to ensure nothing is accessing it.
Signed-off-by: David Laight <david.laight@aculab.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Masami Hiramatsu reported a memory leak in register_ftrace_direct() where
if the number of new entries are added is large enough to cause two
allocations in the loop:
for (i = 0; i < size; i++) {
hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
new = ftrace_add_rec_direct(entry->ip, addr, &free_hash);
if (!new)
goto out_remove;
entry->direct = addr;
}
}
Where ftrace_add_rec_direct() has:
if (ftrace_hash_empty(direct_functions) ||
direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
struct ftrace_hash *new_hash;
int size = ftrace_hash_empty(direct_functions) ? 0 :
direct_functions->count + 1;
if (size < 32)
size = 32;
new_hash = dup_hash(direct_functions, size);
if (!new_hash)
return NULL;
*free_hash = direct_functions;
direct_functions = new_hash;
}
The "*free_hash = direct_functions;" can happen twice, losing the previous
allocation of direct_functions.
But this also exposed a more serious bug.
The modification of direct_functions above is not safe. As
direct_functions can be referenced at any time to find what direct caller
it should call, the time between:
new_hash = dup_hash(direct_functions, size);
and
direct_functions = new_hash;
can have a race with another CPU (or even this one if it gets interrupted),
and the entries being moved to the new hash are not referenced.
That's because the "dup_hash()" is really misnamed and is really a
"move_hash()". It moves the entries from the old hash to the new one.
Now even if that was changed, this code is not proper as direct_functions
should not be updated until the end. That is the best way to handle
function reference changes, and is the way other parts of ftrace handles
this.
The following is done:
1. Change add_hash_entry() to return the entry it created and inserted
into the hash, and not just return success or not.
2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove
the former.
3. Allocate a "new_hash" at the start that is made for holding both the
new hash entries as well as the existing entries in direct_functions.
4. Copy (not move) the direct_function entries over to the new_hash.
5. Copy the entries of the added hash to the new_hash.
6. If everything succeeds, then use rcu_pointer_assign() to update the
direct_functions with the new_hash.
This simplifies the code and fixes both the memory leak as well as the
race condition mentioned above.
Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/
Link: https://lore.kernel.org/linux-trace-kernel/20231229115134.08dd5174@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 763e34e74bb7d ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
In
https://lore.kernel.org/r/20231206110636.GBZXBVvCWj2IDjVk4c@fat_crate.local
I wanted to adjust the alternative patching debug output to the new
changes introduced by
da0fe6e68e10 ("x86/alternative: Add indirect call patching")
but removed the '*' which denotes the ->x86_capability word. The correct
output should be, for example:
[ 0.230071] SMP alternatives: feat: 11*32+15, old: (entry_SYSCALL_64_after_hwframe+0x5a/0x77 (ffffffff81c000c2) len: 16), repl: (ffffffff89ae896a, len: 5) flags: 0x0
while the incorrect one says "... 1132+15" currently.
Add back the '*'.
Fixes: da0fe6e68e10 ("x86/alternative: Add indirect call patching")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231206110636.GBZXBVvCWj2IDjVk4c@fat_crate.local
|
|
The HP Pavilion 14 ec1xxx series uses the HP mainboard 8A0F with the
ALC287 codec.
The mute led can be enabled using the already existing
ALC287_FIXUP_HP_GPIO_LED quirk.
Tested on an HP Pavilion ec1003AU
Signed-off-by: Aabish Malik <aabishmalik3337@gmail.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20231229170352.742261-3-aabishmalik3337@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
During our experiment with zswap, we sometimes observe swap IOs due to
occasional zswap store failures and writebacks-to-swap. These swapping
IOs prevent many users who cannot tolerate swapping from adopting zswap to
save memory and improve performance where possible.
This patch adds the option to disable this behavior entirely: do not
writeback to backing swapping device when a zswap store attempt fail, and
do not write pages in the zswap pool back to the backing swap device (both
when the pool is full, and when the new zswap shrinker is called).
This new behavior can be opted-in/out on a per-cgroup basis via a new
cgroup file. By default, writebacks to swap device is enabled, which is
the previous behavior. Initially, writeback is enabled for the root
cgroup, and a newly created cgroup will inherit the current setting of its
parent.
Note that this is subtly different from setting memory.swap.max to 0, as
it still allows for pages to be stored in the zswap pool (which itself
consumes swap space in its current form).
This patch should be applied on top of the zswap shrinker series:
https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
as it also disables the zswap shrinker, a major source of zswap
writebacks.
For the most part, this feature is motivated by internal parties who
have already established their opinions regarding swapping - the
workloads that are highly sensitive to IO, and especially those who are
using servers with really slow disk performance (for instance, massive
but slow HDDs). For these folks, it's impossible to convince them to
even entertain zswap if swapping also comes as a packaged deal.
Writeback disabling is quite a useful feature in these situations - on
a mixed workloads deployment, they can disable writeback for the more
IO-sensitive workloads, and enable writeback for other background
workloads.
For instance, on a server with HDD, I allocate memories and populate
them with random values (so that zswap store will always fail), and
specify memory.high low enough to trigger reclaim. The time it takes
to allocate the memories and just read through it a couple of times
(doing silly things like computing the values' average etc.):
zswap.writeback disabled:
real 0m30.537s
user 0m23.687s
sys 0m6.637s
0 pages swapped in
0 pages swapped out
zswap.writeback enabled:
real 0m45.061s
user 0m24.310s
sys 0m8.892s
712686 pages swapped in
461093 pages swapped out
(the last two lines are from vmstat -s).
[nphamcs@gmail.com: add a comment about recurring zswap store failures leading to reclaim inefficiency]
Link: https://lkml.kernel.org/r/20231221005725.3446672-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231207192406.3809579-1-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Heidelberg <david@ixit.cz>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Under heavy traffic, the BlueField Gigabit interface can
become unresponsive. This is due to a possible race condition
in the mlxbf_gige_rx_packet function, where the function exits
with producer and consumer indices equal but there are remaining
packet(s) to be processed. In order to prevent this situation,
read receive consumer index *before* the HW replenish so that
the mlxbf_gige_rx_packet function returns an accurate return
value even if a packet is received into just-replenished buffer
prior to exiting this routine. If the just-replenished buffer
is received and occupies the last RX ring entry, the interface
would not recover and instead would encounter RX packet drops
related to internal buffer shortages since the driver RX logic
is not being triggered to drain the RX ring. This patch will
address and prevent this "ring full" condition.
Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
Reviewed-by: Asmaa Mnebhi <asmaa@nvidia.com>
Signed-off-by: David Thompson <davthompson@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Remove duplicate definitions, no functional changes.
Link: https://lkml.kernel.org/r/MW4PR84MB3145459ADC7EB38BBB36955B8198A@MW4PR84MB3145.NAMPRD84.PROD.OUTLOOK.COM
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Reported-by: Huacai Chen <chenhuacai@loongson.cn>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When the kernel log is acquired over a serial cable it is not uncommon for
the log to contain carriage return characters, in addition to the expected
line feeds.
When this output is feed into decode_stacktrace.sh, handle_line() fails to
strip the trailing ']' off the module name, which results in find_module()
not being able to find the referred to kernel module. This is reported to
the user as:
WARNING! Modules path isn't set, but is needed to parse this symbol
The solution is to reconfigure the serial port, or to strip the carriage
returns from the log, but this isn't obvious from the error reported by
the script.
Instead, make decode_stacktrace.sh more user friendly by stripping the
trailing carriage return.
Link: https://lkml.kernel.org/r/20231225-decode-stacktrace-cr-v1-1-9f306f38cdde@quicinc.com
Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If, as part of handling a hardlockup or softlockup, we've already dumped
all CPUs and we're just about to panic, don't reenable dumping and give
some other CPU a chance to hop in there and add some confusing logs right
as the panic is happening.
Link: https://lkml.kernel.org/r/20231220131534.4.Id3a9c7ec2d7d83e4080da6f8662ba2226b40543f@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If two CPUs end up reporting a hardlockup at the same time then their logs
could get interleaved which is hard to read.
The interleaving problem was especially bad with the "perf" hardlockup
detector where the locked up CPU is always the same as the running CPU and
we end up in show_regs(). show_regs() has no inherent serialization so we
could mix together two crawls if two hardlockups happened at the same time
(and if we didn't have `sysctl_hardlockup_all_cpu_backtrace` set). With
this change we'll fully serialize hardlockups when using the "perf"
hardlockup detector.
The interleaving problem was less bad with the "buddy" hardlockup
detector. With "buddy" we always end up calling
`trigger_single_cpu_backtrace(cpu)` on some CPU other than the running
one. trigger_single_cpu_backtrace() always at least serializes the
individual stack crawls because it eventually uses
printk_cpu_sync_get_irqsave(). Unfortunately the fact that
trigger_single_cpu_backtrace() eventually calls
printk_cpu_sync_get_irqsave() (on a different CPU) means that we have to
drop the "lock" before calling it and we can't fully serialize all
printouts associated with a given hardlockup. However, we still do get
the advantage of serializing the output of print_modules() and
print_irqtrace_events().
Aside from serializing hardlockups from each other, this change also has
the advantage of serializing hardlockups and softlockups from each other
if they happen to happen at the same time since they are both using the
same "lock".
Even though nobody is expected to hang while holding the lock associated
with printk_cpu_sync_get_irqsave(), out of an abundance of caution, we
don't call printk_cpu_sync_get_irqsave() until after we print out about
the hardlockup. This makes extra sure that, even if
printk_cpu_sync_get_irqsave() somehow never runs we at least print that we
saw the hardlockup. This is different than the choice made for softlockup
because hardlockup is really our last resort.
Link: https://lkml.kernel.org/r/20231220131534.3.I6ff691b3b40f0379bc860f80c6e729a0485b5247@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Instead of introducing a spinlock, use printk_cpu_sync_get_irqsave() and
printk_cpu_sync_put_irqrestore() to serialize softlockup reporting. Alone
this doesn't have any real advantage over the spinlock, but this will
allow us to use the same function in a future change to also serialize
hardlockup crawls.
NOTE: for the most part this serialization is important because we often
end up in the show_regs() path and that has no built-in serialization if
there are multiple callers at once. However, even in the case where we
end up in the dump_stack() path this still has some advantages because the
stack will be guaranteed to be together in the logs with the lockup
message with no interleaving.
NOTE: the fact that printk_cpu_sync_get_irqsave() is allowed to be called
multiple times on the same CPU is important here. Specifically we hold
the "lock" while calling dump_stack() which also gets the same "lock".
This is explicitly documented to be OK and means we don't need to
introduce a variant of dump_stack() that doesn't grab the lock.
Link: https://lkml.kernel.org/r/20231220131534.2.Ia5906525d440d8e8383cde31b7c61c2aadc8f907@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Li Zhe <lizhe.67@bytedance.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "watchdog: Better handling of concurrent lockups".
When we get multiple lockups at roughly the same time, the output in the
kernel logs can be very confusing since the reports about the lockups end
up interleaved in the logs. There is some code in the kernel to try to
handle this but it wasn't that complete.
Li Zhe recently made this a bit better for softlockups (specifically for
the case where `kernel.softlockup_all_cpu_backtrace` is not set) in commit
9d02330abd3e ("softlockup: serialized softlockup's log"), but that only
handled softlockup reports. Hardlockup reports still had similar issues.
This series also has a small fix to avoid dumping all stacks a second time
in the case of a panic. This is a bit unrelated to the interleaving fixes
but it does also improve the clarity of lockup reports.
This patch (of 4):
The hardlockup detector and softlockup detector both have the ability to
dump the stack of all CPUs (`kernel.hardlockup_all_cpu_backtrace` and
`kernel.softlockup_all_cpu_backtrace`). Both detectors also have some
logic to attempt to avoid interleaving printouts if two CPUs were trying
to do dumps of all CPUs at the same time. However:
- The hardlockup detector's logic still allowed interleaving some
information. Specifically another CPU could print modules and dump
the stack of the locked CPU at the same time we were dumping all
CPUs.
- In the case where `kernel.hardlockup_panic` was set in addition to
`kernel.hardlockup_all_cpu_backtrace`, when two CPUs both detected
hardlockups at the same time the second CPU could call panic() while
the first was still dumping stacks. This was especially bad if the
locked up CPU wasn't responding to the request for a backtrace since
the function nmi_trigger_cpumask_backtrace() can wait up to 10
seconds.
Let's resolve this by adopting the softlockup logic in the hardlockup
handler.
NOTES:
- As part of this, one might think that we should make a helper
function that both the hard and softlockup detectors call. This
turns out not to be super trivial since it would have to be
parameterized quite a bit since there are separate global variables
controlling each lockup detector and they print log messages that
are just different enough that it would be a pain. We probably don't
want to change the messages that are printed without good reason to
avoid throwing log parsers for a loop.
- One might also think that it would be a good idea to have the
hardlockup and softlockup detector use the same global variable to
prevent interleaving. This would make sure that softlockups and
hardlockups can't interleave each other. That _almost_ works but has
a dangerous flaw if `kernel.hardlockup_panic` is not the same as
`kernel.softlockup_panic` because we might skip a call to panic() if
one type of lockup was detected at the same time as another.
Link: https://lkml.kernel.org/r/20231220211640.2023645-1-dianders@chromium.org
Link: https://lkml.kernel.org/r/20231220131534.1.I4f35a69fbb124b5f0c71f75c631e11fabbe188ff@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
image->control_page represents the starting address for allocating the
next control page, while hole_end represents the address of the last valid
byte of the currently allocated control page.
This bug actually does not affect the correctness of allocating control
pages, because image->control_page is currently only used in
kimage_alloc_crash_control_pages(), and this function, when allocating
control pages, will first align image->control_page up to the nearest
`(1 << order) << PAGE_SHIFT` boundary, then use this value as the
starting address of the next control page. This ensures that the newly
allocated control page will use the correct starting address and not
overlap with previously allocated control pages.
Although it does not affect the correctness of the final result, it is
better for us to set image->control_page to the correct value, in case
it might be used elsewhere in the future, potentially causing errors.
Therefore, after successfully allocating a control page,
image->control_page should be updated to `hole_end + 1`, rather than
hole_end.
Link: https://lkml.kernel.org/r/20231221042308.11076-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
kernel_ident_mapping_init() takes an exclusive memory range [pstart, pend)
where pend is not included in the range, while res represents an inclusive
memory range [start, end] where end is considered part of the range.
Passing [start, end] rather than [start, end+1) to
kernel_ident_mapping_init() may result in the identity mapping for the
end address not being set up.
For example, when res->start is equal to res->end,
kernel_ident_mapping_init() will not establish any identity mapping.
Similarly, when the value of res->end is a multiple of 2M and the page
table maps 2M pages, kernel_ident_mapping_init() will also not set up
identity mapping for res->end.
Therefore, passing res->end directly to kernel_ident_mapping_init() is
incorrect, the correct end address should be `res->end + 1`.
Link: https://lkml.kernel.org/r/20231221101702.20956-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
asm-generic/io.h can be replaced with linux/io.h and the file will still
build correctly. It is an asm-generic file which should be avoided if
possible.
Link: https://lkml.kernel.org/r/20231221-tracereadwrite-v1-1-a434f25180c7@google.com
Signed-off-by: Tanzir Hasan <tanzirh@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Correct the function parameter names for nilfs_cpfile_get_info():
cpfile.c:564: warning: Function parameter or member 'cnop' not described in 'nilfs_cpfile_get_cpinfo'
cpfile.c:564: warning: Function parameter or member 'mode' not described in 'nilfs_cpfile_get_cpinfo'
cpfile.c:564: warning: Function parameter or member 'buf' not described in 'nilfs_cpfile_get_cpinfo'
cpfile.c:564: warning: Function parameter or member 'cisz' not described in 'nilfs_cpfile_get_cpinfo'
cpfile.c:564: warning: Excess function parameter 'cno' description in 'nilfs_cpfile_get_cpinfo'
cpfile.c:564: warning: Excess function parameter 'ci' description in 'nilfs_cpfile_get_cpinfo'
Also add missing descriptions of the function's specification.
[ konishi.ryusuke@gmail.com: filled in missing descriptions ]
Link: https://lkml.kernel.org/r/20231220065931.2372-1-rdunlap@infradead.org
Link: https://lkml.kernel.org/r/20231220221342.11505-1-konishi.ryusuke@gmail.com
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Change @task to @tsk to prevent kernel-doc warnings:
kernel/stacktrace.c:138: warning: Excess function parameter 'task' description in 'stack_trace_save_tsk'
kernel/stacktrace.c:138: warning: Function parameter or member 'tsk' not described in 'stack_trace_save_tsk'
Link: https://lkml.kernel.org/r/20231220054945.17663-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When I use older version aarch64 objdump (2.24) to disassemble aarch64
vmlinux, I get the result like below. There is no space between sp and
offset.
ffff800008010000 <dw_apb_ictl_handle_irq>:
ffff800008010000: d503233f hint #0x19
ffff800008010004: a9bc7bfd stp x29, x30, [sp,#-64]!
ffff800008010008: 90011e60 adrp x0, ffff80000a3dc000 <num_ictlrs>
ffff80000801000c: 910003fd mov x29, sp
ffff800008010010: a9025bf5 stp x21, x22, [sp,#32]
When I use newer version aarch64 objdump (2.35), I get
the result like below.
There is a space between sp and offset.
ffff800008010000 <dw_apb_ictl_handle_irq>:
ffff800008010000: d503233f paciasp
ffff800008010004: a9bc7bfd stp x29, x30, [sp, #-64]!
ffff800008010008: 90011e60 adrp x0, ffff80000a3dc000 <num_ictlrs>
ffff80000801000c: 910003fd mov x29, sp
ffff800008010010: a9025bf5 stp x21, x22, [sp, #32]
Add no space support of regular expression for old version objdump.
Link: https://lkml.kernel.org/r/20231220073629.2658-1-Kuan-Ying.Lee@mediatek.com
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Cc: Casper Li <casper.li@mediatek.com>
Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Qun-Wei Lin <qun-wei.lin@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
kexec_dprintk() expects the last argument to be kbuf.memsz, but the actual
argument being passed is kbuf.bufsz.
Although these two values are currently equal, it is better to pass the
correct one, in case these two values become different in the future.
Link: https://lkml.kernel.org/r/20231220154105.215610-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|