Age | Commit message (Collapse) | Author |
|
Add support for getting the boot hart ID from the Linux EFI stub using
RISCV_EFI_BOOT_PROTOCOL. This method is preferred over the existing DT
based approach since it works irrespective of DT or ACPI.
The specification of the protocol is hosted at:
https://github.com/riscv-non-isa/riscv-uefi
Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
Link: https://lore.kernel.org/r/20220519051512.136724-2-sunilvl@ventanamicro.com
[ardb: minor tweaks for coding style and whitespace]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.19-2022-05-18:
amdgpu:
- Misc code cleanups
- Additional SMU 13.x enablement
- Smartshift fixes
- GFX11 fixes
- Support for SMU 13.0.4
- SMU mutex fix
- Suspend/resume fix
amdkfd:
- static checker fix
- Doorbell/MMIO resource handling fix
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220518205621.5741-1-alexander.deucher@amd.com
|
|
request_reinit() is not only ugly as the comment rightfully suggests,
but also unsafe. Even though it is called with osdc->lock held for
write in all cases, resetting the OSD request refcount can still race
with handle_reply() and result in use-after-free. Taking linger ping
as an example:
handle_timeout thread handle_reply thread
down_read(&osdc->lock)
req = lookup_request(...)
...
finish_request(req) # unregisters
up_read(&osdc->lock)
__complete_request(req)
linger_ping_cb(req)
# req->r_kref == 2 because handle_reply still holds its ref
down_write(&osdc->lock)
send_linger_ping(lreq)
req = lreq->ping_req # same req
# cancel_linger_request is NOT
# called - handle_reply already
# unregistered
request_reinit(req)
WARN_ON(req->r_kref != 1) # fires
request_init(req)
kref_init(req->r_kref)
# req->r_kref == 1 after kref_init
ceph_osdc_put_request(req)
kref_put(req->r_kref)
# req->r_kref == 0 after kref_put, req is freed
<further req initialization/use> !!!
This happens because send_linger_ping() always (re)uses the same OSD
request for watch ping requests, relying on cancel_linger_request() to
unregister it from the OSD client and rip its messages out from the
messenger. send_linger() does the same for watch/notify registration
and watch reconnect requests. Unfortunately cancel_request() doesn't
guarantee that after it returns the OSD client would be completely done
with the OSD request -- a ref could still be held and the callback (if
specified) could still be invoked too.
The original motivation for request_reinit() was inability to deal with
allocation failures in send_linger() and send_linger_ping(). Switching
to using osdc->req_mempool (currently only used by CephFS) respects that
and allows us to get rid of request_reinit().
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux
Pull devfreq changes for 5.19-rc1 from Chanwoo Choi:
"1. Update devfreq core
- Add cpu based scaling support to passive governor. Some device like
cache might require the dynamic frequency scaling. But, it has very
tightly to cpu frequency. So that use passive governor to scale
the frequency according to current cpu frequency.
To decide the frequency of the device, the governor does one of the following:
: Derives the optimal devfreq device opp from required-opps property of
the parent cpu opp_table.
: Scales the device frequency in proportion to the CPU frequency. So, if
the CPUs are running at their max frequency, the device runs at its
max frequency. If the CPUs are running at their min frequency, the
device runs at its min frequency. It is interpolated for frequencies
in between.
2. Update devfreq drivers
- Update rk3399_dmc.c as following:
: Convert dt-binding document to YAML and deprecate unused properties.
: Use Hz units for the device-tree properties of rk3399_dmc.
: rk3399_dmc is able to set the idle time before changing the dmc clock.
Specify idle time parameters by using nano-second unit on dt bidning.
: Add new disable-freq properties to optimize the power-saving feature
of rk3399_dmc.
: Disable devfreq-event device on remove() to fix unbalanced
enable-disable count.
: Use devm_pm_opp_of_add_table()
: Block PMU (Power-Management Unit) transitions when scaling frequency
by ARM Trust Firmware in order to fix the conflict between PMU and DMC
(Dynamic Memory Controller)."
* tag 'devfreq-next-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux:
PM / devfreq: passive: Keep cpufreq_policy for possible cpus
PM / devfreq: passive: Reduce duplicate code when passive_devfreq case
PM / devfreq: Add cpu based scaling support to passive governor
PM / devfreq: Export devfreq_get_freq_range symbol within devfreq
PM / devfreq: rk3399_dmc: Block PMU during transitions
soc: rockchip: power-domain: Manage resource conflicts with firmware
PM / devfreq: rk3399_dmc: Avoid static (reused) profile
PM / devfreq: rk3399_dmc: Use devm_pm_opp_of_add_table()
PM / devfreq: rk3399_dmc: Disable edev on remove()
PM / devfreq: rk3399_dmc: Support new *-ns properties
PM / devfreq: rk3399_dmc: Support new disable-freq properties
PM / devfreq: rk3399_dmc: Use bitfield macro definitions for ODT_PD
PM / devfreq: rk3399_dmc: Drop excess timing properties
PM / devfreq: rk3399_dmc: Drop undocumented ondemand DT props
dt-bindings: devfreq: rk3399_dmc: Add more disable-freq properties
dt-bindings: devfreq: rk3399_dmc: Specify idle params in nanoseconds
dt-bindings: devfreq: rk3399_dmc: Fix Hz units
dt-bindings: devfreq: rk3399_dmc: Deprecate unused/redundant properties
dt-bindings: devfreq: rk3399_dmc: Convert to YAML
|
|
Add support for using longer timeouts during controller initialization
and letting the controller come up with namespaces that are not ready
for I/O yet. We skip these not ready namespaces during scanning and
only bring them online once anoter scan is kicked off by the AEN that
is set when the NRDY bit gets set in the I/O Command Set Independent
Identify Namespace Data Structure. This asynchronous probing avoids
blocking the kernel boot when controllers take a very long time to
recover after unclean shutdowns (up to minutes).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
|
|
Currently, start_kernel() adds latent entropy and the command line to
the entropy bool *after* the RNG has been initialized, deferring when
it's actually used by things like stack canaries until the next time
the pool is seeded. This surely is not intended.
Rather than splitting up which entropy gets added where and when between
start_kernel() and random_init(), just do everything in random_init(),
which should eliminate these kinds of bugs in the future.
While we're at it, rename the awkwardly titled "rand_initialize()" to
the more standard "random_init()" nomenclature.
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
random32.c has two random number generators in it: one that is meant to
be used deterministically, with some predefined seed, and one that does
the same exact thing as random.c, except does it poorly. The first one
has some use cases. The second one no longer does and can be replaced
with calls to random.c's proper random number generator.
The relatively recent siphash-based bad random32.c code was added in
response to concerns that the prior random32.c was too deterministic.
Out of fears that random.c was (at the time) too slow, this code was
anonymously contributed. Then out of that emerged a kind of shadow
entropy gathering system, with its own tentacles throughout various net
code, added willy nilly.
Stop👏making👏bespoke👏random👏number👏generators👏.
Fortunately, recent advances in random.c mean that we can stop playing
with this sketchiness, and just use get_random_u32(), which is now fast
enough. In micro benchmarks using RDPMC, I'm seeing the same median
cycle count between the two functions, with the mean being _slightly_
higher due to batches refilling (which we can optimize further need be).
However, when doing *real* benchmarks of the net functions that actually
use these random numbers, the mean cycles actually *decreased* slightly
(with the median still staying the same), likely because the additional
prandom code means icache misses and complexity, whereas random.c is
generally already being used by something else nearby.
The biggest benefit of this is that there are many users of prandom who
probably should be using cryptographically secure random numbers. This
makes all of those accidental cases become secure by just flipping a
switch. Later on, we can do a tree-wide cleanup to remove the static
inline wrapper functions that this commit adds.
There are also some low-ish hanging fruits for making this even faster
in the future: a get_random_u16() function for use in the networking
stack will give a 2x performance boost there, using SIMD for ChaCha20
will let us compute 4 or 8 or 16 blocks of output in parallel, instead
of just one, giving us large buffers for cheap, and introducing a
get_random_*_bh() function that assumes irqs are already disabled will
shave off a few cycles for ordinary calls. These are things we can chip
away at down the road.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
The SipHash family of permutations is currently used in three places:
- siphash.c itself, used in the ordinary way it was intended.
- random32.c, in a construction from an anonymous contributor.
- random.c, as part of its fast_mix function.
Each one of these places reinvents the wheel with the same C code, same
rotation constants, and same symmetry-breaking constants.
This commit tidies things up a bit by placing macros for the
permutations and constants into siphash.h, where each of the three .c
users can access them. It also leaves a note dissuading more users of
them from emerging.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
fsnotify_foreach_iter_mark_type() is used to reduce boilerplate code
of iterating all marks of a specific group interested in an event
by consulting the iterator report_mask.
Use an open coded version of that iterator in fsnotify_iter_next()
that collects all marks of the current iteration group without
consulting the iterator report_mask.
At the moment, the two iterator variants are the same, but this
decoupling will allow us to exclude some of the group's marks from
reporting the event, for example for event on child and inode marks
on parent did not request to watch events on children.
Fixes: 2f02fd3fa13e ("fanotify: fix ignore mask logic for events on child and on dir")
Reported-by: Jan Kara <jack@suse.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220511190213.831646-2-amir73il@gmail.com
|
|
The boardfiles for IXP4xx have been deleted. Delete all the
quirks and code dealing with that boot path and rely solely on
device tree boot.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20220406205505.2332821-1-linus.walleij@linaro.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
Multiport eswitch mode is a LAG mode that allows to add rules that
forward traffic to a specific physical port without being affected by LAG
affinity configuration.
This mode of operation is mutual exclusive with the other LAG modes used
by multipath and bonding.
To make the transition between the modes, we maintain a counter on the
number of rules specifying one of the uplink representors as the target
of mirred egress redirect action.
An example of such rule would be:
$ tc filter add dev enp8s0f0_0 prot all root flower dst_mac \
00:11:22:33:44:55 action mirred egress redirect dev enp8s0f0
If the reference count just grows to one and LAG is not in use, we
create the LAG in multiport eswitch mode. Other mode changes are not
allowed while in this mode. When the reference count reaches zero, we
destroy the LAG and let other modes be used if needed.
logic also changed such that if forwarding to some uplink destination
cannot be guaranteed, we fail the operation so the rule will eventually
be in software and not in hardware.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Take the wrapper version which picks default node into a header file.
This reduces the number of exported functions.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add syndrome of last command failure per command type to debugfs to ease
debugging of such failure.
last_failed_syndrome - last command failed syndrome returned by FW.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Use TOD_READ_SECONDARY for extts to keep TOD_READ_PRIMARY
for gettime and settime exclusively. Before this change,
TOD_READ_PRIMARY was used for both extts and gettime/settime,
which would result in changing TOD read/write triggers between
operations. Using TOD_READ_SECONDARY would make extts
independent of gettime/settime operation
Signed-off-by: Min Li <min.li.xe@renesas.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Link: https://lore.kernel.org/r/1652712427-14703-1-git-send-email-min.li.xe@renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add generic support for try_cmpxchg64{,_acquire,_release,_relaxed}
and their falbacks involving cmpxchg64.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220515184205.103089-2-ubizjak@gmail.com
|
|
Not calling the function for dummy contexts will cause the context to
not be reset. During the next syscall, this will cause an error in
__audit_syscall_entry:
WARN_ON(context->context != AUDIT_CTX_UNUSED);
WARN_ON(context->name_count);
if (context->context != AUDIT_CTX_UNUSED || context->name_count) {
audit_panic("unrecoverable error in audit_syscall_entry()");
return;
}
These problematic dummy contexts are created via the following call
chain:
exit_to_user_mode_prepare
-> arch_do_signal_or_restart
-> get_signal
-> task_work_run
-> tctx_task_work
-> io_req_task_submit
-> io_issue_sqe
-> audit_uring_entry
Cc: stable@vger.kernel.org
Fixes: 5bd2182d58e9 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Julian Orth <ju.orth@gmail.com>
[PM: subject line tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
Add the ability to set or retrieve the acl using the NFSv4.1 'dacl' and
'sacl' attributes to the NFSv4 xdr encoders/decoders.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
When caching a NFSv4 ACL, we want to specify whether we are caching an
NFSv4.0 type acl, the NFSv4.1 dacl or the NFSv4.1 sacl.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Implement the data plane of on-demand read mode.
The early implementation [1] place the entry to
cachefiles_ondemand_read() in fscache_read(). However, fscache_read()
can only detect if the requested file range is fully cache miss, whilst
we need to notify the user daemon as long as there's a hole inside the
requested file range.
Thus the entry is now placed in cachefiles_prepare_read(). When working
in on-demand read mode, once a hole detected, the read routine will send
a READ request to the user daemon. The user daemon needs to fetch the
data and write it to the cache file. After sending the READ request, the
read routine will hang there, until the READ request is handled by the
user daemon. Then it will retry to read from the same file range. If no
progress encountered, the read routine will fail then.
A new NETFS_SREQ_ONDEMAND flag is introduced to indicate that on-demand
read should be done when a cache miss encountered.
[1] https://lore.kernel.org/all/20220406075612.60298-6-jefflexu@linux.alibaba.com/ #v8
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20220425122143.56815-6-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Fscache/CacheFiles used to serve as a local cache for a remote
networking fs. A new on-demand read mode will be introduced for
CacheFiles, which can boost the scenario where on-demand read semantics
are needed, e.g. container image distribution.
The essential difference between these two modes is seen when a cache
miss occurs: In the original mode, the netfs will fetch the data from
the remote server and then write it to the cache file; in on-demand
read mode, fetching the data and writing it into the cache is delegated
to a user daemon.
As the first step, notify the user daemon when looking up cookie. In
this case, an anonymous fd is sent to the user daemon, through which the
user daemon can write the fetched data to the cache file. Since the user
daemon may move the anonymous fd around, e.g. through dup(), an object
ID uniquely identifying the cache file is also attached.
Also add one advisory flag (FSCACHE_ADV_WANT_CACHE_SIZE) suggesting that
the cache file size shall be retrieved at runtime. This helps the
scenario where one cache file contains multiple netfs files, e.g. for
the purpose of deduplication. In this case, netfs itself has no idea the
size of the cache file, whilst the user daemon should give the hint on
it.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220509074028.74954-3-jefflexu@linux.alibaba.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
More and more drivers will check for bad characters in the hwmon name
and all are using the same code snippet. Consolidate that code by adding
a new hwmon_sanitize_name() function.
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20220405092452.4033674-2-michael@walle.cc
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
The passive governor requires the cpu data to get the next target frequency
of devfreq device if depending on cpu. In order to reduce the unnecessary
memory data, keep cpufreq_policy data for possible cpus instead of NR_CPU.
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Tested-by: Johnson Wang <johnson.wang@mediatek.com>
Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
|
|
Many CPU architectures have caches that can scale independent of the
CPUs. Frequency scaling of the caches is necessary to make sure that the
cache is not a performance bottleneck that leads to poor performance and
power. The same idea applies for RAM/DDR.
To achieve this, this patch adds support for cpu based scaling to the
passive governor. This is accomplished by taking the current frequency
of each CPU frequency domain and then adjust the frequency of the cache
(or any devfreq device) based on the frequency of the CPUs. It listens
to CPU frequency transition notifiers to keep itself up to date on the
current CPU frequency.
To decide the frequency of the device, the governor does one of the
following:
* Derives the optimal devfreq device opp from required-opps property of
the parent cpu opp_table.
* Scales the device frequency in proportion to the CPU frequency. So, if
the CPUs are running at their max frequency, the device runs at its
max frequency. If the CPUs are running at their min frequency, the
device runs at its min frequency. It is interpolated for frequencies
in between.
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Tested-by: Johnson Wang <johnson.wang@mediatek.com>
Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
[Sibi: Integrated cpu-freqmap governor into passive_governor]
Signed-off-by: Sibi Sankar <sibis@codeaurora.org>
[Chanwoo: Fix conflict with latest code and cleanup code]
Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
|
|
Instead of having one big enum add one for each register or field.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
|
|
skb_data_area_size() is not needed. As Jakub pointed out [1]:
For Rx, drivers can use the size passed during skb allocation or
use skb_tailroom().
For Tx, drivers should use skb_headlen().
[1] https://lore.kernel.org/netdev/CAHNKnsTmH-rGgWi3jtyC=ktM1DW2W1VJkYoTMJV2Z_Bt498bsg@mail.gmail.com/
Signed-off-by: Ricardo Martinez <ricardo.martinez@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Introduce dax_recovery_write() operation. The function is used to
recover a dax range that contains poison. Typical use case is when
a user process receives a SIGBUS with si_code BUS_MCEERR_AR
indicating poison(s) in a dax range, in response, the user process
issues a pwrite() to the page-aligned dax range, thus clears the
poison and puts valid data in the range.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Link: https://lore.kernel.org/r/20220422224508.440670-6-jane.chu@oracle.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Up till now, dax_direct_access() is used implicitly for normal
access, but for the purpose of recovery write, dax range with
poison is requested. To make the interface clear, introduce
enum dax_access_mode {
DAX_ACCESS,
DAX_RECOVERY_WRITE,
}
where DAX_ACCESS is used for normal dax access, and
DAX_RECOVERY_WRITE is used for dax recovery write.
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/165247982851.52965.11024212198889762949.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
The set_memory_uc() approach doesn't work well in all cases.
As Dan pointed out when "The VMM unmapped the bad page from
guest physical space and passed the machine check to the guest."
"The guest gets virtual #MC on an access to that page. When
the guest tries to do set_memory_uc() and instructs cpa_flush()
to do clean caches that results in taking another fault / exception
perhaps because the VMM unmapped the page from the guest."
Since the driver has special knowledge to handle NP or UC,
mark the poisoned page with NP and let driver handle it when
it comes down to repair.
Please refer to discussions here for more details.
https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/
Now since poisoned page is marked as not-present, in order to
avoid writing to a not-present page and trigger kernel Oops,
also fix pmem_do_write().
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/165272615484.103830.2563950688772226611.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Relocate the twin mce functions to arch/x86/mm/pat/set_memory.c
file where they belong.
While at it, fixup a function name in a comment.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
[sfr: gate {set,clear}_mce_nospec() by CONFIG_X86_64]
Link: https://lore.kernel.org/r/165272527328.90175.8336008202048685278.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
* kvm-arm64/vgic-invlpir:
: .
: Implement MMIO-based LPI invalidation for vGICv3.
: .
KVM: arm64: vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR revision
KVM: arm64: vgic-v3: Implement MMIO-based LPI invalidation
KVM: arm64: vgic-v3: Expose GICR_CTLR.RWP when disabling LPIs
irqchip/gic-v3: Exposes bit values for GICR_CTLR.{IR, CES}
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Allow the file system to keep state for all iterations. For now only
wire it up for direct I/O as there is an immediate need for it there.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Allow the file system to provide a specific bio_set for allocating
direct I/O bios. This will allow file systems that use the
->submit_io hook to stash away additional information for file system
use.
To make use of this additional space for information in the completion
path, the file system needs to override the ->bi_end_io callback and
then call back into iomap, so export iomap_dio_bio_end_io for that.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Add a function sb_write_started() to allow callers to verify if
sb_start_write() is properly called. It will be used for assertion in
btrfs.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We need the tty fixes in here as well, as we need to revert one of them :(
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Some of mediatek processors contain
the Tensilica HiFix DSP for audio processing.
The communication between Host CPU and DSP firmware is
taking place using a shared memory area for message passing.
ADSP IPC protocol offers (send/recv) interfaces using
mediatek-mailbox APIs.
We use two mbox channels to implement a request-reply protocol.
Signed-off-by: Allen-KH Cheng <allen-kh.cheng@mediatek.com>
Signed-off-by: TingHan Shen <tinghan.shen@mediatek.com>
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Curtis Malainey <cujomalainey@chromium.org>
Reviewed-by: Tzung-Bi Shih <tzungbi@google.com>
Reviewed-by: YC Hung <yc.hung@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link: https://lore.kernel.org/r/20220512082215.3018-2-tinghan.shen@mediatek.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
When calling dev_fill_forward_path on a pppoe device, the provided destination
address is invalid. In order for the bridge fdb lookup to succeed, the pppoe
code needs to update ctx->daddr to the correct value.
Fix this by storing the address inside struct net_device_path_ctx
Fixes: f6efc675c9dd ("net: ppp: resolve forwarding path for bridge pppoe devices")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
A cpu can observe sd->defer_count reaching 128,
and call smp_call_function_single_async()
Problem is that the remote CPU can clear sd->defer_count
before the IPI is run/acknowledged.
Other cpus can queue more packets and also decide
to call smp_call_function_single_async() while the pending
IPI was not yet delivered.
This is a common issue with smp_call_function_single_async().
Callers must ensure correct synchronization and serialization.
I triggered this issue while experimenting smaller threshold.
Performing the call to smp_call_function_single_async()
under sd->defer_lock protection did not solve the problem.
Commit 5a18ceca6350 ("smp: Allow smp_call_function_single_async()
to insert locked csd") replaced an informative WARN_ON_ONCE()
with a return of -EBUSY, which is often ignored.
Test of CSD_FLAG_LOCK presence is racy anyway.
Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The SKB_DR_OR() is used to set the drop reason to a value when it is
not set or specified yet. SKB_NOT_DROPPED_YET should also be considered
as not set.
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Instead of simply forcing a 0 payload_len in IPv6 header,
implement RFC 2675 and insert a custom extension header.
Note that only TCP stack is currently potentially generating
jumbograms, and that this extension header is purely local,
it wont be sent on a physical link.
This is needed so that packet capture (tcpdump and friends)
can properly dissect these large packets.
Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow the gro_max_size to exceed a value larger than 65536.
There weren't really any external limitations that prevented this other
than the fact that IPv4 only supports a 16 bit length field. Since we have
the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to
exceed this value and for IPv4 and non-TCP flows we can cap things at 65536
via a constant rather than relying on gro_max_size.
[edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make sure we will not overflow shinfo->gso_segs
Minimal TCP MSS size is 8 bytes, and shinfo->gso_segs
is a 16bit field.
TCP_MIN_GSO_SIZE is currently defined in include/net/tcp.h,
it seems cleaner to not bring tcp details into include/linux/netdevice.h
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The code for gso_max_size was added originally to allow for debugging and
workaround of buggy devices that couldn't support TSO with blocks 64K in
size. The original reason for limiting it to 64K was because that was the
existing limits of IPv4 and non-jumbogram IPv6 length fields.
With the addition of Big TCP we can remove this limit and allow the value
to potentially go up to UINT_MAX and instead be limited by the tso_max_size
value.
So in order to support this we need to go through and clean up the
remaining users of the gso_max_size value so that the values will cap at
64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
limit for GSO_MAX_SIZE.
v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
in a new sk_trim_gso_size() helper.
netif_set_tso_max_size() caps the requested TSO size
with GSO_MAX_SIZE.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Log a few more path related status codes.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
This was only used by the ide-cd driver, which went away in
commit b7fb14d3ac63 ("ide: remove the legacy ide driver")
so we might as well take advantage of that and get rid of
this hook as well.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Phillip Potter <phil@philpotter.co.uk>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: https://lore.kernel.org/all/20220427132436.12795-2-paul.gortmaker@windriver.com
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Link: https://lore.kernel.org/r/20220515205833.944139-3-phil@philpotter.co.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently we have 3 primitives for removing an opened file from descriptor
table - pick_file(), __close_fd_get_file() and close_fd_get_file(). Their
calling conventions are rather odd and there's a code duplication for no
good reason. They can be unified -
1) have __range_close() cap max_fd in the very beginning; that way
we don't need separate way for pick_file() to report being past the end
of descriptor table.
2) make {__,}close_fd_get_file() return file (or NULL) directly, rather
than returning it via struct file ** argument. Don't bother with
(bogus) return value - nobody wants that -ENOENT.
3) make pick_file() return NULL on unopened descriptor - the only caller
that used to care about the distinction between descriptor past the end
of descriptor table and finding NULL in descriptor table doesn't give
a damn after (1).
4) lift ->files_lock out of pick_file()
That actually simplifies the callers, as well as the primitives themselves.
Code duplication is also gone...
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
These two interface were added in 091141a42 commit,
but now there is no place to call them.
The only user of fput/fget_many() was removed in commit
62906e89e63b ("io_uring: remove file batch-get optimisation").
A user of get_file_rcu_many() were removed in commit
f073531070d2 ("init: add an init_dup helper").
And replace atomic_long_sub/add to atomic_long_dec/inc
can improve performance.
Here are the test results of unixbench:
Cmd: ./Run -c 64 context1
Without patch:
System Benchmarks Partial Index BASELINE RESULT INDEX
Pipe-based Context Switching 4000.0 2798407.0 6996.0
========
System Benchmarks Index Score (Partial Only) 6996.0
With patch:
System Benchmarks Partial Index BASELINE RESULT INDEX
Pipe-based Context Switching 4000.0 3486268.8 8715.7
========
System Benchmarks Index Score (Partial Only) 8715.7
Signed-off-by: Gou Hao <gouhao@uniontech.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Fix the warning - "Enum value 'NR_DAMON_OPS' not described in enum
'damon_ops_id'" generated by the command "make pdfdocs"
Link: https://lkml.kernel.org/r/20220508073316.141401-1-gautammenghani201@gmail.com
Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and
reboot the server to enable or disable the feature of optimizing vmemmap
pages associated with HugeTLB pages. However, rebooting usually takes a
long time. So add a sysctl to enable or disable the feature at runtime
without rebooting. Why we need this? There are 3 use cases.
1) The feature of minimizing overhead of struct page associated with
each HugeTLB is disabled by default without passing
"hugetlb_free_vmemmap=on" to the boot cmdline. When we (ByteDance)
deliver the servers to the users who want to enable this feature, they
have to configure the grub (change boot cmdline) and reboot the
servers, whereas rebooting usually takes a long time (we have thousands
of servers). It's a very bad experience for the users. So we need a
approach to enable this feature after rebooting. This is a use case in
our practical environment.
2) Some use cases are that HugeTLB pages are allocated 'on the fly'
instead of being pulled from the HugeTLB pool, those workloads would be
affected with this feature enabled. Those workloads could be
identified by the characteristics of they never explicitly allocating
huge pages with 'nr_hugepages' but only set 'nr_overcommit_hugepages'
and then let the pages be allocated from the buddy allocator at fault
time. We can confirm it is a real use case from the commit
099730d67417. For those workloads, the page fault time could be ~2x
slower than before. We suspect those users want to disable this
feature if the system has enabled this before and they don't think the
memory savings benefit is enough to make up for the performance drop.
3) If the workload which wants vmemmap pages to be optimized and the
workload which wants to set 'nr_overcommit_hugepages' and does not want
the extera overhead at fault time when the overcommitted pages be
allocated from the buddy allocator are deployed in the same server.
The user could enable this feature and set 'nr_hugepages' and
'nr_overcommit_hugepages', then disable the feature. In this case, the
overcommited HugeTLB pages will not encounter the extra overhead at
fault time.
Link: https://lkml.kernel.org/r/20220512041142.39501-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On some architectures (like ARM64), it can support CONT-PTE/PMD size
hugetlb, which means it can support not only PMD/PUD size hugetlb: 2M and
1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page size specified.
When migrating a hugetlb page, we will get the relevant page table entry
by huge_pte_offset() only once to nuke it and remap it with a migration
pte entry. This is correct for PMD or PUD size hugetlb, since they always
contain only one pmd entry or pud entry in the page table.
However this is incorrect for CONT-PTE and CONT-PMD size hugetlb, since
they can contain several continuous pte or pmd entry with same page table
attributes. So we will nuke or remap only one pte or pmd entry for this
CONT-PTE/PMD size hugetlb page, which is not expected for hugetlb
migration. The problem is we can still continue to modify the subpages'
data of a hugetlb page during migrating a hugetlb page, which can cause a
serious data consistent issue, since we did not nuke the page table entry
and set a migration pte for the subpages of a hugetlb page.
To fix this issue, we should change to use huge_ptep_clear_flush() to nuke
a hugetlb page table, and remap it with set_huge_pte_at() and
set_huge_swap_pte_at() when migrating a hugetlb page, which already
considered the CONT-PTE or CONT-PMD size hugetlb.
[akpm@linux-foundation.org: fix nommu build]
[baolin.wang@linux.alibaba.com: fix build errors for !CONFIG_MMU]
Link: https://lkml.kernel.org/r/a4baca670aca637e7198d9ae4543b8873cb224dc.1652270205.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/ea5abf529f0997b5430961012bfda6166c1efc8c.1652147571.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The use of kmap_atomic() is new code is being deprecated in favor of
kmap_local_page(). For this reason the "Using kmap_atomic" section in
highmem.rst is obsolete and unnecessary, but it can still help developers
if it were moved to kdocs in highmem.h.
Therefore, move the relevant parts of this section from highmem.rst and
merge them with the kdocs in highmem.h.
Link: https://lkml.kernel.org/r/20220428212455.892-4-fmdefrancesco@gmail.com
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|