Age | Commit message (Collapse) | Author |
|
Up to now, guest userspace does logging directly to host using essentially
the same rather complex port assembly stuff as the kernel.
We'd rather use the same mechanism than duplicate it (it may also change in
the future), hence add a new ioctl for relaying guest/host messaging
(logging is just one application of it).
Signed-off-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
|
|
Some gpio's parent irqdomain may not use the struct irq_fwspec as
argument, such as msi irqdomain. So rename the callback
populate_parent_fwspec() to populate_parent_alloc_arg() and make it
allocate and populate the specific struct which is needed by the
parent irqdomain.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Link: https://lore.kernel.org/r/20200114082821.14015-3-haokexin@gmail.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
The last user of the phy generic platform data was
deleted in commit 1e041b6f313aaa966612a7e415cfc09c90d6b829
("usb: dwc3: exynos: Remove dead code"). So get rid of
the platform data, which rids us of another consumer of
the legacy GPIO API at the same time. Make sure we
only inlcude <linux/gpio/consumer.h> which is all we use.
Alter the usb_phy_gen_create_phy() function prototype to
not pass any platform data as this is just hardcoded to
NULL at all locations calling it in the kernel.
Move the devm_gpiod_get* calls out of the if (of_node)
parenthesis, as these calls are generic and do not depend
on device tree, they are used by any hardware description.
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Felipe Balbi <balbi@kernel.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Felipe Balbi <balbi@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
In case a radar event of CAC_FINISHED or RADAR_DETECTED
happens during another phy is during CAC we might need
to cancel that CAC.
If we got a radar in a channel that another phy is now
doing CAC on then the CAC should be canceled there.
If, for example, 2 phys doing CAC on the same channels,
or on comptable channels, once on of them will finish his
CAC the other might need to cancel his CAC, since it is no
longer relevant.
To fix that the commit adds an callback and implement it in
mac80211 to end CAC.
This commit also adds a call to said callback if after a radar
event we see the CAC is no longer relevant
Signed-off-by: Orr Mazor <Orr.Mazor@tandemg.com>
Reviewed-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20191222145449.15792-1-Orr.Mazor@tandemg.com
[slightly reformat/reword commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
git://anongit.freedesktop.org/drm/drm-intel into drm-next
Final drm/i915 features for v5.6:
- DP MST fixes (José)
- Fix intel_bw_state memory leak (Pankaj Bharadiya)
- Switch context id allocation to xarray (Tvrtko)
- ICL/EHL/TGL workarounds (Matt Roper, Tvrtko)
- Debugfs for LMEM details (Lukasz Fiedorowicz)
- Prefer platform acronyms over codenames in symbols (Lucas)
- Tiled and port sync mode fixes for fbdev and DP (Manasi)
- DSI panel and backlight enable GPIO fixes (Hans de Goede)
- Relax audio min CDCLK requirements on non-GLK (Kai Vehmanen)
- Plane alignment and dimension check fixes (Imre)
- Fix state checks for PSR (José)
- Remove ICL+ clock gating programming (José)
- Static checker fixes around bool usage (Ma Feng)
- Bring back tests for self-contained headers in i915 (Masahiro Yamada)
- Fix DP MST disable sequence (Ville)
- Start converting i915 to the new drm device based logging macros (Wambui Karuga)
- Add DSI VBT I2C sequence execution (Vivek Kasireddy)
- Start using function pointers and ops structs in uc code (Michal)
- Fix PMU names to not use colons or dashes (Tvrtko)
- TGL media decompression support (DK, Imre)
- Split i915_gem_gtt.[ch] to more manageable chunks (Matthew Auld)
- Create dumb buffers in LMEM where available (Ram)
- Extend mmap support for LMEM (Abdiel)
- Selftest updates (Chris)
- Hack bump up CDCLK on TGL to avoid underruns (Stan)
- Use intel_encoder and intel_connector more instead of drm counterparts (Ville)
- Build error fixes (Zhang Xiaoxu)
- Fixes related to GPU and engine initialization/resume (Chris)
- Support for prefaulting discontiguous objects (Abdiel)
- Support discontiguous LMEM object maps (Chris)
- Various GEM and GT improvements and fixes (Chris)
- Merge pinctrl dependencies branch for the DSI GPIO updates (Jani)
- Backmerge drm-next for new logging macros (Jani)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/87sgkil0v9.fsf@intel.com
|
|
https://github.com/ckhu-mediatek/linux.git-tags into drm-next
Mediatek DRM Next for Linux 5.6
This fix non-smooth cursor problem, add cmdq support, add ctm property
support and some refinement.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: CK Hu <ck.hu@mediatek.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1578972526.14594.8.camel@mtksdaap41
|
|
... and get rid of a bunch of bugs in it. Background:
the reason for path_mountpoint() is that umount() really doesn't
want attempts to revalidate the root of what it's trying to umount.
The thing we want to avoid actually happen from complete_walk();
solution was to do something parallel to normal path_lookupat()
and it both went overboard and got the boilerplate subtly
(and not so subtly) wrong.
A better solution is to do pretty much what the normal path_lookupat()
does, but instead of complete_walk() do unlazy_walk(). All it takes
to avoid that ->d_weak_revalidate() call... mountpoint_last() goes
away, along with everything it got wrong, and so does the magic around
LOOKUP_NO_REVAL.
Another source of bugs is that when we traverse mounts at the final
location (and we need to do that - umount . expects to get whatever's
overmounting ., if any, out of the lookup) we really ought to take
care of ->d_manage() - as it is, manual umount of autofs automount
in progress can lead to unpleasant surprises for the daemon. Easily
solved by using handle_lookup_down() instead of follow_mount().
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://anongit.freedesktop.org/tegra/linux into drm-next
drm/tegra: Changes for v5.6-rc1
This contains a small set of mostly fixes and some minor improvements.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thierry Reding <thierry.reding@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200111004835.2412858-1-thierry.reding@gmail.com
|
|
In a similar fashion to previous patch, add "offload" and "trap"
indication to IPv6 routes.
This is done by using two unused bits in 'struct fib6_info' to hold
these indications. Capable drivers are expected to set these when
processing the various in-kernel route notifications.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When performing L3 offload, routes and nexthops are usually programmed
into two different tables in the underlying device. Therefore, the fact
that a nexthop resides in hardware does not necessarily mean that all
the associated routes also reside in hardware and vice-versa.
While the kernel can signal to user space the presence of a nexthop in
hardware (via 'RTNH_F_OFFLOAD'), it does not have a corresponding flag
for routes. In addition, the fact that a route resides in hardware does
not necessarily mean that the traffic is offloaded. For example,
unreachable routes (i.e., 'RTN_UNREACHABLE') are programmed to trap
packets to the CPU so that the kernel will be able to generate the
appropriate ICMP error packet.
This patch adds an "offload" and "trap" indications to IPv4 routes, so
that users will have better visibility into the offload process.
'struct fib_alias' is extended with two new fields that indicate if the
route resides in hardware or not and if it is offloading traffic from
the kernel or trapping packets to it. Note that the new fields are added
in the 6 bytes hole and therefore the struct still fits in a single
cache line [1].
Capable drivers are expected to invoke fib_alias_hw_flags_set() with the
route's key in order to set the flags.
The indications are dumped to user space via a new flags (i.e.,
'RTM_F_OFFLOAD' and 'RTM_F_TRAP') in the 'rtm_flags' field in the
ancillary header.
v2:
* Make use of 'struct fib_rt_info' in fib_alias_hw_flags_set()
[1]
struct fib_alias {
struct hlist_node fa_list; /* 0 16 */
struct fib_info * fa_info; /* 16 8 */
u8 fa_tos; /* 24 1 */
u8 fa_type; /* 25 1 */
u8 fa_state; /* 26 1 */
u8 fa_slen; /* 27 1 */
u32 tb_id; /* 28 4 */
s16 fa_default; /* 32 2 */
u8 offload:1; /* 34: 0 1 */
u8 trap:1; /* 34: 1 1 */
u8 unused:6; /* 34: 2 1 */
/* XXX 5 bytes hole, try to pack */
struct callback_head rcu __attribute__((__aligned__(8))); /* 40 16 */
/* size: 56, cachelines: 1, members: 12 */
/* sum members: 50, holes: 1, sum holes: 5 */
/* sum bitfield members: 8 bits (1 bytes) */
/* forced alignments: 1, forced holes: 1, sum forced holes: 5 */
/* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
fib_dump_info() is used to prepare RTM_{NEW,DEL}ROUTE netlink messages
using the passed arguments. Currently, the function takes 11 arguments,
6 of which are attributes of the route being dumped (e.g., prefix, TOS).
The next patch will need the function to also dump to user space an
indication if the route is present in hardware or not. Instead of
passing yet another argument, change the function to take a struct
containing the different route attributes.
v2:
* Name last argument of fib_dump_info()
* Move 'struct fib_rt_info' to include/net/ip_fib.h so that it could
later be passed to fib_alias_hw_flags_set()
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When fs-verity verifies data pages, currently it reads each Merkle tree
page synchronously using read_mapping_page().
Therefore, when the Merkle tree pages aren't already cached, fs-verity
causes an extra 4 KiB I/O request for every 512 KiB of data (assuming
that the Merkle tree uses SHA-256 and 4 KiB blocks). This results in
more I/O requests and performance loss than is strictly necessary.
Therefore, implement readahead of the Merkle tree pages.
For simplicity, we take advantage of the fact that the kernel already
does readahead of the file's *data*, just like it does for any other
file. Due to this, we don't really need a separate readahead state
(struct file_ra_state) just for the Merkle tree, but rather we just need
to piggy-back on the existing data readahead requests.
We also only really need to bother with the first level of the Merkle
tree, since the usual fan-out factor is 128, so normally over 99% of
Merkle tree I/O requests are for the first level.
Therefore, make fsverity_verify_bio() enable readahead of the first
Merkle tree level, for up to 1/4 the number of pages in the bio, when it
sees that the REQ_RAHEAD flag is set on the bio. The readahead size is
then passed down to ->read_merkle_tree_page() for the filesystem to
(optionally) implement if it sees that the requested page is uncached.
While we're at it, also make build_merkle_tree_level() set the Merkle
tree readahead size, since it's easy to do there.
However, for now don't set the readahead size in fsverity_verify_page(),
since currently it's only used to verify holes on ext4 and f2fs, and it
would need parameters added to know how much to read ahead.
This patch significantly improves fs-verity sequential read performance.
Some quick benchmarks with 'cat'-ing a 250MB file after dropping caches:
On an ARM64 phone (using sha256-ce):
Before: 217 MB/s
After: 263 MB/s
(compare to sha256sum of non-verity file: 357 MB/s)
In an x86_64 VM (using sha256-avx2):
Before: 173 MB/s
After: 215 MB/s
(compare to sha256sum of non-verity file: 223 MB/s)
Link: https://lore.kernel.org/r/20200106205533.137005-1-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Just as commit 0566e40ce7 ("tracing: initcall: Ordered comparison of
function pointers"), this patch fixes another remaining one in xen.h
found by clang-9.
In file included from arch/x86/xen/trace.c:21:
In file included from ./include/trace/events/xen.h:475:
In file included from ./include/trace/define_trace.h:102:
In file included from ./include/trace/trace_events.h:473:
./include/trace/events/xen.h:69:7: warning: ordered comparison of function \
pointers ('xen_mc_callback_fn_t' (aka 'void (*)(void *)') and 'xen_mc_callback_fn_t') [-Wordered-compare-function-pointers]
__field(xen_mc_callback_fn_t, fn)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/trace/trace_events.h:421:29: note: expanded from macro '__field'
^
./include/trace/trace_events.h:407:6: note: expanded from macro '__field_ext'
is_signed_type(type), filter_type); \
^
./include/linux/trace_events.h:554:44: note: expanded from macro 'is_signed_type'
^
Fixes: c796f213a6934 ("xen/trace: add multicall tracing")
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
This worked before, because we made all callers name their next pointer
"next". But in trying to be more "drop-in" ready, the silliness here is
revealed. This commit fixes the problem by making the macro argument and
the member use different names.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow to call macsec_pn_wrapped from hardware drivers to notify when a
PN rolls over. Some drivers might used an interrupt to implement this.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
MACsec offloading to underlying hardware devices is disabled by default
(the software implementation is used). This patch adds support for
changing this setting through the MACsec netlink interface. Many checks
are done when enabling offloading on a given MACsec interface as there
are limitations (it must be supported by the hardware, only a single
interface can be offloaded on a given physical device at a time, rules
can't be moved for now).
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds a reference to MACsec ops in the phy_device, to allow
PHYs to support offloading MACsec operations. The phydev lock will be
held while calling those helpers.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch introduces MACsec ops for drivers to support offloading
MACsec operations.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch introduces the macsec_context structure. It will be used
in the kernel to exchange information between the common MACsec
implementation (macsec.c) and the MACsec hardware offloading
implementations. This structure contains pointers to MACsec specific
structures which contain the actual MACsec configuration, and to the
underlying device (phydev for now).
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch moves some structure, type and identifier definitions into a
MACsec specific header. This patch does not modify how the MACsec code
is running and only move things around. This is a preparation for the
future MACsec hardware offloading support, which will re-use those
definitions outside macsec.c.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The information about the PHY attached to the PHYLINK instance is useful
but is missing the IRQ prints that phy_attached_info() adds.
phy_attached_info() is a bit long and it would not be possible to use
phylink_info() anyway.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground
Pull asm-generic fixes from Arnd Bergmann:
"Here are two bugfixes from Mike Rapoport, both fixing compile-time
errors for the nds32 architecture that were recently introduced"
* tag 'asm-generic-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
nds32: fix build failure caused by page table folding updates
asm-generic/nds32: don't redefine cacheflush primitives
|
|
Merge misc fixes from David Howells.
Two afs fixes and a key refcounting fix.
* dhowells:
afs: Fix afs_lookup() to not clobber the version on a new dentry
afs: Fix use-after-loss-of-ref
keys: Fix request_key() cache
|
|
afs_lookup() has a tracepoint to indicate the outcome of
d_splice_alias(), passing it the inode to retrieve the fid from.
However, the function gave up its ref on that inode when it called
d_splice_alias(), which may have failed and dropped the inode.
Fix this by caching the fid.
Fixes: 80548b03991f ("afs: Add more tracepoints")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The ops aren't used in any SPI NOR controller. Therefore, remove them
altogether.
Signed-off-by: Michael Walle <michael@walle.cc>
Signed-off-by: Tudor Ambarus <tudor.ambarus@microchip.com>
|
|
The 'interval_sub' is placed on the 'notifier_subscriptions' interval
tree.
This eliminates the poor name 'mni' for this variable.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The 'subscription' is placed on the 'notifier_subscriptions' list.
This eliminates the poor name 'mn' for this variable.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The name mmu_notifier_mm implies that the thing is a mm_struct pointer,
and is difficult to abbreviate. The struct is actually holding the
interval tree and hlist containing the notifiers subscribed to a mm.
Use 'subscriptions' as the variable name for this struct instead of the
really terrible and misleading 'mmn_mm'.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator into asoc-5.6
regulator: add regulator_equal()
|
|
Add regulator_is_equal() helper to compare whether two regulators are
the same. This is useful for checking whether two separate regulators
in a driver are actually the same supply.
Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Igor Opaniuk <igor.opaniuk@toradex.com>
Cc: Liam Girdwood <lgirdwood@gmail.com>
Cc: Marcel Ziswiler <marcel.ziswiler@toradex.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Oleksandr Suvorov <oleksandr.suvorov@toradex.com>
Link: https://lore.kernel.org/r/20191220164450.1395038-1-marex@denx.de
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
I have added the AU6625 PCI_ID to the list of supported IDs:
alcor_pci.c
// Added au6625s ID to the array of supported devices
alcor_pci.h
// Added entry to define the PCI ID
Made it fit in with the already submitted code:
alcor_pci.c
// Added config entry to that matches the one for au6601
>From general usage there seems to be no problems.
Signed-off-by: Rhys Perry <rhysperry111@gmail.com>
Link: https://lore.kernel.org/r/20191229171824.10308-1-rhysperry111@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Some users prefer kdump tools to generate guest kernel dumpfile,
at the same time, need a out-of-band kernel panic event.
Currently if booting guest kernel with 'crash_kexec_post_notifiers',
QEMU will receive PVPANIC_PANICKED event and stop VM. If booting
guest kernel without 'crash_kexec_post_notifiers', guest will not
call notifier chain.
Add PVPANIC_CRASH_LOADED bit for pvpanic event, it means that guest
kernel actually hit a kernel panic, but the guest kernel wants to
handle by itself.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Link: https://lore.kernel.org/r/20200102023513.318836-3-pizhenwei@bytedance.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Some processes outside of the kernel(Ex, QEMU) should know what the
value really is for, so move the bit definition to a uapi file.
Suggested-by: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Link: https://lore.kernel.org/r/20200102023513.318836-2-pizhenwei@bytedance.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
API to set time namespace offsets for children processes, i.e.:
echo "$clockid $offset_sec $offset_nsec" > /proc/self/timens_offsets
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-28-dima@arista.com
|
|
The VVAR page layout depends on whether a task belongs to the root or
non-root time namespace. Whenever a task changes its namespace, the VVAR
page tables are cleared and then they will be re-faulted with a
corresponding layout.
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-27-dima@arista.com
|
|
VDSO support for Time namespace needs to set up a page with the same
layout as VVAR. That timens page will be placed on position of VVAR page
inside namespace. That page contains time namespace clock offsets and it
has vdso_data->seq set to 1 to enforce the slow path and
vdso_data->clock_mode set to VCLOCK_TIMENS to enforce the time namespace
handling path.
Allocate the timens page during namespace creation. Setup the offsets
when the first task enters the ns and freeze them to guarantee the pace
of monotonic/boottime clocks and to avoid breakage of applications.
The design decision is to have a global offset_lock which is used during
namespace offsets setup and to freeze offsets when the first task joins the
new time namespace. That is better in terms of memory usage compared to
having a per namespace mutex that's used only during the setup period.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Based-on-work-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-24-dima@arista.com
|
|
VDSO support for time namespaces needs to set up a page with the same
layout as VVAR. That timens page will be placed on position of VVAR page
inside namespace. That page has vdso_data->seq set to 1 to enforce
the slow path and vdso_data->clock_mode set to VCLOCK_TIMENS to enforce
the time namespace handling path.
To prepare the time namespace page the kernel needs to know the vdso_data
offset. Provide arch_get_vdso_data() helper for locating vdso_data on VVAR
page.
Co-developed-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-22-dima@arista.com
|
|
To support time namespaces in the vdso with a minimal impact on regular non
time namespace affected tasks, the namespace handling needs to be hidden in
a slow path.
The most obvious place is vdso_seq_begin(). If a task belongs to a time
namespace then the VVAR page which contains the system wide vdso data is
replaced with a namespace specific page which has the same layout as the
VVAR page. That page has vdso_data->seq set to 1 to enforce the slow path
and vdso_data->clock_mode set to VCLOCK_TIMENS to enforce the time
namespace handling path.
The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
update of the vdso data is in progress, is not really affecting regular
tasks which are not part of a time namespace as the task is spin waiting
for the update to finish and vdso_data->seq to become even again.
If a time namespace task hits that code path, it invokes the corresponding
time getter function which retrieves the real VVAR page, reads host time
and then adds the offset for the requested clock which is stored in the
special VVAR page.
If VDSO time namespace support is disabled the whole magic is compiled out.
Initial testing shows that the disabled case is almost identical to the
host case which does not take the slow timens path. With the special timens
page installed the performance hit is constant time and in the range of
5-7%.
For the vdso functions which are not using the sequence count an
unconditional check for vdso_data->clock_mode is added which switches to
the real vdso when the clock_mode is VCLOCK_TIMENS.
[avagin: Make do_hres_timens() work with raw clocks too: choose vdso_data
pointer by CS_RAW offset.]
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-21-dima@arista.com
|
|
clock_nanosleep() accepts absolute values of expiration time when
TIMER_ABSTIME flag is set. This absolute value is inside the task's
time namespace, and has to be converted to the host's time.
There is timens_ktime_to_host() helper for converting time, but
it accepts ktime argument.
As a preparation, make hrtimer_nanosleep() accept a clock value in ktime
instead of timespec64.
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-17-dima@arista.com
|
|
The helper subtracts namespace's clock offset from the given time
and ensures that the result is within [0, KTIME_MAX].
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-13-dima@arista.com
|
|
Introduce offsets for time namespace. They will contain an adjustment
needed to convert clocks to/from host's.
A new namespace is created with the same offsets as the time namespace
of the current process.
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-5-dima@arista.com
|
|
Time Namespace isolates clock values.
The kernel provides access to several clocks CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
CLOCK_REALTIME
System-wide clock that measures real (i.e., wall-clock) time.
CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since
some unspecified starting point.
CLOCK_BOOTTIME
Identical to CLOCK_MONOTONIC, except it also includes any time
that the system is suspended.
For many users, the time namespace means the ability to changes date and
time in a container (CLOCK_REALTIME). Providing per namespace notions of
CLOCK_REALTIME would be complex with a massive overhead, but has a dubious
value.
But in the context of checkpoint/restore functionality, monotonic and
boottime clocks become interesting. Both clocks are monotonic with
unspecified starting points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, it has to be
guaranteed that they never go backward. In an ideal case, the behavior of
these clocks should be the same as for a case when a whole system is
suspended. All this means that it is required to set CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
offsets for clocks.
A time namespace is similar to a pid namespace in the way how it is
created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it to the current process. Then all children of the process
will be born in the new time namespace, or a process can use the setns()
system call to join a namespace.
This scheme allows setting clock offsets for a namespace, before any
processes appear in it.
All available clone flags have been used, so CLONE_NEWTIME uses the highest
bit of CSIGNAL. It means that it can be used only with the unshare() and
the clone3() system calls.
[ tglx: Adjusted paragraph about clone3() to reality and massaged the
changelog a bit. ]
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com
|
|
Place the branch with no concurrent write before the contended case.
Performance numbers for Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
(more clock_gettime() cycles - the better):
| before | after
-----------------------------------
| 150252214 | 153242367
| 150301112 | 153324800
| 150392773 | 153125401
| 150373957 | 153399355
| 150303157 | 153489417
| 150365237 | 153494270
-----------------------------------
avg | 150331408 | 153345935
diff % | 2 | 0
-----------------------------------
stdev % | 0.3 | 0.1
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lore.kernel.org/r/20191112012724.250792-2-dima@arista.com
|
|
If the programming of the dev_number fails due to an IO error, a new
device_number will be assigned, resulting in a leak.
Make sure we only assign a device_number once per Slave device.
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Link: https://lore.kernel.org/r/20200113225637.17313-1-pierre-louis.bossart@linux.intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
Allow DisplayPort PHYs to be configured through the generic
functions through a custom structure added to the generic union.
The configuration structure is used for reconfiguration of
DisplayPort PHYs during link training operation.
The parameters added here are the ones defined in the DisplayPort
spec v1.4 which include link rate, number of lanes, voltage swing
and pre-emphasis.
Add the DisplayPort phy mode to the generic phy_mode enum.
Signed-off-by: Yuti Amonkar <yamonkar@cadence.com>
Reviewed-by: Maxime Ripard <mripard@kernel.org>
Reviewed-by: Jyri Sarha <jsarha@ti.com>
Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
|
|
Adds the initial hooks for TBS support. This needs a 32 byte descriptor
in order for it to work with current HW. Adds all the logic for Enhanced
Descriptors in main core but no HW related logic for now.
Changes from v2:
- Use bitfield for TBS status / support (Jakub)
- Remove unneeded cache alignment (Jakub)
- Fix checkpatch issues
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem)
FS") introduced a new khugepaged scan result: SCAN_PAGE_HAS_PRIVATE, but
the corresponding description for trace events were not added.
Link: http://lkml.kernel.org/r/1574793844-2914-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
debugging") has introduced a static key to reduce overhead when
debug_pagealloc is compiled in but not enabled. It relied on the
assumption that jump_label_init() is called before parse_early_param()
as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
it is safe to enable the static key.
However, it turns out multiple architectures call parse_early_param()
earlier from their setup_arch(). x86 also calls jump_label_init() even
earlier, so no issue was found while testing the commit, but same is not
true for e.g. ppc64 and s390 where the kernel would not boot with
debug_pagealloc=on as found by our QA.
To fix this without tricky changes to init code of multiple
architectures, this patch partially reverts the static key conversion
from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
code) of debug_pagealloc_enabled() will again test a simple bool
variable. Fastpath mm code is converted to a new
debug_pagealloc_enabled_static() variant that relies on the static key,
which is enabled in a well-defined point in mm_init() where it's
guaranteed that jump_label_init() has been called, regardless of
architecture.
[sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently slab percpu vmstats are flushed twice: during the memcg
offlining and just before freeing the memcg structure. Each time percpu
counters are summed, added to the atomic counterparts and propagated up
by the cgroup tree.
The second flushing is required due to how recursive vmstats are
implemented: counters are batched in percpu variables on a local level,
and once a percpu value is crossing some predefined threshold, it spills
over to atomic values on the local and each ascendant levels. It means
that without flushing some numbers cached in percpu variables will be
dropped on floor each time a cgroup is destroyed. And with uptime the
error on upper levels might become noticeable.
The first flushing aims to make counters on ancestor levels more
precise. Dying cgroups may resume in the dying state for a long time.
After kmem_cache reparenting which is performed during the offlining
slab counters of the dying cgroup don't have any chances to be updated,
because any slab operations will be performed on the parent level. It
means that the inaccuracy caused by percpu batching will not decrease up
to the final destruction of the cgroup. By the original idea flushing
slab counters during the offlining should minimize the visible
inaccuracy of slab counters on the parent level.
The problem is that percpu counters are not zeroed after the first
flushing. So every cached percpu value is summed twice. It creates a
small error (up to 32 pages per cpu, but usually less) which accumulates
on parent cgroup level. After creating and destroying of thousands of
child cgroups, slab counter on parent level can be way off the real
value.
For now, let's just stop flushing slab counters on memcg offlining. It
can't be done correctly without scheduling a work on each cpu: reading
and zeroing it during css offlining can race with an asynchronous
update, which doesn't expect values to be changed underneath.
With this change, slab counters on parent level will become eventually
consistent. Once all dying children are gone, values are correct. And
if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.
It's not perfect, as slab are reparented, so any updates after the
reparenting will happen on the parent level. It means that if a slab
page was allocated, a counter on child level was bumped, then the page
was reparented and freed, the annihilation of positive and negative
counter values will not happen until the child cgroup is released. It
makes slab counters different from others, and it might want us to
implement flushing in a correct form again. But it's also a question of
performance: scheduling a work on each cpu isn't free, and it's an open
question if the benefit of having more accurate counters is worth it.
We might also consider flushing all counters on offlining, not only slab
counters.
So let's fix the main problem now: make the slab counters eventually
consistent, so at least the error won't grow with uptime (or more
precisely the number of created and destroyed cgroups). And think about
the accuracy of counters separately.
Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 0bf7800f1799 ("ptr_ring: try vmalloc() when kmalloc() fails")
started to use kvmalloc_array and kvfree, which are defined in mm.h,
the previous functions kcalloc and kfree, which are defined in slab.h.
Add the missing include of linux/mm.h. This went unnoticed as other
include files happened to include mm.h.
Fixes: 0bf7800f1799 ("ptr_ring: try vmalloc() when kmalloc() fails")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|