Age | Commit message (Collapse) | Author |
|
SMCCC callers are currently amassing a collection of enums for the SMCCC
conduit, and are having to dig into the PSCI driver's internals in order
to figure out what to do.
Let's clean this up, with common SMCCC_CONDUIT_* definitions, and an
arm_smccc_1_1_get_conduit() helper that abstracts the PSCI driver's
internal state.
We can kill off the PSCI_CONDUIT_* definitions once we've migrated users
over to the new interface.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Illegal memory will be touch if SDMA_SCRIPT_ADDRS_ARRAY_SIZE_V3
(41) exceed the size of structure sdma_script_start_addrs(40),
thus cause memory corrupt such as slob block header so that kernel
trap into while() loop forever in slob_free(). Please refer to below
code piece in imx-sdma.c:
for (i = 0; i < sdma->script_number; i++)
if (addr_arr[i] > 0)
saddr_arr[i] = addr_arr[i]; /* memory corrupt here */
That issue was brought by commit a572460be9cf ("dmaengine: imx-sdma: Add
support for version 3 firmware") because SDMA_SCRIPT_ADDRS_ARRAY_SIZE_V3
(38->41 3 scripts added) not align with script number added in
sdma_script_start_addrs(2 scripts).
Fixes: a572460be9cf ("dmaengine: imx-sdma: Add support for version 3 firmware")
Cc: stable@vger.kernel
Link: https://www.spinics.net/lists/arm-kernel/msg754895.html
Signed-off-by: Robin Gong <yibin.gong@nxp.com>
Reported-by: Jurgen Lambrecht <J.Lambrecht@TELEVIC.com>
Link: https://lore.kernel.org/r/1569347584-3478-1-git-send-email-yibin.gong@nxp.com
[vkoul: update the patch title]
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
We want the staging driver fixes in here as well to build on and test
with.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Both tcp_v4_err() and tcp_v6_err() do the following operations
while they do not own the socket lock :
fastopen = tp->fastopen_rsk;
snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;
The problem is that without appropriate barrier, the compiler
might reload tp->fastopen_rsk and trigger a NULL deref.
request sockets are protected by RCU, we can simply add
the missing annotations and barriers to solve the issue.
Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
- Update/fix inspur-ipsps1 and k10temp Documentation
- Fix nct7904 driver
- Fix HWMON_P_MIN_ALARM mask in hwmon core
* tag 'hwmon-for-v5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: docs: Extend inspur-ipsps1 title underline
hwmon: (nct7904) Add array fan_alarm and vsen_alarm to store the alarms in nct7904_data struct.
docs: hwmon: Include 'inspur-ipsps1.rst' into docs
hwmon: Fix HWMON_P_MIN_ALARM mask
hwmon: (k10temp) Update documentation and add temp2_input info
hwmon: (nct7904) Fix the incorrect value of vsen_mask in nct7904_data struct
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-next
Jonathan writes:
First set of IIO new device support, cleanups and features for the 5.5 cycle
Third version with the adis rework set dropped as better to just have
a fresh version of that at some future date.
The usual mixed backs of new device support being added to drivers,
long term reworks continuing and little per driver cleanups and
features.
Also a few trivial counter subsystem tidy ups on behalf of William.
Core new feature
* Device label support. A long requested feature no one got around to
implementing before. Allows DT based provision of a 'label' that
identifies a device uniquely within a system. This differs from existing
'name' which is meant to be the part number.
New device support
* ingenic-adc
- Support for the JZ4770 SoC ADC including bindings.
* inv_mpu6050
- Add support for magnetometer in MPU925x parts.
Fiddly to do as this is actually a separate device sitting inside the
package, but with the master device being able to schedule reads etc.
Will only run if the auxiliary bus is not in use for any other devices.
Features
* ad7192
- Userspace calibration controls to do zero and full scale.
* st_lsm6dsx
- Enable latched interrupts by default for sensors events with related clear.
- Motion events and related wakeup source. This needed quite a bit of
refactoring as well.
Cleanups and minor features
* ad7192
- sysfs ABI docs
* ad7949
- Remove code to readback configuration word as driver never actually enabled
it.
- Fix incorrect xfer length. Not actually known to cause problems other
than wasted bus usage.
* adis16080
- Replace core mlock usage with local lock with more appropriate scope.
* adis16130
- Remove pointless mlock usage.
* adis16240
- Remove include of gpio.h as no gpio usage.
* atlas-ph-sensor
- Improve logical ordering of buffer predisable / postenable functions.
This is part of a longer term rework Alexandru is driving towards.
* bh1750
- Fix up a static compiler warning and make the code more readable.
- yaml conversion of binding + MAINTAINERS entry.
* bmp280
- Drop a stray newline.
* cm36651
- Drop a redundant assignment
* itg3200
- Alignment cleanup.
* max31856
- Add missing of_node and parent references, useful to identify the device.
* sc27xx_adc
- Use devm_hwspin_lock_request_specific rather than local rolled version.
* stm32-lptimer counter
- kernel-doc warning.
* stm32-timer counter
- kernel-doc warning.
- Alignment cleanup.
* sx9500
- Improve logical ordering of buffer predisable / postenable functions.
This is part of a longer term rework Alexandru is driving towards.
* tcs3414
- Improve logical ordering of buffer predisable / postenable functions.
This is part of a longer term rework Alexandru is driving towards.
* tag 'iio-for-5.5a-take3' of https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio: (41 commits)
iio: pressure: bmp280: remove stray newline
iio: adc: sc27xx: Use devm_hwspin_lock_request_specific() to simplify code
iio: chemical: atlas-ph-sensor: fix iio_triggered_buffer_predisable() position
iio: gyro: clean up indentation issue
counter: stm32: clean up indentation issue
iio: proximity: sx9500: fix iio_triggered_buffer_{predisable,postenable} positions
iio: core: Add optional symbolic label to device attributes
dt-binding: iio: Add optional label property
iio: gyro: adis16080: replace mlock with own lock
counter: stm32-lptimer-cnt: fix a kernel-doc warning
counter: stm32-timer-cnt: fix a kernel-doc warning
iio: gyro: adis16130: remove mlock usage
MAINTAINERS: add entry for ROHM BH1750 driver
dt-bindings: iio: light: bh1750: convert bindings to yaml
iio: imu: st_lsm6dsx: add motion report function and call from interrupt
iio: imu: st_lsm6dsx: always enter interrupt thread
iio: imu: st_lsm6dsx: add wakeup-source option
iio: imu: st_lsm6dsx: add motion events
iio: imu: st_lsm6dsx: move interrupt thread to core
iio: imu: inv_mpu6050: add fifo support for magnetometer data
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a lot of small USB driver fixes for 5.4-rc3.
syzbot has stepped up its testing of the USB driver stack, now able to
trigger fun race conditions between disconnect and probe functions.
Because of that we have a lot of fixes in here from Johan and others
fixing these reported issues that have been around since almost all
time.
We also are just deleting the rio500 driver, making all of the syzbot
bugs found in it moot as it turns out no one has been using it for
years as there is a userspace version that is being used instead.
There are also a number of other small fixes in here, all resolving
reported issues or regressions.
All have been in linux-next without any reported issues"
* tag 'usb-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (65 commits)
USB: yurex: fix NULL-derefs on disconnect
USB: iowarrior: use pr_err()
USB: iowarrior: drop redundant iowarrior mutex
USB: iowarrior: drop redundant disconnect mutex
USB: iowarrior: fix use-after-free after driver unbind
USB: iowarrior: fix use-after-free on release
USB: iowarrior: fix use-after-free on disconnect
USB: chaoskey: fix use-after-free on release
USB: adutux: fix use-after-free on release
USB: ldusb: fix NULL-derefs on driver unbind
USB: legousbtower: fix use-after-free on release
usb: cdns3: Fix for incorrect DMA mask.
usb: cdns3: fix cdns3_core_init_role()
usb: cdns3: gadget: Fix full-speed mode
USB: usb-skeleton: drop redundant in-urb check
USB: usb-skeleton: fix use-after-free after driver unbind
USB: usb-skeleton: fix NULL-deref on disconnect
usb:cdns3: Fix for CV CH9 running with g_zero driver.
usb: dwc3: Remove dev_err() on platform_get_irq() failure
usb: dwc3: Switch to platform_get_irq_byname_optional()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
"Misc EFI fixes all across the map: CPER error report fixes, fixes to
TPM event log parsing, fix for a kexec hang, a Sparse fix and other
fixes"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/tpm: Fix sanity check of unsigned tbl_size being less than zero
efi/x86: Do not clean dummy variable in kexec path
efi: Make unexported efi_rci2_sysfs_init() static
efi/tpm: Only set 'efi_tpm_final_log_size' after successful event log parsing
efi/tpm: Don't traverse an event log with no events
efi/tpm: Don't access event->count when it isn't mapped
efivar/ssdt: Don't iterate over EFI vars if no SSDT override was specified
efi/cper: Fix endianness of PCIe class code
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"A handful of fixes: a kexec linking fix, an AMD MWAITX fix, a vmware
guest support fix when built under Clang, and new CPU model number
definitions"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add Comet Lake to the Intel CPU models header
lib/string: Make memzero_explicit() inline instead of external
x86/cpu/vmware: Use the full form of INL in VMWARE_PORT
x86/asm: Fix MWAITX C-state hint value
|
|
Pull NFS client bugfixes from Anna Schumaker:
"Stable bugfixes:
- Fix O_DIRECT accounting of number of bytes read/written # v4.1+
Other fixes:
- Fix nfsi->nrequests count error on nfs_inode_remove_request()
- Remove redundant mirror tracking in O_DIRECT
- Fix leak of clp->cl_acceptor string
- Fix race to sk_err after xs_error_report"
* tag 'nfs-for-5.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
SUNRPC: fix race to sk_err after xs_error_report
NFSv4: Fix leak of clp->cl_acceptor string
NFS: Remove redundant mirror tracking in O_DIRECT
NFS: Fix O_DIRECT accounting of number of bytes read/written
nfs: Fix nfsi->nrequests count error on nfs_inode_remove_request
|
|
Do not risk spanning these small structures on two cache lines.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191011181140.2898-1-edumazet@google.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull module fixes from Jessica Yu:
"Code cleanups and kbuild/namespace related fixups from Masahiro.
Most importantly, it fixes a namespace-related modpost issue for
external module builds
- Fix broken external module builds due to a modpost bug in
read_dump(), where the namespace was not being strdup'd and
sym->namespace would be set to bogus data.
- Various namespace-related kbuild fixes and cleanups thanks to
Masahiro Yamada"
* tag 'modules-for-v5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
doc: move namespaces.rst from kbuild/ to core-api/
nsdeps: make generated patches independent of locale
nsdeps: fix hashbang of scripts/nsdeps
kbuild: fix build error of 'make nsdeps' in clean tree
module: rename __kstrtab_ns_* to __kstrtabns_* to avoid symbol conflict
modpost: fix broken sym->namespace for external module builds
module: swap the order of symbol.namespace
scripts: add_namespace: Fix coccicheck failed
|
|
Reserve the pseudo keyword 'fallthrough' for the ability to convert the
various case block /* fallthrough */ style comments to appear to be an
actual reserved word with the same gcc case block missing fallthrough
warning capability.
All switch/case blocks now should end in one of:
break;
fallthrough;
goto <label>;
return [expression];
continue;
In C mode, GCC supports the __fallthrough__ attribute since 7.1,
the same time the warning and the comment parsing were introduced.
fallthrough devolves to an empty "do {} while (0)" if the compiler
version (any version less than gcc 7) does not support the attribute.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Adding description for the device_is_available member which
was missing, and fixing the description of the member
property_read_int_array.
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The prefix is used for printing purpose before a node, and it also works
as a separator between two nodes.
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: Rob Herring <robh@kernel.org> (for OF)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The fwnode framework did not have means to obtain the name of a node. Add
that now, in form of the fwnode_get_name() function and a corresponding
get_name fwnode op. OF and ACPI support is included.
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: Rob Herring <robh@kernel.org> (for OF)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Add two convenience functions for accessing node's parents:
fwnode_count_parents() returns the number of parent nodes a given node
has. fwnode_get_nth_parent() returns node's parent at a given distance
from the node itself.
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
to_software_node() does not need to modify the fwnode_handle it operates
on; therefore make it const. This allows passing a const fwnode_handle to
to_software_node().
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for 5.5:
UAPI Changes:
-Colorspace: Expose different prop values for DP vs. HDMI (Gwan-gyeong Mun)
-fourcc: Add DRM_FORMAT_MOD_ARM_16X16_BLOCK_U_INTERLEAVED (Raymond)
-not_actually: s/ENOTSUPP/EOPNOTSUPP/ in drm_edid and drm_mipi_dbi. This should
not reach userspace, but adding here to specifically call that out (Daniel)
-i810: Prevent underflow in dispatch ioctls (Dan)
-komeda: Add ACLK sysfs attribute (Mihail)
-v3d: Allow userspace to clean up after render jobs (Iago)
Cross-subsystem Changes:
-MAINTAINERS:
-Add Alyssa & Steven as panfrost reviewers (Rob)
-Add Jernej as DE2 reviewer (Maxime)
-Add Chen-Yu as Allwinner maintainer (Maxime)
-staging: Make some stack arrays static const (Colin)
Core Changes:
-ttm: Allow drivers to specify their vma manager (to use gem mgr) (Gerd)
-docs: Various fixes in connector/encoder/bridge docs (Daniel, Lyude, Laurent)
-connector: Allow more than 3 possible encoders for a connector (José)
-dp_cec: Allow a connector to be associated with a cec device (Dariusz)
-various: Fix some compile/sparse warnings (Ville)
-mm: Ensure mm node removals are properly serialised (Chris)
-panel: Specify the type of panel for drm_panels for later use (Laurent)
-panel: Use drm_panel_init to init device and funcs (Laurent)
-mst: Refactors and cleanups in anticipation of suspend/resume support (Lyude)
-vram:
-Add lazy unmapping for gem bo's (Thomas)
-Unify and rationalize vram mm and gem vram (Thomas)
-Expose vmap and vunmap for gem vram objects (Thomas)
-Allow objects to be pinned at the top of vram to avoid fragmentation (Thomas)
Driver Changes:
-various: Include drm_bridge.h instead of relying on drm_crtc.h (Boris)
-ast/mgag200: Refactor show_cursor(), move cursor to top of video mem (Thomas)
-komeda:
-Add error event printing (behind CONFIG) and reg dump support (Lowry)
-Add suspend/resume support (Lowry)
-Workaround D71 shadow registers not flushing on disable (Lowry)
-meson: Add suspend/resume support (Neil)
-omap: Miscellaneous refactors and improvements (Tomi/Jyri)
-panfrost/shmem: Silence lockdep by using mutex_trylock (Rob)
-panfrost: Miscellaneous small fixes (Rob/Steven)
-sti: Fix warnings (Benjamin/Linus)
-sun4i:
-Add vcc-dsi regulator to sun6i_mipi_dsi (Jagan)
-A few patches to figure out the DRQ/start delay calc on dsi (Jagan/Icenowy)
-virtio:
-Add module param to switch resource reuse workaround on/off (Gerd)
-Avoid calling vmexit while holding spinlock (Gerd)
-Use gem shmem helpers instead of ttm (Gerd)
-Accommodate command buffer allocations too big for cma (David)
Cc: Rob Herring <robh@kernel.org>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: Dariusz Marcinkiewicz <darekm@google.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Raymond Smith <raymond.smith@arm.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mihail Atanassov <Mihail.Atanassov@arm.com>
Cc: Lowry Li <Lowry.Li@arm.com>
Cc: Neil Armstrong <narmstrong@baylibre.com>
Cc: Jyri Sarha <jsarha@ti.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Benjamin Gaignard <benjamin.gaignard@st.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Jagan Teki <jagan@amarulasolutions.com>
Cc: Icenowy Zheng <icenowy@aosc.io>
Cc: Iago Toral Quiroga <itoral@igalia.com>
Cc: David Riley <davidriley@chromium.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>
# gpg: Signature made Thu 10 Oct 2019 01:00:47 AM AEST
# gpg: using RSA key 732C002572DCAF79
# gpg: Can't check signature: public key not found
# Conflicts:
# drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
# drivers/gpu/drm/i915/i915_drv.c
# drivers/gpu/drm/i915/i915_gem.c
# drivers/gpu/drm/i915/i915_gem_gtt.c
# drivers/gpu/drm/i915/i915_vma.c
From: Sean Paul <sean@poorly.run>
Link: https://patchwork.freedesktop.org/patch/msgid/20191009150825.GA227673@art_vandelay
|
|
Afaict, the struct seccomp_data argument to secure_computing() is unused
by all current callers. So let's remove it.
The argument was added in [1]. It was added because having the arch
supply the syscall arguments used to be faster than having it done by
secure_computing() (cf. Andy's comment in [2]). This is not true anymore
though.
/* References */
[1]: 2f275de5d1ed ("seccomp: Add a seccomp_data parameter secure_computing()")
[2]: https://lore.kernel.org/r/CALCETrU_fs_At-hTpr231kpaAd0z7xJN4ku-DvzhRU6cvcJA_w@mail.gmail.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: x86@kernel.org
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20190924064420.6353-1-christian.brauner@ubuntu.com
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Since commit 4f8943f80883 ("SUNRPC: Replace direct task wakeups from
softirq context") there has been a race to the value of the sk_err if both
XPRT_SOCK_WAKE_ERROR and XPRT_SOCK_WAKE_DISCONNECT are set. In that case,
we may end up losing the sk_err value that existed when xs_error_report was
called.
Fix this by reverting to the previous behavior: instead of using SO_ERROR
to retrieve the value at a later time (which might also return sk_err_soft),
copy the sk_err value onto struct sock_xprt, and use that value to wake
pending tasks.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 4f8943f80883 ("SUNRPC: Replace direct task wakeups from softirq context")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
When both PCI and OF are disabled, no drivers are registered, and
we get some unused-function warnings:
drivers/crypto/inside-secure/safexcel.c:1221:13: error: unused function 'safexcel_unregister_algorithms' [-Werror,-Wunused-function]
static void safexcel_unregister_algorithms(struct safexcel_crypto_priv *priv)
drivers/crypto/inside-secure/safexcel.c:1307:12: error: unused function 'safexcel_probe_generic' [-Werror,-Wunused-function]
static int safexcel_probe_generic(void *pdev,
drivers/crypto/inside-secure/safexcel.c:1531:13: error: unused function 'safexcel_hw_reset_rings' [-Werror,-Wunused-function]
static void safexcel_hw_reset_rings(struct safexcel_crypto_priv *priv)
It's better to make the compiler see what is going on and remove
such ifdef checks completely. In case of PCI, this is trivial since
pci_register_driver() is defined to an empty function that makes the
compiler subsequently drop all unused code silently.
The global pcireg_rc/ofreg_rc variables are not actually needed here
since the driver registration does not fail in ways that would make
it helpful.
For CONFIG_OF, an IS_ENABLED() check is still required, since platform
drivers can exist both with and without it.
A little change to linux/pci.h is needed to ensure that
pcim_enable_device() is visible to the driver. Moving the declaration
outside of ifdef would be sufficient here, but for consistency with the
rest of the file, adding an inline helper is probably best.
Fixes: 212ef6f29e5b ("crypto: inside-secure - Fix unused variable warning when CONFIG_PCI=n")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci.h
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Soc framework exposed sysfs entries are not sufficient for some
of the h/w platforms. Currently there is no interface where soc
drivers can expose further information about their SoCs via soc
framework. This change address this limitation where clients can
pass their custom entries as attribute group and soc framework
would expose them as sysfs properties.
Signed-off-by: Murali Nalajala <mnalajal@codeaurora.org>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/1570480662-25252-1-git-send-email-mnalajal@codeaurora.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Lots of fixes to kernel-doc in structs, enums, and functions.
Also add header files that are being used but not yet #included.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Yamin Friedman <yaminf@mellanox.com>
Cc: Tal Gilboa <talgi@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull in a dependency for Vladimir's work on more precise
packet time stamping.
Mark Brown says:
====================
spi: Add a PTP API
For detailed timestamping of operations.
====================
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
All the defined symbols from linux/platform_data/pixcir_i2c_ts.h
are only used by the pixcir_i2c_ts driver, so move all the definitions
locally and get rid of the pixcir_i2c_ts.h file.
Signed-off-by: Fabio Estevam <festevam@gmail.com>
Reviewed-by: Roger Quadros <rogerq@ti.com>
Tested-by: Michal Vokáč <michal.vokac@ysoft.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
|
When the source server reboots after a server-to-server copy was
issued, we need to retry the copy from COPY_NOTIFY. We need to
detect that the source server rebooted and there is a copy waiting
on a destination server and wake it up.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
|
|
Support only one source server address: the same address that
the client and source server use.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
|
|
Try using the delegation stateid, then the open stateid.
Only NL4_NETATTR, No support for NL4_NAME and NL4_URL.
Allow only one source server address to be returned for now.
To distinguish between same server copy offload ("intra") and
a copy between different server ("inter"), do a check of server
owner identity and also make sure server is capable of doing
a copy offload.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
|
|
These structures are needed by COPY_NOTIFY on the client and needed
by the nfsd as well
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
|
|
TI SoCs hardware reset signals require the parent clockdomain to be
in force wakeup mode while de-asserting the reset, otherwise it may
never complete. To support this, add pdata hooks to control the
clockdomain directly.
Signed-off-by: Tero Kristo <t-kristo@ti.com>
Reviewed-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
|
|
Since the following commit:
b4adfe8e05f1 ("locking/lockdep: Remove unused argument in __lock_release")
@nested is no longer used in lock_release(), so remove it from all
lock_release() calls and friends.
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: airlied@linux.ie
Cc: akpm@linux-foundation.org
Cc: alexander.levin@microsoft.com
Cc: daniel@iogearbox.net
Cc: davem@davemloft.net
Cc: dri-devel@lists.freedesktop.org
Cc: duyuyang@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: hannes@cmpxchg.org
Cc: intel-gfx@lists.freedesktop.org
Cc: jack@suse.com
Cc: jlbec@evilplan.or
Cc: joonas.lahtinen@linux.intel.com
Cc: joseph.qi@linux.alibaba.com
Cc: jslaby@suse.com
Cc: juri.lelli@redhat.com
Cc: maarten.lankhorst@linux.intel.com
Cc: mark@fasheh.com
Cc: mhocko@kernel.org
Cc: mripard@kernel.org
Cc: ocfs2-devel@oss.oracle.com
Cc: rodrigo.vivi@intel.com
Cc: sean@poorly.run
Cc: st@kernel.org
Cc: tj@kernel.org
Cc: tytso@mit.edu
Cc: vdavydov.dev@gmail.com
Cc: vincent.guittot@linaro.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
On context switch we are locking the vtime seqcount of the scheduling-out
task twice:
* On vtime_task_switch_common(), when we flush the pending vtime through
vtime_account_system()
* On arch_vtime_task_switch() to reset the vtime state.
This is pointless as these actions can be performed without the need
to unlock/lock in the middle. The reason these steps are separated is to
consolidate a very small amount of common code between
CONFIG_VIRT_CPU_ACCOUNTING_GEN and CONFIG_VIRT_CPU_ACCOUNTING_NATIVE.
Performance in this fast path is definitely a priority over artificial
code factorization so split the task switch code between GEN and
NATIVE and mutualize the parts than can run under a single seqcount
locked block.
As a side effect, vtime_account_idle() becomes included in the seqcount
protection. This happens to be a welcome preparation in order to
properly support kcpustat under vtime in the future and fetch
CPUTIME_IDLE without race.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191003161745.28464-3-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
vtime_account_system() decides if we need to account the time to the
system (__vtime_account_system()) or to the guest (vtime_account_guest()).
So this function is a misnomer as we are on a higher level than
"system". All we know when we call that function is that we are
accounting kernel cputime. Whether it belongs to guest or system time
is a lower level detail.
Rename this function to vtime_account_kernel(). This will clarify things
and avoid too many underscored vtime_account_system() versions.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191003161745.28464-2-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This reverts commit 0ad646c81b2182f7fa67ec0c8c825e0ee165696d.
As noticed by Jakub, this is no longer needed after
commit 11fc7d5a0a2d ("tun: fix memory leak in error path")
This no longer exports dev_get_valid_name() for the exclusive
use of tun driver.
Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
commit 99356b03b431 ("soc: qcom: Make llcc-qcom a generic driver") move
these out of llcc-qcom.h, make the building fails:
drivers/edac/qcom_edac.c:86:40: error: array type has incomplete element type struct llcc_edac_reg_data
static const struct llcc_edac_reg_data edac_reg_data[] = {
^~~~~~~~~~~~~
drivers/edac/qcom_edac.c:87:3: error: array index in non-array initializer
[LLCC_DRAM_CE] = {
^~~~~~~~~~~~
drivers/edac/qcom_edac.c:87:3: note: (near initialization for edac_reg_data)
drivers/edac/qcom_edac.c:88:3: error: field name not in record or union initializer
.name = "DRAM Single-bit",
...
drivers/edac/qcom_edac.c:169:51: warning: struct llcc_drv_data declared inside parameter
list will not be visible outside of this definition or declaration
qcom_llcc_clear_error_status(int err_type, struct llcc_drv_data *drv)
^~~~~~~~~~~~~
This patch move the needed definitions back to include.
Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: 99356b03b431 ("soc: qcom: Make llcc-qcom a generic driver")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds
Pull LED fixes from Jacek Anaszewski:
- fix a leftover from earlier stage of development in the documentation
of recently added led_compose_name() and fix old mistake in the
documentation of led_set_brightness_sync() parameter name.
- MAINTAINERS: add pointer to Pavel Machek's linux-leds.git tree.
Pavel is going to take over LED tree maintainership from myself.
* tag 'led-fixes-for-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds:
Add my linux-leds branch to MAINTAINERS
leds: core: Fix leds.h structure documentation
|
|
Update the leds.h structure documentation to define the
correct arguments.
Signed-off-by: Dan Murphy <dmurphy@ti.com>
Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
|
|
of_pci_range_parser_one() has a bug when parsing dma-ranges. When it
translates the parent address (aka cpu address in the code), 'ranges' is
always being used. This happens to work because most users are just 1:1
translation.
Cc: Robin Murphy <robin.murphy@arm.com>
Tested-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Reviewed-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
of_dma_get_range() is only used within the DT core code, so remove the
export and move the header declaration to the private header.
Cc: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Reviewed-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
of_find_matching_node_by_address() is unused, so remove it.
Cc: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Reviewed-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
SPI is one of the interfaces used to access devices which have a POSIX
clock driver (real time clocks, 1588 timers etc). The fact that the SPI
bus is slow is not what the main problem is, but rather the fact that
drivers don't take a constant amount of time in transferring data over
SPI. When there is a high delay in the readout of time, there will be
uncertainty in the value that has been read out of the peripheral.
When that delay is constant, the uncertainty can at least be
approximated with a certain accuracy which is fine more often than not.
Timing jitter occurs all over in the kernel code, and is mainly caused
by having to let go of the CPU for various reasons such as preemption,
servicing interrupts, going to sleep, etc. Another major reason is CPU
dynamic frequency scaling.
It turns out that the problem of retrieving time from a SPI peripheral
with high accuracy can be solved by the use of "PTP system
timestamping" - a mechanism to correlate the time when the device has
snapshotted its internal time counter with the Linux system time at that
same moment. This is sufficient for having a precise time measurement -
it is not necessary for the whole SPI transfer to be transmitted "as
fast as possible", or "as low-jitter as possible". The system has to be
low-jitter for a very short amount of time to be effective.
This patch introduces a PTP system timestamping mechanism in struct
spi_transfer. This is to be used by SPI device drivers when they need to
know the exact time at which the underlying device's time was
snapshotted. More often than not, SPI peripherals have a very exact
timing for when their SPI-to-interconnect bridge issues a transaction
for snapshotting and reading the time register, and that will be
dependent on when the SPI-to-interconnect bridge figures out that this
is what it should do, aka as soon as it sees byte N of the SPI transfer.
Since spi_device drivers are the ones who'd know best how the peripheral
behaves in this regard, expose a mechanism in spi_transfer which allows
them to specify which word (or word range) from the transfer should be
timestamped.
Add a default implementation of the PTP system timestamping in the SPI
core. This is not going to be satisfactory performance-wise, but should
at least increase the likelihood that SPI device drivers will use PTP
system timestamping in the future.
There are 3 entry points from the core towards the SPI controller
drivers:
- transfer_one: The driver is passed individual spi_transfers to
execute. This is the easiest to timestamp.
- transfer_one_message: The core passes the driver an entire spi_message
(a potential batch of spi_transfers). The core puts the same pre and
post timestamp to all transfers within a message. This is not ideal,
but nothing better can be done by default anyway, since the core has
no insight into how the driver batches the transfers.
- transfer: Like transfer_one_message, but for unqueued drivers (i.e.
the driver implements its own queue scheduling).
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/20190905010114.26718-3-olteanv@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
With the use of the barrier implied by barrier_data(), there is no need
for memzero_explicit() to be extern. Making it inline saves the overhead
of a function call, and allows the code to be reused in arch/*/purgatory
without having to duplicate the implementation.
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephan Mueller <smueller@chronox.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-crypto@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Fixes: 906a4bb97f5d ("crypto: sha256 - Use get/put_unaligned_be32 to get input, memzero_explicit")
Link: https://lkml.kernel.org/r/20191007220000.GA408752@rani.riverdale.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Expose maximum scatter entries per RDMA READ for optimal performance.
Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
Merge misc fixes from Andrew Morton:
"The usual shower of hotfixes.
Chris's memcg patches aren't actually fixes - they're mature but a few
niggling review issues were late to arrive.
The ocfs2 fixes are quite old - those took some time to get reviewer
attention.
Subsystems affected by this patch series: ocfs2, hotfixes, mm/memcg,
mm/slab-generic"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
mm, sl[ou]b: improve memory accounting
mm, memcg: make scan aggression always exclude protection
mm, memcg: make memory.emin the baseline for utilisation determination
mm, memcg: proportional memory.{low,min} reclaim
mm/vmpressure.c: fix a signedness bug in vmpressure_register_event()
mm/page_alloc.c: fix a crash in free_pages_prepare()
mm/z3fold.c: claim page in the beginning of free
kernel/sysctl.c: do not override max_threads provided by userspace
memcg: only record foreign writebacks with dirty pages when memcg is not disabled
mm: fix -Wmissing-prototypes warnings
writeback: fix use-after-free in finish_writeback_work()
mm/memremap: drop unused SECTION_SIZE and SECTION_MASK
panic: ensure preemption is disabled during panic()
fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc()
fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()
fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()
ocfs2: clear zero in unaligned direct IO
|
|
In most configurations, kmalloc() happens to return naturally aligned
(i.e. aligned to the block size itself) blocks for power of two sizes.
That means some kmalloc() users might unknowingly rely on that
alignment, until stuff breaks when the kernel is built with e.g.
CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then
developers have to devise workaround such as own kmem caches with
specified alignment [1], which is not always practical, as recently
evidenced in [2].
The topic has been discussed at LSF/MM 2019 [3]. Adding a
'kmalloc_aligned()' variant would not help with code unknowingly relying
on the implicit alignment. For slab implementations it would either
require creating more kmalloc caches, or allocate a larger size and only
give back part of it. That would be wasteful, especially with a generic
alignment parameter (in contrast with a fixed alignment to size).
Ideally we should provide to mm users what they need without difficult
workarounds or own reimplementations, so let's make the kmalloc()
alignment to size explicitly guaranteed for power-of-two sizes under all
configurations. What this means for the three available allocators?
* SLAB object layout happens to be mostly unchanged by the patch. The
implicitly provided alignment could be compromised with
CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
caches with alignment larger than unsigned long long. Practically on at
least x86 this includes kmalloc caches as they use cache line alignment,
which is larger than that. Still, this patch ensures alignment on all
arches and cache sizes.
* SLUB layout is also unchanged unless redzoning is enabled through
CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
With this patch, explicit alignment is guaranteed with redzoning as
well. This will result in more memory being wasted, but that should be
acceptable in a debugging scenario.
* SLOB has no implicit alignment so this patch adds it explicitly for
kmalloc(). The potential downside is increased fragmentation. While
pathological allocation scenarios are certainly possible, in my testing,
after booting a x86_64 kernel+userspace with virtme, around 16MB memory
was consumed by slab pages both before and after the patch, with
difference in the noise.
[1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
[2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
[3] https://lwn.net/Articles/787740/
[akpm@linux-foundation.org: documentation fixlet, per Matthew]
Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: David Sterba <dsterba@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch is an incremental improvement on the existing
memory.{low,min} relative reclaim work to base its scan pressure
calculations on how much protection is available compared to the current
usage, rather than how much the current usage is over some protection
threshold.
This change doesn't change the experience for the user in the normal
case too much. One benefit is that it replaces the (somewhat arbitrary)
100% cutoff with an indefinite slope, which makes it easier to ballpark
a memory.low value.
As well as this, the old methodology doesn't quite apply generically to
machines with varying amounts of physical memory. Let's say we have a
top level cgroup, workload.slice, and another top level cgroup,
system-management.slice. We want to roughly give 12G to
system-management.slice, so on a 32GB machine we set memory.low to 20GB
in workload.slice, and on a 64GB machine we set memory.low to 52GB.
However, because these are relative amounts to the total machine size,
while the amount of memory we want to generally be willing to yield to
system.slice is absolute (12G), we end up putting more pressure on
system.slice just because we have a larger machine and a larger workload
to fill it, which seems fairly unintuitive. With this new behaviour, we
don't end up with this unintended side effect.
Previously the way that memory.low protection works is that if you are
50% over a certain baseline, you get 50% of your normal scan pressure.
This is certainly better than the previous cliff-edge behaviour, but it
can be improved even further by always considering memory under the
currently enforced protection threshold to be out of bounds. This means
that we can set relatively low memory.low thresholds for variable or
bursty workloads while still getting a reasonable level of protection,
whereas with the previous version we may still trivially hit the 100%
clamp. The previous 100% clamp is also somewhat arbitrary, whereas this
one is more concretely based on the currently enforced protection
threshold, which is likely easier to reason about.
There is also a subtle issue with the way that proportional reclaim
worked previously -- it promotes having no memory.low, since it makes
pressure higher during low reclaim. This happens because we base our
scan pressure modulation on how far memory.current is between memory.min
and memory.low, but if memory.low is unset, we only use the overage
method. In most cromulent configurations, this then means that we end
up with *more* pressure than with no memory.low at all when we're in low
reclaim, which is not really very usable or expected.
With this patch, memory.low and memory.min affect reclaim pressure in a
more understandable and composable way. For example, from a user
standpoint, "protected" memory now remains untouchable from a reclaim
aggression standpoint, and users can also have more confidence that
bursty workloads will still receive some amount of guaranteed
protection.
Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Roman points out that when when we do the low reclaim pass, we scale the
reclaim pressure relative to position between 0 and the maximum
protection threshold.
However, if the maximum protection is based on memory.elow, and
memory.emin is above zero, this means we still may get binary behaviour
on second-pass low reclaim. This is because we scale starting at 0, not
starting at memory.emin, and since we don't scan at all below emin, we
end up with cliff behaviour.
This should be a fairly uncommon case since usually we don't go into the
second pass, but it makes sense to scale our low reclaim pressure
starting at emin.
You can test this by catting two large sparse files, one in a cgroup
with emin set to some moderate size compared to physical RAM, and
another cgroup without any emin. In both cgroups, set an elow larger
than 50% of physical RAM. The one with emin will have less page
scanning, as reclaim pressure is lower.
Rebase on top of and apply the same idea as what was applied to handle
cgroup_memory=disable properly for the original proportional patch
http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm,
memcg: Handle cgroup_disable=memory when getting memcg protection").
Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Suggested-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
cgroup v2 introduces two memory protection thresholds: memory.low
(best-effort) and memory.min (hard protection). While they generally do
what they say on the tin, there is a limitation in their implementation
that makes them difficult to use effectively: that cliff behaviour often
manifests when they become eligible for reclaim. This patch implements
more intuitive and usable behaviour, where we gradually mount more
reclaim pressure as cgroups further and further exceed their protection
thresholds.
This cliff edge behaviour happens because we only choose whether or not
to reclaim based on whether the memcg is within its protection limits
(see the use of mem_cgroup_protected in shrink_node), but we don't vary
our reclaim behaviour based on this information. Imagine the following
timeline, with the numbers the lruvec size in this zone:
1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
scanned. (?!)
* Of course, we won't usually scan all available pages in the zone even
without this patch because of scan control priority, over-reclaim
protection, etc. However, as shown by the tests at the end, these
techniques don't sufficiently throttle such an extreme change in input,
so cliff-like behaviour isn't really averted by their existence alone.
Here's an example of how this plays out in practice. At Facebook, we are
trying to protect various workloads from "system" software, like
configuration management tools, metric collectors, etc (see this[0] case
study). In order to find a suitable memory.low value, we start by
determining the expected memory range within which the workload will be
comfortable operating. This isn't an exact science -- memory usage deemed
"comfortable" will vary over time due to user behaviour, differences in
composition of work, etc, etc. As such we need to ballpark memory.low,
but doing this is currently problematic:
1. If we end up setting it too low for the workload, it won't have
*any* effect (see discussion above). The group will receive the full
weight of reclaim and won't have any priority while competing with the
less important system software, as if we had no memory.low configured
at all.
2. Because of this behaviour, we end up erring on the side of setting
it too high, such that the comfort range is reliably covered. However,
protected memory is completely unavailable to the rest of the system,
so we might cause undue memory and IO pressure there when we *know* we
have some elasticity in the workload.
3. Even if we get the value totally right, smack in the middle of the
comfort zone, we get extreme jumps between no pressure and full
pressure that cause unpredictable pressure spikes in the workload due
to the current binary reclaim behaviour.
With this patch, we can set it to our ballpark estimation without too much
worry. Any undesirable behaviour, such as too much or too little reclaim
pressure on the workload or system will be proportional to how far our
estimation is off. This means we can set memory.low much more
conservatively and thus waste less resources *without* the risk of the
workload falling off a cliff if we overshoot.
As a more abstract technical description, this unintuitive behaviour
results in having to give high-priority workloads a large protection
buffer on top of their expected usage to function reliably, as otherwise
we have abrupt periods of dramatically increased memory pressure which
hamper performance. Having to set these thresholds so high wastes
resources and generally works against the principle of work conservation.
In addition, having proportional memory reclaim behaviour has other
benefits. Most notably, before this patch it's basically mandatory to set
memory.low to a higher than desirable value because otherwise as soon as
you exceed memory.low, all protection is lost, and all pages are eligible
to scan again. By contrast, having a gradual ramp in reclaim pressure
means that you now still get some protection when thresholds are exceeded,
which means that one can now be more comfortable setting memory.low to
lower values without worrying that all protection will be lost. This is
important because workingset size is really hard to know exactly,
especially with variable workloads, so at least getting *some* protection
if your workingset size grows larger than you expect increases user
confidence in setting memory.low without a huge buffer on top being
needed.
Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
assistance in thinking about how to make this work better.
In testing these changes, I intended to verify that:
1. Changes in page scanning become gradual and proportional instead of
binary.
To test this, I experimented stepping further and further down
memory.low protection on a workload that floats around 19G workingset
when under memory.low protection, watching page scan rates for the
workload cgroup:
+------------+-----------------+--------------------+--------------+
| memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
+------------+-----------------+--------------------+--------------+
| 21G | 0 | 0 | N/A |
| 17G | 867 | 3799 | 23% |
| 12G | 1203 | 3543 | 34% |
| 8G | 2534 | 3979 | 64% |
| 4G | 3980 | 4147 | 96% |
| 0 | 3799 | 3980 | 95% |
+------------+-----------------+--------------------+--------------+
As you can see, the test kernel (with a kernel containing this
patch) ramps up page scanning significantly more gradually than the
control kernel (without this patch).
2. More gradual ramp up in reclaim aggression doesn't result in
premature OOMs.
To test this, I wrote a script that slowly increments the number of
pages held by stress(1)'s --vm-keep mode until a production system
entered severe overall memory contention. This script runs in a highly
protected slice taking up the majority of available system memory.
Watching vmstat revealed that page scanning continued essentially
nominally between test and control, without causing forward reclaim
progress to become arrested.
[0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
[akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
[chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
disabled
In kdump kernel, memcg usually is disabled with 'cgroup_disable=memory'
for saving memory. Now kdump kernel will always panic when dump vmcore
to local disk:
BUG: kernel NULL pointer dereference, address: 0000000000000ab8
Oops: 0000 [#1] SMP NOPTI
CPU: 0 PID: 598 Comm: makedumpfile Not tainted 5.3.0+ #26
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 10/02/2018
RIP: 0010:mem_cgroup_track_foreign_dirty_slowpath+0x38/0x140
Call Trace:
__set_page_dirty+0x52/0xc0
iomap_set_page_dirty+0x50/0x90
iomap_write_end+0x6e/0x270
iomap_write_actor+0xce/0x170
iomap_apply+0xba/0x11e
iomap_file_buffered_write+0x62/0x90
xfs_file_buffered_aio_write+0xca/0x320 [xfs]
new_sync_write+0x12d/0x1d0
vfs_write+0xa5/0x1a0
ksys_write+0x59/0xd0
do_syscall_64+0x59/0x1e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
And this will corrupt the 1st kernel too with 'cgroup_disable=memory'.
Via the trace and with debugging, it is pointing to commit 97b27821b485
("writeback, memcg: Implement foreign dirty flushing") which introduced
this regression. Disabling memcg causes the null pointer dereference at
uninitialized data in function mem_cgroup_track_foreign_dirty_slowpath().
Fix it by returning directly if memcg is disabled, but not trying to
record the foreign writebacks with dirty pages.
Link: http://lkml.kernel.org/r/20190924141928.GD31919@MiWiFi-R3L-srv
Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing")
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|