Age | Commit message (Collapse) | Author |
|
Merge more updates from Andrew Morton:
"190 patches.
Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cgroup.kill is added which implements atomic killing of the whole
subtree.
Down the line, this should be able to replace the multiple userland
implementations of "keep killing till empty".
- PSI can now be turned off at boot time to avoid overhead for
configurations which don't care about PSI.
* 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: make per-cgroup pressure stall tracking configurable
cgroup: Fix kernel-doc
cgroup: inline cgroup_task_freeze()
tests/cgroup: test cgroup.kill
tests/cgroup: move cg_wait_for(), cg_prepare_for_wait()
tests/cgroup: use cgroup.kill in cg_killall()
docs/cgroup: add entry for cgroup.kill
cgroup: introduce cgroup.kill
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
- percpu chunk depopulation - depopulate backing pages for chunks with
empty pages when we exceed a global threshold without those pages.
This lets us reclaim a portion of memory that would previously be
lost until the full chunk would be freed (possibly never).
- memcg accounting cleanup - previously separate chunks were managed
for normal allocations and __GFP_ACCOUNT allocations. These are now
consolidated which cleans up the code quite a bit.
- a few misc clean ups for clang warnings
* 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu: optimize locking in pcpu_balance_workfn()
percpu: initialize best_upa variable
percpu: rework memcg accounting
mm, memcg: introduce mem_cgroup_kmem_disabled()
mm, memcg: mark cgroup_memory_nosocket, nokmem and noswap as __ro_after_init
percpu: make symbol 'pcpu_free_slot' static
percpu: implement partial chunk depopulation
percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
percpu: factor out pcpu_check_block_hint()
percpu: split __pcpu_balance_workfn()
percpu: fix a comment about the chunks ordering
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control updates from Linus Walleij:
"This is the bulk of pin control changes for the v5.14 kernel. Not so
much going on. No core changes, just drivers.
The most interesting would be that MIPS Ralink is migrating to pin
control and we have some bindings but not yet code for the Apple M1
pin controller.
New drivers:
- Last merge window we created a driver for the Ralink RT2880. We are
now moving the Ralink SoC pin control drivers out of the MIPS
architecture code and into the pin control subsystem. This concerns
RT288X, MT7620, RT305X, RT3883 and MT7621.
- Qualcomm SM6125 SoC pin control driver.
- Qualcomm spmi-gpio support for PM7325.
- Qualcomm spmi-mpp also handles PMI8994 (just a compatible string)
- Mediatek MT8365 SoC pin controller.
- New device HID for the AMD GPIO controller.
Improvements:
- Pin bias config support for a slew of Renesas pin controllers.
- Incremental improvements and non-urgent bug fixes to the Renesas
SoC drivers.
- Implement irq_set_wake on the AMD pin controller so we can wake up
from external pin events.
Misc:
- Devicetree bindings for the Apple M1 pin controller, we will
probably see a proper driver for this soon as well"
* tag 'pinctrl-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (54 commits)
pinctrl: ralink: rt305x: add missing include
pinctrl: stm32: check for IRQ MUX validity during alloc()
pinctrl: zynqmp: some code cleanups
drivers: qcom: pinctrl: Add pinctrl driver for sm6125
dt-bindings: pinctrl: qcom: sm6125: Document SM6125 pinctrl driver
dt-bindings: pinctrl: mcp23s08: add documentation for reset-gpios
pinctrl: mcp23s08: Add optional reset GPIO
pinctrl: mediatek: fix mode encoding
pinctrl: mcp23s08: Fix missing unlock on error in mcp23s08_irq()
pinctrl: bcm: Constify static pinmux_ops
pinctrl: bcm: Constify static pinctrl_ops
pinctrl: ralink: move RT288X SoC pinmux config into a new 'pinctrl-rt288x.c' file
pinctrl: ralink: move MT7620 SoC pinmux config into a new 'pinctrl-mt7620.c' file
pinctrl: ralink: move RT305X SoC pinmux config into a new 'pinctrl-rt305x.c' file
pinctrl: ralink: move RT3883 SoC pinmux config into a new 'pinctrl-rt3883.c' file
pinctrl: ralink: move MT7621 SoC pinmux config into a new 'pinctrl-mt7621.c' file
pinctrl: ralink: move ralink architecture pinmux header into the driver
pinctrl: single: config: enable the pin's input
pinctrl: mtk: Fix mt8365 Kconfig dependency
pinctrl: mcp23s08: fix race condition in irq handler
...
|
|
Pull rdma updates from Jason Gunthorpe:
"This contains a replacement driver for Intel iWarp hardware. This new
driver supports the old ethernet hardware and also newer chips that
can do ROCE.
Other than that, this contains the typical mix of patches:
- Driver updates and cleanups for bnxt_re, cxgb4, mlx4, and mlx5
- Many static checker driven code clean ups, including a wide
refcount_t conversion
- Several series for the hns driver, more HIP09 HW capabilities,
migration to new HW register manipulators, and code cleanups
- Minor fixes and improvements in srp, rts, and cm
- Improvements throughout for sysfs related code to use
DEVICE_ATTR_*, make the ib_port sysfs first-class, and overall use
sysfs APIs properly
- Intel's new irdma driver replacing i40iw
- rxe general clean ups and Memory Window support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (211 commits)
RDMA/core: Always release restrack object
RDMA/mlx5: Don't access NULL-cleared mpi pointer
RDMA/irdma: Fix potential overflow expression in irdma_prm_get_pbles
RDMA/irdma: Check contents of user-space irdma_mem_reg_req object
RDMA/rxe: Missing unlock on error in get_srq_wqe()
RDMA/cma: Fix rdma_resolve_route() memory leak
RDMA/core/sa_query: Remove unused argument
RDMA/cma: Fix incorrect Packet Lifetime calculation
RDMA/cma: Protect RMW with qp_mutex
RDMA/cma: Remove unnecessary INIT->INIT transition
RDMA/hns: Add window selection field of congestion control
RDMA/hfi1: Remove use of kmap()
RDMA/irdma: Remove use of kmap()
RDMA/bnxt_re: Fix uninitialized struct bit field rsvd1
IB/isert: Align target max I/O size to initiator size
RDMA/hns: Fix incorrect vlan enable bit in QPC
MAINTAINERS: Update Broadcom RDMA maintainers
RDMA/irdma: Use the queried port attributes
RDMA/rxe: Fix redundant skb_put_zero
RDMA/rxe: Fix extra copy in prepare_ack_packet
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk updates from Stephen Boyd:
"This round has a diffstat dominated by Qualcomm clk drivers. Honestly
though that's just a bunch of data so the diffstat reflects that.
Looking beyond that there's just a bunch of updates all around in
various clk drivers. Renesas and NXP (for i.MX) are two SoC vendors
that have a lot of patches in here.
Overall the driver changes look to be mostly enabling more clks and
non-critical fixes that we could hold until the next merge window.
I'm especially excited about the series from Arnd that graduates
clkdev to be the only implementation of clk_get() and clk_put().
That's a good step in the right direction to migreate eveerything over
to the common clk framework. Now we don't have to worry about clkdev
specific details, they're just part of the clk API now.
Core:
- clkdev is now the only option, i.e. clk_get()/clk_put() is
implemented in only one place in the kernel instead of in
drivers/clk/clkdev.c and in architectures that want their own
implementation
New Drivers:
- Texas Instruments' LMK04832 Ultra Low-Noise JESD204B Compliant
Clock Jitter Cleaner With Dual Loop PLLs
- Qualcomm MDM9607 GCC
- Qualcomm SC8180X display clks
- Qualcomm SM6125 GCC
- Qualcomm SM8250 CAMCC (camera)
- Renesas RZ/G2L SoC
- Hisilicon hi3559A SoC
Updates:
- Stop using clock-output-names in ST clk drivers (yay!)
- Support secure mode of STM32MP1 SoCs
- Improve clock support for Actions S500 SoC
- duty cycle setting support on qcom clks
- Add TI am33xx spread spectrum clock support
- Use determine_rate() for the Amlogic pll ops instead of
round_rate()
- Restrict Amlogic gp0/1 and audio plls range on g12a/sm1
- Improve Amlogic axg-audio controller error on deferral
- Add NNA clocks on Amlogic g12a
- Reduce memory footprint of Rockchip PLL rate tables
- A fix for the newly added Rockchip rk3568 clk driver
- Exported clock for the newly added Rockchip video decoder
- Remove audio ipg clock from i.MX8MP
- Remove deprecated legacy clock binding for i.MX SCU clock driver
- Use common clk-imx8qxp for both i.MX8QXP and i.MX8QM
- Add multiple clocks to clk-imx8qxp driver (enet, hdmi, lcdif,
audio, parallel interface)
- Add dedicated clock ops for i.MX paralel interface
- Different fixes for clocks controlled by ATF on i.MX SoCs
- Add A53/A72 frequency scaling support i.MX clk-scu driver
- Add special case for DCSS clock on suspend for i.MX clk-scu driver
- Add parent save/restore on suspend/resume to i.MX clk-scu driver
- Skip runtime PM enablement for CPU clocks in i.MX clk-scu driver
- Remove the sys1_pll/sys2_pll clock gates for i.MX8MQ and their
bindings
- Tegra clk driver no longer deasserts resets on clk_enable as it
gets in the way of certain power-up sequences
- Fix compile testing for Tegra clk driver
- One patch to fix a divider on the Allwinner v3s Audio PLL
- Add support for CPU core clock boost modes on Renesas R-Car Gen3
- Add ISPCS (Image Signal Processor) clocks on Renesas R-Car V3U
- Switch SH/R-Mobile and R-Car "DIV6" clocks to .determine_rate() and
improve support for multiple parents
- Switch Renesas RZ/N1 divider clocks to .determine_rate()
- Add ZA2 (Audio Clock Generator) clock on Renesas R-Car D3
- Convert ar7 to common clk framework
- Convert ralink to common clk framework"
* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (161 commits)
clk: zynqmp: Handle divider specific read only flag
clk: zynqmp: Use firmware specific mux clock flags
clk: zynqmp: Use firmware specific divider clock flags
clk: zynqmp: Use firmware specific common clock flags
clk: lmk04832: Use of match table
clk: lmk04832: Depend on SPI
clk: stm32mp1: new compatible for secure RCC support
dt-bindings: clock: stm32mp1 new compatible for secure rcc
dt-bindings: reset: add MCU HOLD BOOT ID for SCMI reset domains on stm32mp15
dt-bindings: reset: add IDs for SCMI reset domains on stm32mp15
dt-bindings: clock: add IDs for SCMI clocks on stm32mp15
reset: stm32mp1: remove stm32mp1 reset
clk: hisilicon: Add clock driver for hi3559A SoC
dt-bindings: Document the hi3559a clock bindings
clk: si5341: Add sysfs properties to allow checking/resetting device faults
clk: si5341: Add silabs,iovdd-33 property
clk: si5341: Add silabs,xaxb-ext-clk property
clk: si5341: Allow different output VDD_SEL values
clk: si5341: Update initialization magic
clk: si5341: Check for input clock presence and PLL lock on startup
...
|
|
Pull drm updates from Dave Airlie:
"Highlights:
- AMD enables two more GPUs, with resulting header files
- i915 has started to move to TTM for discrete GPU and enable DG1
discrete GPU support (not by default yet)
- new HyperV drm driver
- vmwgfx adds arm64 support
- TTM refactoring ongoing
- 16bpc display support for AMD hw
Otherwise it's just the usual insane amounts of work all over the
place in lots of drivers and the core, as mostly summarised below:
Core:
- mark AGP ioctls as legacy
- disable force probing for non-master clients
- HDR metadata property helpers
- HDMI infoframe signal colorimetry support
- remove drm_device.pdev pointer
- remove DRM_KMS_FB_HELPER config option
- remove drm_pci_alloc/free
- drm_err_*/drm_dbg_* helpers
- use drm driver names for fbdev
- leaked DMA handle fix
- 16bpc fixed point format fourcc
- add prefetching memcpy for WC
- Documentation fixes
aperture:
- add aperture ownership helpers
dp:
- aux fixes
- downstream 0 port handling
- use extended base receiver capability DPCD
- Rename DP_PSR_SELECTIVE_UPDATE to better mach eDP spec
- mst: use khz as link rate during init
- VCPI fixes for StarTech hub
ttm:
- provide tt_shrink file via debugfs
- warn about freeing pinned BOs
- fix swapping error handling
- move page alignment into BO
- cleanup ttm_agp_backend
- add ttm_sys_manager
- don't override vm_ops
- ttm_bo_mmap removed
- make ttm_resource base of all managers
- remove VM_MIXEDMAP usage
panel:
- sysfs_emit support
- simple: runtime PM support
- simple: power up panel when reading EDID + caching
bridge:
- MHDP8546: HDCP support + DT bindings
- MHDP8546: Register DP AUX channel with userspace
- TI SN65DSI83 + SN65DSI84: add driver
- Sil8620: Fix module dependencies
- dw-hdmi: make CEC driver loading optional
- Ti-sn65dsi86: refclk fixes, subdrivers, runtime pm
- It66121: Add driver + DT bindings
- Adv7511: Support I2S IEC958 encoding
- Anx7625: fix power-on delay
- Nwi-dsi: Modesetting fixes; Cleanups
- lt6911: add missing MODULE_DEVICE_TABLE
- cdns: fix PM reference leak
hyperv:
- add new DRM driver for HyperV graphics
efifb:
- non-PCI device handling fixes
i915:
- refactor IP/device versioning
- XeLPD Display IP preperation work
- ADL-P enablement patches
- DG1 uAPI behind BROKEN
- disable mmap ioctl for discerte GPUs
- start enabling HuC loading for Gen12+
- major GuC backend rework for new platforms
- initial TTM support for Discrete GPUs
- locking rework for TTM prep
- use correct max source link rate for eDP
- %p4cc format printing
- GLK display fixes
- VLV DSI panel power fixes
- PSR2 disabled for RKL and ADL-S
- ACPI _DSM invalid access fixed
- DMC FW path abstraction
- ADL-S PCI ID update
- uAPI headers converted to kerneldoc
- initial LMEM support for DG1
- x86/gpu: add Jasperlake to gen11 early quirks
amdgpu:
- Aldebaran updates + initial SR-IOV
- new GPU: Beige Goby and Yellow Carp support
- more LTTPR display work
- Vangogh updates
- SDMA 5.x GCR fixes
- PCIe ASPM support
- Renoir TMZ enablement
- initial multiple eDP panel support
- use fdinfo to track devices/process info
- pin/unpin TTM fixes
- free resource on fence usage query
- fix fence calculation
- fix hotunplug/suspend issues
- GC/MM register access macro cleanup for SR-IOV
- W=1 fixes
- ACPI ATCS/ATIF handling rework
- 16bpc fixed point format support
- Initial smartshift support
- RV/PCO power tuning fixes
- new INFO query for additional vbios info
amdkfd:
- SR-IOV aldebaran support
- HMM SVM support
radeon:
- SMU regression fixes
- Oland flickering fix
vmwgfx:
- enable console with fbdev emulation
- fix cpu updates of coherent multisample surfaces
- remove reservation semaphore
- add initial SVGA3 support
- support arm64
msm:
- devcoredump support for display errors
- dpu/dsi: yaml bindings conversion
- mdp5: alpha/blend_mode/zpos support
- a6xx: cached coherent buffer support
- gpu iova fault improvement
- a660 support
rockchip:
- RK3036 win1 scaling support
- RK3066/3188 missing register support
- RK3036/3066/3126/3188 alpha support
mediatek:
- MT8167 HDMI support
- MT8183 DPI dual edge support
tegra:
- fixed YUV support/scaling on Tegra186+
ast:
- use pcim_iomap
- fix DP501 EDID
bochs:
- screen blanking support
etnaviv:
- export more GPU ID values to userspace
- add HWDB entry for GPU on i.MX8MP
- rework linear window calcs
exynos:
- pm runtime changes
imx:
- Annotate dma_fence critical section
- fix PRG modifiers after drmm conversion
- Add 8 pixel alignment fix for 1366x768
- fix YUV advertising
- add color properties
ingenic:
- IPU planes fix
panfrost:
- Mediatek MT8183 support + DT bindings
- export AFBC_FEATURES register to userspace
simpledrm:
- %pr for printing resources
nouveau:
- pin/unpin TTM fixes
qxl:
- unpin shadow BO
virtio:
- create dumb BOs as guest blob
vkms:
- drmm_universal_plane_alloc
- add XRGB plane composition
- overlay support"
* tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm: (1570 commits)
drm/i915: Reinstate the mmap ioctl for some platforms
drm/i915/dsc: abstract helpers to get bigjoiner primary/secondary crtc
Revert "drm/msm/mdp5: provide dynamic bandwidth management"
drm/msm/mdp5: provide dynamic bandwidth management
drm/msm/mdp5: add perf blocks for holding fudge factors
drm/msm/mdp5: switch to standard zpos property
drm/msm/mdp5: add support for alpha/blend_mode properties
drm/msm/mdp5: use drm_plane_state for pixel blend mode
drm/msm/mdp5: use drm_plane_state for storing alpha value
drm/msm/mdp5: use drm atomic helpers to handle base drm plane state
drm/msm/dsi: do not enable PHYs when called for the slave DSI interface
drm/msm: Add debugfs to trigger shrinker
drm/msm/dpu: Avoid ABBA deadlock between IRQ modules
drm/msm: devcoredump iommu fault support
iommu/arm-smmu-qcom: Add stall support
drm/msm: Improve the a6xx page fault handler
iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info
iommu/arm-smmu: Add support for driver IOMMU fault handlers
drm/msm: export hangcheck_period in debugfs
drm/msm/a6xx: add support for Adreno 660 GPU
...
|
|
Pull io_uring updates from Jens Axboe:
- Multi-queue iopoll improvement (Fam)
- Allow configurable io-wq CPU masks (me)
- renameat/linkat tightening (me)
- poll re-arm improvement (Olivier)
- SQPOLL race fix (Olivier)
- Cancelation unification (Pavel)
- SQPOLL cleanups (Pavel)
- Enable file backed buffers for shmem/memfd (Pavel)
- A ton of cleanups and performance improvements (Pavel)
- Followup and misc fixes (Colin, Fam, Hao, Olivier)
* tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block: (83 commits)
io_uring: code clean for kiocb_done()
io_uring: spin in iopoll() only when reqs are in a single queue
io_uring: pre-initialise some of req fields
io_uring: refactor io_submit_flush_completions
io_uring: optimise hot path restricted checks
io_uring: remove not needed PF_EXITING check
io_uring: mainstream sqpoll task_work running
io_uring: refactor io_arm_poll_handler()
io_uring: reduce latency by reissueing the operation
io_uring: add IOPOLL and reserved field checks to IORING_OP_UNLINKAT
io_uring: add IOPOLL and reserved field checks to IORING_OP_RENAMEAT
io_uring: refactor io_openat2()
io_uring: simplify struct io_uring_sqe layout
io_uring: update sqe layout build checks
io_uring: fix code style problems
io_uring: refactor io_sq_thread()
io_uring: don't change sqpoll creds if not needed
io_uring: Create define to modify a SQPOLL parameter
io_uring: Fix race condition when sqp thread goes to sleep
io_uring: improve in tctx_task_work() resubmission
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc fs updates from Jan Kara:
"The new quotactl_fd() syscall (remake of quotactl_path() syscall that
got introduced & disabled in 5.13 cycle), and couple of udf, reiserfs,
isofs, and writeback fixes and cleanups"
* tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
writeback: fix obtain a reference to a freeing memcg css
quota: remove unnecessary oom message
isofs: remove redundant continue statement
quota: Wire up quotactl_fd syscall
quota: Change quotactl_path() systcall to an fd-based one
reiserfs: Remove unneed check in reiserfs_write_full_page()
udf: Fix NULL pointer dereference in udf_symlink function
reiserfs: add check for invalid 1st journal block
|
|
Until now no compiler supported an attribute to disable coverage
instrumentation as used by KCOV.
To work around this limitation on x86, noinstr functions have their
coverage instrumentation turned into nops by objtool. However, this
solution doesn't scale automatically to other architectures, such as
arm64, which are migrating to use the generic entry code.
Clang [1] and GCC [2] have added support for the attribute recently.
[1] https://github.com/llvm/llvm-project/commit/280333021e9550d80f5c1152a34e33e81df1e178
[2] https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=cec4d4a6782c9bd8d071839c50a239c49caca689
The changes will appear in Clang 13 and GCC 12.
Add __no_sanitize_coverage for both compilers, and add it to noinstr.
Note: In the Clang case, __has_feature(coverage_sanitizer) is only true if
the feature is enabled, and therefore we do not require an additional
defined(CONFIG_KCOV) (like in the GCC case where __has_attribute(..) is
always true) to avoid adding redundant attributes to functions if KCOV is
off. That being said, compilers that support the attribute will not
generate errors/warnings if the attribute is redundantly used; however,
where possible let's avoid it as it reduces preprocessed code size and
associated compile-time overheads.
[elver@google.com: Implement __has_feature(coverage_sanitizer) in Clang]
Link: https://lkml.kernel.org/r/20210527162655.3246381-1-elver@google.com
[elver@google.com: add comment explaining __has_feature() in Clang]
Link: https://lkml.kernel.org/r/20210527194448.3470080-1-elver@google.com
Link: https://lkml.kernel.org/r/20210525175819.699786-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
won't be abandoned
Currently we handle SS_AUTODISARM as soon as we have stored the altstack
settings into sigframe - that's the point when we have set the things up
for eventual sigreturn to restore the old settings. And if we manage to
set the sigframe up (we are not done with that yet), everything's fine.
However, in case of failure we end up with sigframe-to-be abandoned and
SIGSEGV force-delivered. And in that case we end up with inconsistent
rules - late failures have altstack reset, early ones do not.
It's trivial to get consistent behaviour - just handle SS_AUTODISARM once
we have set the sigframe up and are committed to entering the handler,
i.e. in signal_delivered().
Link: https://lore.kernel.org/lkml/20200404170604.GN23230@ZenIV.linux.org.uk/
Link: https://github.com/ClangBuiltLinux/linux/issues/876
Link: https://lkml.kernel.org/r/20210422230846.1756380-1-ndesaulniers@google.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
free_insn_page() in x86 and s390 is same with the common weak function in
kernel/kprobes.c. Plus, the comment "Recover page to RW mode before
releasing it" in x86 seems insensible to be there since resetting mapping
is done by common code in vfree() of module_memfree(). So drop these two
duplicated strong functions and related comment, then mark the common one
in kernel/kprobes.c strong.
Link: https://lkml.kernel.org/r/20210608065736.32656-1-song.bao.hua@hisilicon.com
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Qi Liu <liuqi115@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out kstrtox() and
simple_strtox() helpers.
At the same time convert users in header and lib folders to use new
header. Though for time being include new header back to kernel.h to
avoid twisted indirected includes for existing users.
[andy.shevchenko@gmail.com: fix documentation references]
Link: https://lkml.kernel.org/r/20210615220003.377901-1-andy.shevchenko@gmail.com
Link: https://lkml.kernel.org/r/20210611185815.44103-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Francis Laniel <laniel_francis@privacyrequired.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Kars Mulder <kerneldev@karsmulder.nl>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix some spelling mistakes in comments:
flaged ==> flagged
bufer ==> buffer
multipler ==> multiplier
MULTIPLER ==> MULTIPLIER
leaset ==> least
chnage ==> change
Link: https://lkml.kernel.org/r/20210604074401.12198-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There are no more users of the seq_escape_mem_ascii() followed by
string_escape_mem_ascii().
Remove them for good.
Link: https://lkml.kernel.org/r/20210504180819.73127-16-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In some cases we want to escape characters from NULL-terminated strings.
Add seq_escape_str() as replica of string_escape_str() for that.
Link: https://lkml.kernel.org/r/20210504180819.73127-13-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Introduce seq_escape_mem() to allow users to pass additional parameters to
string_escape_mem().
Link: https://lkml.kernel.org/r/20210504180819.73127-12-andriy.shevchenko@linux.intel.com
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We have got new flags and hence new features of string_escape_mem().
Add test cases for that.
Link: https://lkml.kernel.org/r/20210504180819.73127-10-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Introduce a new flag to append additional characters, passed in 'only'
parameter, to be escaped if they fall in the corresponding class.
Link: https://lkml.kernel.org/r/20210504180819.73127-7-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Some users may want to have an ASCII based filter for printable only
characters, provided by conjunction of isascii() and isprint() functions.
Here is the addition of a such.
Link: https://lkml.kernel.org/r/20210504180819.73127-6-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Some users may want to have an ASCII based filter, provided by isascii()
function. Here is the addition of a such.
Link: https://lkml.kernel.org/r/20210504180819.73127-5-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "lib/string_helpers: get rid of ugly *_escape_mem_ascii()", v3.
Get rid of ugly *_escape_mem_ascii() API since it's not flexible and has
the only single user. Provide better approach based on usage of the
string_escape_mem() with appropriate flags.
Test cases has been expanded accordingly to cover new functionality.
This patch (of 15):
Switch to use BIT() macro for flag definitions. No changes implied.
Link: https://lkml.kernel.org/r/20210504180819.73127-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20210504180819.73127-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out panic and
oops helpers.
There are several purposes of doing this:
- dropping dependency in bug.h
- dropping a loop by moving out panic_notifier.h
- unload kernel.h from something which has its own domain
At the same time convert users tree-wide to use new headers, although for
the time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.
[akpm@linux-foundation.org: thread_info.h needs limits.h]
[andriy.shevchenko@linux.intel.com: ia64 fix]
Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Co-developed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The ascii85.h is user of exactly two headers, i.e. math.h and types.h.
There is no need to carry on entire kernel.h.
Link: https://lkml.kernel.org/r/20210611185915.44181-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Some devices require exclusive write access to shared virtual memory (SVM)
ranges to perform atomic operations on that memory. This requires CPU
page tables to be updated to deny access whilst atomic operations are
occurring.
In order to do this introduce a new swap entry type
(SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for exclusive
access by a device all page table mappings for the particular range are
replaced with device exclusive swap entries. This causes any CPU access
to the page to result in a fault.
Faults are resovled by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive access
to the region.
Walking of the page tables to find the target pages is handled by
get_user_pages() rather than a direct page table walk. A direct page
table walk similar to what migrate_vma_collect()/unmap() does could also
have been utilised. However this resulted in more code similar in
functionality to what get_user_pages() provides as page faulting is
required to make the PTEs present and to break COW.
[dan.carpenter@oracle.com: fix signedness bug in make_device_exclusive_range()]
Link: https://lkml.kernel.org/r/YNIz5NVnZ5GiZ3u1@mwanda
Link: https://lkml.kernel.org/r/20210616105937.23201-8-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
MMU notifier ranges have a migrate_pgmap_owner field which is used by
drivers to store a pointer. This is subsequently used by the driver
callback to filter MMU_NOTIFY_MIGRATE events. Other notifier event types
can also benefit from this filtering, so rename the 'migrate_pgmap_owner'
field to 'owner' and create a new notifier initialisation function to
initialise this field.
Link: https://lkml.kernel.org/r/20210616105937.23201-6-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Migration is currently implemented as a mode of operation for
try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.
However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one() and thus splitting it into a separate
function reduces the complexity of try_to_unmap_one() making it more
readable.
Several simplifications can also be made in try_to_migrate_one() based on
the following observations:
- All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
- No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
- No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.
TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page. This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c - either try_to_migrate()
for PageAnon or try_to_unmap().
Link: https://lkml.kernel.org/r/20210616105937.23201-5-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used in
different combinations.
TTU_MUNLOCK is one such flag. However it is exclusively used by
try_to_munlock() which specifies no other flags. Therefore rather than
overload try_to_unmap_one() with unrelated behaviour split this out into
it's own function and remove the flag.
Link: https://lkml.kernel.org/r/20210616105937.23201-4-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Both migration and device private pages use special swap entries that are
manipluated by a range of inline functions. The arguments to these are
somewhat inconsistent so rework them to remove flag type arguments and to
make the arguments similar for both read and write entry creation.
Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "Add support for SVM atomics in Nouveau", v11.
Introduction
============
Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory. To support atomic operations to
a shared virtual memory page such a device needs access to that page which
is exclusive of the CPU. This series introduces a mechanism to
temporarily unmap pages granting exclusive access to a device.
These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .
Implementation
==============
Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.
Restoring the entry triggers calls to MMU notifers which allows a device
driver to revoke the atomic access permission from the GPU prior to the
CPU finalising the entry.
Patches
=======
Patches 1 & 2 refactor existing migration and device private entry
functions.
Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().
Patch 5 renames some existing code but does not introduce functionality.
Patch 6 is a small clean-up to swap entry handling in copy_pte_range().
Patch 7 contains the bulk of the implementation for device exclusive
memory.
Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.
Patch 9 is a cleanup for the Nouveau SVM implementation.
Patch 10 contains the implementation of atomic access for the Nouveau
driver.
Testing
=======
This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic.
Without this series the test fails as there is no way of write-protecting
the page mapping which results in the device clobbering CPU writes. For
reference the test is available at
https://ozlabs.org/~apopple/opencl_svm_atomics/
Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.
This patch (of 10):
Remove multiple similar inline functions for dealing with different types
of special swap entries.
Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct page
for each swap entry type use a common function pfn_swap_entry_to_page().
Also open-code the various entry_to_pfn() functions as this results is
shorter code that is easier to understand.
Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently most platforms define pmd_pgtable() as pmd_page() duplicating
the same code all over. Instead just define a default value i.e
pmd_page() for pmd_pgtable() and let platforms override when required via
<asm/pgtable.h>. All the existing platform that override pmd_pgtable()
have been moved into their respective <asm/pgtable.h> header in order to
precede before the new generic definition. This makes it much cleaner
with reduced code.
Link: https://lkml.kernel.org/r/1623646133-20306-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
make W=1 generates the following warning in mm/workingset.c for allnoconfig
mm/workingset.c: In function `unpack_shadow':
mm/workingset.c:201:15: warning: variable `nid' set but not used [-Wunused-but-set-variable]
int memcgid, nid;
^~~
On FLATMEM, NODE_DATA returns a global pglist_data without dereferencing
nid. Make the helper an inline function to suppress the warning, add type
checking and to apply any side-effects in the parameter list.
Link: https://lkml.kernel.org/r/20210520084809.8576-15-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
make W=1 generates the following warning in page_mapping() for allnoconfig
mm/util.c:700:15: warning: variable `entry' set but not used [-Wunused-but-set-variable]
swp_entry_t entry;
^~~~~
swap_address is a #define on !CONFIG_SWAP configurations. Make the helper
an inline function to suppress the warning, add type checking and to apply
any side-effects in the parameter list.
Link: https://lkml.kernel.org/r/20210520084809.8576-12-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix some spelling mistakes in comments:
each having differents usage ==> each has a different usage
statments ==> statements
adresses ==> addresses
aggresive ==> aggressive
datas ==> data
posion ==> poison
higer ==> higher
precisly ==> precisely
wont ==> won't
We moves tha ==> We move the
endianess ==> endianness
Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently most platforms define FIRST_USER_ADDRESS as 0UL duplication the
same code all over. Instead just define a generic default value (i.e 0UL)
for FIRST_USER_ADDRESS and let the platforms override when required. This
makes it much cleaner with reduced code.
The default FIRST_USER_ADDRESS here would be skipped in <linux/pgtable.h>
when the given platform overrides its value via <asm/pgtable.h>.
Link: https://lkml.kernel.org/r/1620615725-24623-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Acked-by: Guo Ren <guoren@kernel.org> [csky]
Acked-by: Stafford Horne <shorne@gmail.com> [openrisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com> [RISC-V]
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We moves tha -> We move that in mm/swap.c
statments -> statements in include/linux/mm.h
Link: https://lkml.kernel.org/r/20210509063444.GA24745@hyeyoo
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
I. Background: Sparse Memory Mappings
When we manage sparse memory mappings dynamically in user space - also
sometimes involving MAP_NORESERVE - we want to dynamically populate/
discard memory inside such a sparse memory region. Example users are
hypervisors (especially implementing memory ballooning or similar
technologies like virtio-mem) and memory allocators. In addition, we want
to fail in a nice way (instead of generating SIGBUS) if populating does
not succeed because we are out of backend memory (which can happen easily
with file-based mappings, especially tmpfs and hugetlbfs).
While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
reliably discarding memory for most mapping types, there is no generic
approach to populate page tables and preallocate memory.
Although mmap() supports MAP_POPULATE, it is not applicable to the concept
of sparse memory mappings, where we want to populate/discard dynamically
and avoid expensive/problematic remappings. In addition, we never
actually report errors during the final populate phase - it is best-effort
only.
fallocate() can be used to preallocate file-based memory and fail in a
safe way. However, it cannot really be used for any private mappings on
anonymous files via memfd due to COW semantics. In addition, fallocate()
does not actually populate page tables, so we still always get pagefaults
on first access - which is sometimes undesired (i.e., real-time workloads)
and requires real prefaulting of page tables, not just a preallocation of
backend storage. There might be interesting use cases for sparse memory
regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy
as it does not prefault page tables.
II. On preallcoation/prefaulting from user space
Because we don't have a proper interface, what applications (like QEMU and
databases) end up doing is touching (i.e., reading+writing one byte to not
overwrite existing data) all individual pages.
However, that approach
1) Can result in wear on storage backing, because we end up reading/writing
each page; this is especially a problem for dax/pmem.
2) Can result in mmap_sem contention when prefaulting via multiple
threads.
3) Requires expensive signal handling, especially to catch SIGBUS in case
of hugetlbfs/shmem/file-backed memory. For example, this is
problematic in hypervisors like QEMU where SIGBUS handlers might already
be used by other subsystems concurrently to e.g, handle hardware errors.
"Simply" doing preallocation concurrently from other thread is not that
easy.
III. On MADV_WILLNEED
Extending MADV_WILLNEED is not an option because
1. It would change the semantics: "Expect access in the near future." and
"might be a good idea to read some pages" vs. "Definitely populate/
preallocate all memory and definitely fail on errors.".
2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
don't want populate/prealloc semantics. They treat this rather as a hint
to give a little performance boost without too much overhead - and don't
expect that a lot of memory might get consumed or a lot of time
might be spent.
IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE
Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by
MAP_POPULATE, with the following semantics:
1. MADV_POPULATE_READ can be used to prefault page tables just like
manually reading each individual page. This will not break any COW
mappings. The shared zero page might get mapped and no backend storage
might get preallocated -- allocation might be deferred to
write-fault time. Especially shared file mappings require an explicit
fallocate() upfront to actually preallocate backend memory (blocks in
the file system) in case the file might have holes.
2. If MADV_POPULATE_READ succeeds, all page tables have been populated
(prefaulted) readable once.
3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
prefault page tables just like manually writing (or
reading+writing) each individual page. This will break any COW
mappings -- e.g., the shared zeropage is never populated.
4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
(prefaulted) writable once.
5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
mappings marked with VM_PFNMAP and VM_IO. Also, proper access
permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
mapping is encountered, madvise() fails with -EINVAL.
6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
might have been populated.
7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
when encountering a HW poisoned page in the range.
8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
cannot protect from the OOM (Out Of Memory) handler killing the
process.
While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
preallocate memory and prefault page tables for VMs), one issue is that
whenever we prefault pages writable, the pages have to be marked dirty,
because the CPU could dirty them any time. while not a real problem for
hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
page will be marked dirty and has to be written back later when evicting.
MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
mapping from backend storage without marking it dirty, such that eviction
won't have to write it back. As discussed above, shared file mappings
might require an explciit fallocate() upfront to achieve
preallcoation+prepopulation.
Although sparse memory mappings are the primary use case, this will also
be useful for other preallocate/prefault use cases where MAP_POPULATE is
not desired or the semantics of MAP_POPULATE are not sufficient: as one
example, QEMU users can trigger preallocation/prefaulting of guest RAM
after the mapping was created -- and don't want errors to be silently
suppressed.
Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
however, the main motivation back than was performance improvements --
which should also still be the case.
V. Single-threaded performance comparison
I did a short experiment, prefaulting page tables on completely *empty
mappings/files* and repeated the experiment 10 times. The results
correspond to the shortest execution time. In general, the performance
benefit for huge pages is negligible with small mappings.
V.1: Private mappings
POPULATE_READ and POPULATE_WRITE is fastest. Note that
Reading/POPULATE_READ will populate the shared zeropage where applicable
-- which result in short population times.
The fastest way to allocate backend storage (here: swap or huge pages) and
prefault page tables is POPULATE_WRITE.
V.2: Shared mappings
fallocate() is fastest, however, doesn't prefault page tables.
POPULATE_WRITE is faster than simple writes and read/writes.
POPULATE_READ is faster than simple reads.
Without a fd, the fastest way to allocate backend storage and prefault
page tables is POPULATE_WRITE. With an fd, the fastest way is usually
FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one
exception are actual files: FALLOCATE+Read is slightly faster than
FALLOCATE+POPULATE_READ.
The fastest way to allocate backend storage prefault page tables is
FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then,
FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as
dirty.
v.3: Detailed results
==================================================
2 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB : Read : 0.119 ms
Anon 4 KiB : Write : 0.222 ms
Anon 4 KiB : Read/Write : 0.380 ms
Anon 4 KiB : POPULATE_READ : 0.060 ms
Anon 4 KiB : POPULATE_WRITE : 0.158 ms
Memfd 4 KiB : Read : 0.034 ms
Memfd 4 KiB : Write : 0.310 ms
Memfd 4 KiB : Read/Write : 0.362 ms
Memfd 4 KiB : POPULATE_READ : 0.039 ms
Memfd 4 KiB : POPULATE_WRITE : 0.229 ms
Memfd 2 MiB : Read : 0.030 ms
Memfd 2 MiB : Write : 0.030 ms
Memfd 2 MiB : Read/Write : 0.030 ms
Memfd 2 MiB : POPULATE_READ : 0.030 ms
Memfd 2 MiB : POPULATE_WRITE : 0.030 ms
tmpfs : Read : 0.033 ms
tmpfs : Write : 0.313 ms
tmpfs : Read/Write : 0.406 ms
tmpfs : POPULATE_READ : 0.039 ms
tmpfs : POPULATE_WRITE : 0.285 ms
file : Read : 0.033 ms
file : Write : 0.351 ms
file : Read/Write : 0.408 ms
file : POPULATE_READ : 0.039 ms
file : POPULATE_WRITE : 0.290 ms
hugetlbfs : Read : 0.030 ms
hugetlbfs : Write : 0.030 ms
hugetlbfs : Read/Write : 0.030 ms
hugetlbfs : POPULATE_READ : 0.030 ms
hugetlbfs : POPULATE_WRITE : 0.030 ms
**************************************************
4096 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB : Read : 237.940 ms
Anon 4 KiB : Write : 708.409 ms
Anon 4 KiB : Read/Write : 1054.041 ms
Anon 4 KiB : POPULATE_READ : 124.310 ms
Anon 4 KiB : POPULATE_WRITE : 572.582 ms
Memfd 4 KiB : Read : 136.928 ms
Memfd 4 KiB : Write : 963.898 ms
Memfd 4 KiB : Read/Write : 1106.561 ms
Memfd 4 KiB : POPULATE_READ : 78.450 ms
Memfd 4 KiB : POPULATE_WRITE : 805.881 ms
Memfd 2 MiB : Read : 357.116 ms
Memfd 2 MiB : Write : 357.210 ms
Memfd 2 MiB : Read/Write : 357.606 ms
Memfd 2 MiB : POPULATE_READ : 356.094 ms
Memfd 2 MiB : POPULATE_WRITE : 356.937 ms
tmpfs : Read : 137.536 ms
tmpfs : Write : 954.362 ms
tmpfs : Read/Write : 1105.954 ms
tmpfs : POPULATE_READ : 80.289 ms
tmpfs : POPULATE_WRITE : 822.826 ms
file : Read : 137.874 ms
file : Write : 987.025 ms
file : Read/Write : 1107.439 ms
file : POPULATE_READ : 80.413 ms
file : POPULATE_WRITE : 857.622 ms
hugetlbfs : Read : 355.607 ms
hugetlbfs : Write : 355.729 ms
hugetlbfs : Read/Write : 356.127 ms
hugetlbfs : POPULATE_READ : 354.585 ms
hugetlbfs : POPULATE_WRITE : 355.138 ms
**************************************************
2 MiB MAP_SHARED:
**************************************************
Anon 4 KiB : Read : 0.394 ms
Anon 4 KiB : Write : 0.348 ms
Anon 4 KiB : Read/Write : 0.400 ms
Anon 4 KiB : POPULATE_READ : 0.326 ms
Anon 4 KiB : POPULATE_WRITE : 0.273 ms
Anon 2 MiB : Read : 0.030 ms
Anon 2 MiB : Write : 0.030 ms
Anon 2 MiB : Read/Write : 0.030 ms
Anon 2 MiB : POPULATE_READ : 0.030 ms
Anon 2 MiB : POPULATE_WRITE : 0.030 ms
Memfd 4 KiB : Read : 0.412 ms
Memfd 4 KiB : Write : 0.372 ms
Memfd 4 KiB : Read/Write : 0.419 ms
Memfd 4 KiB : POPULATE_READ : 0.343 ms
Memfd 4 KiB : POPULATE_WRITE : 0.288 ms
Memfd 4 KiB : FALLOCATE : 0.137 ms
Memfd 4 KiB : FALLOCATE+Read : 0.446 ms
Memfd 4 KiB : FALLOCATE+Write : 0.330 ms
Memfd 4 KiB : FALLOCATE+Read/Write : 0.454 ms
Memfd 4 KiB : FALLOCATE+POPULATE_READ : 0.379 ms
Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 0.268 ms
Memfd 2 MiB : Read : 0.030 ms
Memfd 2 MiB : Write : 0.030 ms
Memfd 2 MiB : Read/Write : 0.030 ms
Memfd 2 MiB : POPULATE_READ : 0.030 ms
Memfd 2 MiB : POPULATE_WRITE : 0.030 ms
Memfd 2 MiB : FALLOCATE : 0.030 ms
Memfd 2 MiB : FALLOCATE+Read : 0.031 ms
Memfd 2 MiB : FALLOCATE+Write : 0.031 ms
Memfd 2 MiB : FALLOCATE+Read/Write : 0.031 ms
Memfd 2 MiB : FALLOCATE+POPULATE_READ : 0.030 ms
Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 0.030 ms
tmpfs : Read : 0.416 ms
tmpfs : Write : 0.369 ms
tmpfs : Read/Write : 0.425 ms
tmpfs : POPULATE_READ : 0.346 ms
tmpfs : POPULATE_WRITE : 0.295 ms
tmpfs : FALLOCATE : 0.139 ms
tmpfs : FALLOCATE+Read : 0.447 ms
tmpfs : FALLOCATE+Write : 0.333 ms
tmpfs : FALLOCATE+Read/Write : 0.454 ms
tmpfs : FALLOCATE+POPULATE_READ : 0.380 ms
tmpfs : FALLOCATE+POPULATE_WRITE : 0.272 ms
file : Read : 0.191 ms
file : Write : 0.511 ms
file : Read/Write : 0.524 ms
file : POPULATE_READ : 0.196 ms
file : POPULATE_WRITE : 0.434 ms
file : FALLOCATE : 0.004 ms
file : FALLOCATE+Read : 0.197 ms
file : FALLOCATE+Write : 0.554 ms
file : FALLOCATE+Read/Write : 0.480 ms
file : FALLOCATE+POPULATE_READ : 0.201 ms
file : FALLOCATE+POPULATE_WRITE : 0.381 ms
hugetlbfs : Read : 0.030 ms
hugetlbfs : Write : 0.030 ms
hugetlbfs : Read/Write : 0.030 ms
hugetlbfs : POPULATE_READ : 0.030 ms
hugetlbfs : POPULATE_WRITE : 0.030 ms
hugetlbfs : FALLOCATE : 0.030 ms
hugetlbfs : FALLOCATE+Read : 0.031 ms
hugetlbfs : FALLOCATE+Write : 0.031 ms
hugetlbfs : FALLOCATE+Read/Write : 0.030 ms
hugetlbfs : FALLOCATE+POPULATE_READ : 0.030 ms
hugetlbfs : FALLOCATE+POPULATE_WRITE : 0.030 ms
**************************************************
4096 MiB MAP_SHARED:
**************************************************
Anon 4 KiB : Read : 1053.090 ms
Anon 4 KiB : Write : 913.642 ms
Anon 4 KiB : Read/Write : 1060.350 ms
Anon 4 KiB : POPULATE_READ : 893.691 ms
Anon 4 KiB : POPULATE_WRITE : 782.885 ms
Anon 2 MiB : Read : 358.553 ms
Anon 2 MiB : Write : 358.419 ms
Anon 2 MiB : Read/Write : 357.992 ms
Anon 2 MiB : POPULATE_READ : 357.533 ms
Anon 2 MiB : POPULATE_WRITE : 357.808 ms
Memfd 4 KiB : Read : 1078.144 ms
Memfd 4 KiB : Write : 942.036 ms
Memfd 4 KiB : Read/Write : 1100.391 ms
Memfd 4 KiB : POPULATE_READ : 925.829 ms
Memfd 4 KiB : POPULATE_WRITE : 804.394 ms
Memfd 4 KiB : FALLOCATE : 304.632 ms
Memfd 4 KiB : FALLOCATE+Read : 1163.359 ms
Memfd 4 KiB : FALLOCATE+Write : 933.186 ms
Memfd 4 KiB : FALLOCATE+Read/Write : 1187.304 ms
Memfd 4 KiB : FALLOCATE+POPULATE_READ : 1013.660 ms
Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 794.560 ms
Memfd 2 MiB : Read : 358.131 ms
Memfd 2 MiB : Write : 358.099 ms
Memfd 2 MiB : Read/Write : 358.250 ms
Memfd 2 MiB : POPULATE_READ : 357.563 ms
Memfd 2 MiB : POPULATE_WRITE : 357.334 ms
Memfd 2 MiB : FALLOCATE : 356.735 ms
Memfd 2 MiB : FALLOCATE+Read : 358.152 ms
Memfd 2 MiB : FALLOCATE+Write : 358.331 ms
Memfd 2 MiB : FALLOCATE+Read/Write : 358.018 ms
Memfd 2 MiB : FALLOCATE+POPULATE_READ : 357.286 ms
Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 357.523 ms
tmpfs : Read : 1087.265 ms
tmpfs : Write : 950.840 ms
tmpfs : Read/Write : 1107.567 ms
tmpfs : POPULATE_READ : 922.605 ms
tmpfs : POPULATE_WRITE : 810.094 ms
tmpfs : FALLOCATE : 306.320 ms
tmpfs : FALLOCATE+Read : 1169.796 ms
tmpfs : FALLOCATE+Write : 933.730 ms
tmpfs : FALLOCATE+Read/Write : 1191.610 ms
tmpfs : FALLOCATE+POPULATE_READ : 1020.474 ms
tmpfs : FALLOCATE+POPULATE_WRITE : 798.945 ms
file : Read : 654.101 ms
file : Write : 1259.142 ms
file : Read/Write : 1289.509 ms
file : POPULATE_READ : 661.642 ms
file : POPULATE_WRITE : 1106.816 ms
file : FALLOCATE : 1.864 ms
file : FALLOCATE+Read : 656.328 ms
file : FALLOCATE+Write : 1153.300 ms
file : FALLOCATE+Read/Write : 1180.613 ms
file : FALLOCATE+POPULATE_READ : 668.347 ms
file : FALLOCATE+POPULATE_WRITE : 996.143 ms
hugetlbfs : Read : 357.245 ms
hugetlbfs : Write : 357.413 ms
hugetlbfs : Read/Write : 357.120 ms
hugetlbfs : POPULATE_READ : 356.321 ms
hugetlbfs : POPULATE_WRITE : 356.693 ms
hugetlbfs : FALLOCATE : 355.927 ms
hugetlbfs : FALLOCATE+Read : 357.074 ms
hugetlbfs : FALLOCATE+Write : 357.120 ms
hugetlbfs : FALLOCATE+Read/Write : 356.983 ms
hugetlbfs : FALLOCATE+POPULATE_READ : 356.413 ms
hugetlbfs : FALLOCATE+POPULATE_WRITE : 356.266 ms
**************************************************
[1] https://lkml.org/lkml/2013/6/27/698
[akpm@linux-foundation.org: coding style fixes]
Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently try_to_unmap() return bool value by checking page_mapcount(),
however this may return false positive since page_mapcount() doesn't check
all subpages of compound page. The total_mapcount() could be used
instead, but its cost is higher since it traverses all subpages.
Actually the most callers of try_to_unmap() don't care about the return
value at all. So just need check if page is still mapped by page_mapped()
when necessary. And page_mapped() does bail out early when it finds
mapped subpage.
Link: https://lkml.kernel.org/r/bb27e3fe-6036-b637-5086-272befbfe3da@google.com
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When the THP NUMA fault support was added THP migration was not supported
yet. So the ad hoc THP migration was implemented in NUMA fault handling.
Since v4.14 THP migration has been supported so it doesn't make too much
sense to still keep another THP migration implementation rather than using
the generic migration code.
This patch reworks the NUMA fault handling to use generic migration
implementation to migrate misplaced page. There is no functional change.
After the refactor the flow of NUMA fault handling looks just like its
PTE counterpart:
Acquire ptl
Prepare for migration (elevate page refcount)
Release ptl
Isolate page from lru and elevate page refcount
Migrate the misplaced THP
If migration fails just restore the old normal PMD.
In the old code anon_vma lock was needed to serialize THP migration
against THP split, but since then the THP code has been reworked a lot, it
seems anon_vma lock is not required anymore to avoid the race.
The page refcount elevation when holding ptl should prevent from THP
split.
Use migrate_misplaced_page() for both base page and THP NUMA hinting fault
and remove all the dead and duplicate code.
[dan.carpenter@oracle.com: fix a double unlock bug]
Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda
Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pach series "mm: thp: use generic THP migration for NUMA hinting fault", v3.
When the THP NUMA fault support was added THP migration was not supported
yet. So the ad hoc THP migration was implemented in NUMA fault handling.
Since v4.14 THP migration has been supported so it doesn't make too much
sense to still keep another THP migration implementation rather than using
the generic migration code. It is definitely a maintenance burden to keep
two THP migration implementation for different code paths and it is more
error prone. Using the generic THP migration implementation allows us
remove the duplicate code and some hacks needed by the old ad hoc
implementation.
A quick grep shows x86_64, PowerPC (book3s), ARM64 ans S390 support both
THP and NUMA balancing. The most of them support THP migration except for
S390. Zi Yan tried to add THP migration support for S390 before but it
was not accepted due to the design of S390 PMD. For the discussion,
please see: https://lkml.org/lkml/2018/4/27/953.
Per the discussion with Gerald Schaefer in v1 it is acceptible to skip
huge PMD for S390 for now.
I saw there were some hacks about gup from git history, but I didn't
figure out if they have been removed or not since I just found FOLL_NUMA
code in the current gup implementation and they seems useful.
Patch #1 ~ #2 are preparation patches.
Patch #3 is the real meat.
Patch #4 ~ #6 keep consistent counters and behaviors with before.
Patch #7 skips change huge PMD to prot_none if thp migration is not supported.
Test
----
Did some tests to measure the latency of do_huge_pmd_numa_page. The test
VM has 80 vcpus and 64G memory. The test would create 2 processes to
consume 128G memory together which would incur memory pressure to cause
THP splits. And it also creates 80 processes to hog cpu, and the memory
consumer processes are bound to different nodes periodically in order to
increase NUMA faults.
The below test script is used:
echo 3 > /proc/sys/vm/drop_caches
# Run stress-ng for 24 hours
./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h &
PID=$!
./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &
# Wait for vm stressors forked
sleep 5
PID_1=`pgrep -P $PID | awk 'NR == 1'`
PID_2=`pgrep -P $PID | awk 'NR == 2'`
JOB1=`pgrep -P $PID_1`
JOB2=`pgrep -P $PID_2`
# Bind load jobs to different nodes periodically to force generate
# cross node memory access
while [ -d "/proc/$PID" ]
do
taskset -apc 8 $JOB1
taskset -apc 8 $JOB2
sleep 300
taskset -apc 58 $JOB1
taskset -apc 58 $JOB2
sleep 300
done
With the above test the histogram of latency of do_huge_pmd_numa_page is
as shown below. Since the number of do_huge_pmd_numa_page varies
drastically for each run (should be due to scheduler), so I converted the
raw number to percentage.
patched base
@us[stress-ng]:
[0] 3.57% 0.16%
[1] 55.68% 18.36%
[2, 4) 10.46% 40.44%
[4, 8) 7.26% 17.82%
[8, 16) 21.12% 13.41%
[16, 32) 1.06% 4.27%
[32, 64) 0.56% 4.07%
[64, 128) 0.16% 0.35%
[128, 256) < 0.1% < 0.1%
[256, 512) < 0.1% < 0.1%
[512, 1K) < 0.1% < 0.1%
[1K, 2K) < 0.1% < 0.1%
[2K, 4K) < 0.1% < 0.1%
[4K, 8K) < 0.1% < 0.1%
[8K, 16K) < 0.1% < 0.1%
[16K, 32K) < 0.1% < 0.1%
[32K, 64K) < 0.1% < 0.1%
Per the result, patched kernel is even slightly better than the base
kernel. I think this is because the lock contention against THP split is
less than base kernel due to the refactor.
To exclude the affect from THP split, I also did test w/o memory pressure.
No obvious regression is spotted. The below is the test result *w/o*
memory pressure.
patched base
@us[stress-ng]:
[0] 7.97% 18.4%
[1] 69.63% 58.24%
[2, 4) 4.18% 2.63%
[4, 8) 0.22% 0.17%
[8, 16) 1.03% 0.92%
[16, 32) 0.14% < 0.1%
[32, 64) < 0.1% < 0.1%
[64, 128) < 0.1% < 0.1%
[128, 256) < 0.1% < 0.1%
[256, 512) 0.45% 1.19%
[512, 1K) 15.45% 17.27%
[1K, 2K) < 0.1% < 0.1%
[2K, 4K) < 0.1% < 0.1%
[4K, 8K) < 0.1% < 0.1%
[8K, 16K) 0.86% 0.88%
[16K, 32K) < 0.1% 0.15%
[32K, 64K) < 0.1% < 0.1%
[64K, 128K) < 0.1% < 0.1%
[128K, 256K) < 0.1% < 0.1%
The series also survived a series of tests that exercise NUMA balancing
migrations by Mel.
This patch (of 7):
Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by huge
page fault could be removed, just like its PTE counterpart does.
Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since commit d6995da31122 ("hugetlb: use page.private for hugetlb specific
page flags") converts page.private for hugetlb specific page flags. We
should use hugetlb_page_subpool() to get the subpool pointer instead of
page_private().
This 'could' prevent the migration of hugetlb pages. page_private(hpage)
is now used for hugetlb page specific flags. At migration time, the only
flag which could be set is HPageVmemmapOptimized. This flag will only be
set if the new vmemmap reduction feature is enabled. In addition,
!page_mapping() implies an anonymous mapping. So, this will prevent
migration of hugetb pages in anonymous mappings if the vmemmap reduction
feature is enabled.
In addition, that if statement checked for the rare race condition of a
page being migrated while in the process of being freed. Since that check
is now wrong, we could leak hugetlb subpool usage counts.
The commit forgot to update it in the page migration routine. So fix it.
[songmuchun@bytedance.com: fix compiler error when !CONFIG_HUGETLB_PAGE reported by Randy]
Link: https://lkml.kernel.org/r/20210521022747.35736-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210520025949.1866-1-songmuchun@bytedance.com
Fixes: d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reported-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Anshuman Khandual <anshuman.khandual@arm.com> [arm64]
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
CONFIG_SPARSEMEM_VMEMMAP is now the only available memory model on arm64
platforms and free_unused_memmap() would just return without creating any
holes in the memmap mapping. There is no need for any special handling in
pfn_valid() and HAVE_ARCH_PFN_VALID can just be dropped. This also moves
the pfn upper bits sanity check into generic pfn_valid().
Link: https://lkml.kernel.org/r/1621947349-25421-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The struct pages representing a reserved memory region are initialized
using reserve_bootmem_range() function. This function is called for each
reserved region just before the memory is freed from memblock to the buddy
page allocator.
The struct pages for MEMBLOCK_NOMAP regions are kept with the default
values set by the memory map initialization which makes it necessary to
have a special treatment for such pages in pfn_valid() and
pfn_valid_within().
Split out initialization of the reserved pages to a function with a
meaningful name and treat the MEMBLOCK_NOMAP regions the same way as the
reserved regions and mark struct pages for the NOMAP regions as
PageReserved.
Link: https://lkml.kernel.org/r/20210511100550.28178-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "arm64: drop pfn_valid_within() and simplify pfn_valid()", v4.
These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire
pfn_valid_within() to 1.
The idea is to mark NOMAP pages as reserved in the memory map and restore
the intended semantics of pfn_valid() to designate availability of struct
page for a pfn.
With this the core mm will be able to cope with the fact that it cannot
use NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER
blocks will be treated correctly even without the need for
pfn_valid_within.
This patch (of 4):
Add comment describing the semantics of pfn_valid() that clarifies that
pfn_valid() only checks for availability of a memory map entry (i.e.
struct page) for a PFN rather than availability of usable memory backing
that PFN.
The most "generic" version of pfn_valid() used by the configurations with
SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most
suitable place for documentation about semantics of pfn_valid().
Link: https://lkml.kernel.org/r/20210511100550.28178-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210511100550.28178-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Suggested-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Current structure 'mempolicy' uses a union to store the node info for
bind/interleave/perfer policies.
union {
short preferred_node; /* preferred */
nodemask_t nodes; /* interleave/bind */
/* undefined for default */
} v;
Since preferred node can also be represented by a nodemask_t with only ont
bit set, unify these policies with using one nodemask_t 'nodes', which can
remove a union, simplify the code and make it easier to support future's
new policy's node info.
Link: https://lore.kernel.org/r/20200630212517.308045-7-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1623399825-75651-1-git-send-email-feng.tang@intel.com
Co-developed-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
MPOL_LOCAL policy has been setup as a real policy, but it is still handled
like a faked POL_PREFERRED policy with one internal MPOL_F_LOCAL flag bit
set, and there are many places having to judge the real 'prefer' or the
'local' policy, which are quite confusing.
In current code, there are 4 cases that MPOL_LOCAL are used:
1. user specifies 'local' policy
2. user specifies 'prefer' policy, but with empty nodemask
3. system 'default' policy is used
4. 'prefer' policy + valid 'preferred' node with MPOL_F_STATIC_NODES
flag set, and when it is 'rebind' to a nodemask which doesn't contains
the 'preferred' node, it will perform as 'local' policy
So make 'local' a real policy instead of a fake 'prefer' one, and kill
MPOL_F_LOCAL bit, which can greatly reduce the confusion for code reading.
For case 4, the logic of mpol_rebind_preferred() is confusing, as Michal
Hocko pointed out:
: I do believe that rebinding preferred policy is just bogus and it should
: be dropped altogether on the ground that a preference is a mere hint from
: userspace where to start the allocation. Unless I am missing something
: cpusets will be always authoritative for the final placement. The
: preferred node just acts as a starting point and it should be really
: preserved when cpusets changes. Otherwise we have a very subtle behavior
: corner cases.
So dump all the tricky transformation between 'prefer' and 'local', and
just record the new nodemask of rebinding.
[feng.tang@intel.com: fix a problem in mpol_set_nodemask(), per Michal Hocko]
Link: https://lkml.kernel.org/r/1622560492-1294-3-git-send-email-feng.tang@intel.com
[feng.tang@intel.com: refine code and comments of mpol_set_nodemask(), per Michal]
Link: https://lkml.kernel.org/r/20210603081807.GE56979@shbuild999.sh.intel.com
Link: https://lkml.kernel.org/r/1622469956-82897-3-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm/mempolicy: some fix and semantics cleanup", v4.
Current memory policy code has some confusing and ambiguous part about
MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and
there are many places having to distinguish them. Also the nodemask
intersection check needs cleanup to be more explicit for OOM use, and
handle MPOL_INTERLEAVE correctly. This patchset cleans up these and
unifies the parameter sanity check for mbind() and set_mempolicy().
This patch (of 3):
mempolicy_nodemask_intersects seem to be a general purpose mempolicy
function. In fact it is partially tailored for the OOM purpose
instead. The oom proper is the only existing user so rename the
function to make that purpose explicit.
While at it drop the MPOL_INTERLEAVE as those allocations never has a
nodemask defined (see alloc_page_interleave) so this is a dead code and
a confusing one because MPOL_INTERLEAVE is a hint rather than a hard
requirement so it shouldn't be considered during the OOM.
The final code can be reduced to a check for MPOL_BIND which is the
only memory policy that is a hard requirement and thus relevant to a
constrained OOM logic.
[mhocko@suse.com: changelog edits]
Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com
Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com
Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com
Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The zbud doesn't need to export any API and it is meant to be used via
zpool API since the commit 12d79d64bfd3 ("mm/zpool: update zswap to use
zpool"). So we can remove the unneeded zbud.h and move down zpool API to
avoid any forward declaration.
[linmiaohe@huawei.com: fix unused function warnings when CONFIG_ZPOOL is disabled]
Link: https://lkml.kernel.org/r/20210619025508.1239386-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210608114515.206992-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
PageOffline()
A driver might set a page logically offline -- PageOffline() -- and turn
the page inaccessible in the hypervisor; after that, access to page
content can be fatal. One example is virtio-mem; while unplugged memory
-- marked as PageOffline() can currently be read in the hypervisor, this
will no longer be the case in the future; for example, when having a
virtio-mem device backed by huge pages in the hypervisor.
Some special PFN walkers -- i.e., /proc/kcore -- read content of random
pages after checking PageOffline(); however, these PFN walkers can race
with drivers that set PageOffline().
Let's introduce page_offline_(begin|end|freeze|thaw) for synchronizing.
page_offline_freeze()/page_offline_thaw() allows for a subsystem to
synchronize with such drivers, achieving that a page cannot be set
PageOffline() while frozen.
page_offline_begin()/page_offline_end() is used by drivers that care about
such races when setting a page PageOffline().
For simplicity, use a rwsem for now; neither drivers nor users are
performance sensitive.
Link: https://lkml.kernel.org/r/20210526093041.8800-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
hwpoisoned pages
Let's avoid reading:
1) Offline memory sections: the content of offline memory sections is
stale as the memory is effectively unused by the kernel. On s390x with
standby memory, offline memory sections (belonging to offline storage
increments) are not accessible. With virtio-mem and the hyper-v
balloon, we can have unavailable memory chunks that should not be
accessed inside offline memory sections. Last but not least, offline
memory sections might contain hwpoisoned pages which we can no longer
identify because the memmap is stale.
2) PG_offline pages: logically offline pages that are documented as
"The content of these pages is effectively stale. Such pages should
not be touched (read/write/dump/save) except by their owner.".
Examples include pages inflated in a balloon or unavailble memory
ranges inside hotplugged memory sections with virtio-mem or the hyper-v
balloon.
3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
As documented: "Accessing is not safe since it may cause another
machine check. Don't touch!"
Introduce is_page_hwpoison(), adding a comment that it is inherently racy
but best we can really do.
Reading /proc/kcore now performs similar checks as when reading
/proc/vmcore for kdump via makedumpfile: problematic pages are exclude.
It's also similar to hibernation code, however, we don't skip hwpoisoned
pages when processing pages in kernel/power/snapshot.c:saveable_page()
yet.
Note 1: we can race against memory offlining code, especially memory going
offline and getting unplugged: however, we will properly tear down the
identity mapping and handle faults gracefully when accessing this memory
from kcore code.
Note 2: we can race against drivers setting PageOffline() and turning
memory inaccessible in the hypervisor. We'll handle this in a follow-up
patch.
Link: https://lkml.kernel.org/r/20210526093041.8800-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|