Age | Commit message (Collapse) | Author |
|
When ext4 file systems were created intentionally with 128 byte inodes,
the rate-limited warning of eventual possible timestamp overflow are
still emitted rather frequently. Remove the warning for now.
Discussion for whether any warning is needed,
and where it should be emitted, can be found at
https://lore.kernel.org/lkml/1567523922.5576.57.camel@lca.pw/.
I can post a separate follow-up patch after the conclusion.
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
The series is an update and a more complete version of the
previously posted series at
https://lore.kernel.org/linux-fsdevel/20180122020426.2988-1-deepa.kernel@gmail.com/
Thanks to Arnd Bergmann for doing a few preliminary reviews.
They helped me fix a few issues I had overlooked.
The limits (sometimes granularity also) for the filesystems updated here are according to the
following table:
File system Time type Start year Expiration year Granularity
cramfs fixed 0 0
romfs fixed 0 0
pstore ascii seconds (27 digit ascii) S64_MIN S64_MAX 1
coda INT64 S64_MIN S64_MAX 1
omfs 64-bit milliseconds 0 U64_MAX/ 1000 NSEC_PER_MSEC
befs unsigned 48-bit seconds 0 0xffffffffffff alloc_super
bfs unsigned 32-bit seconds 0 U32_MAX alloc_super
efs unsigned 32-bit seconds 0 U32_MAX alloc_super
ext2 signed 32-bit seconds S32_MIN S32_MAX alloc_super
ext3 signed 32-bit seconds S32_MIN S32_MAX alloc_super
ext4 (old) signed 32-bit seconds S32_MIN S32_MAX alloc_super
ext4 (extra) 34-bit seconds, 30-bit ns S32_MIN 0x37fffffff 1
freevxfs u32 secs/usecs 0 U32_MAX alloc_super
jffs2 unsigned 32-bit seconds 0 U32_MAX alloc_super
jfs unsigned 32-bit seconds/ns 0 U32_MAX 1
minix unsigned 32-bit seconds 0 U32_MAX alloc_super
qnx4 unsigned 32-bit seconds 0 U32_MAX alloc_super
qnx6 unsigned 32-bit seconds 0 U32_MAX alloc_super
reiserfs unsigned 32-bit seconds 0 U32_MAX alloc_super
squashfs unsigned 32-bit seconds 0 U32_MAX alloc_super
ufs1 signed 32-bit seconds S32_MIN S32_MAX NSEC_PER_SEC
ufs2 signed 64-bit seconds/u32 ns S64_MIN S64_MAX 1
xfs signed 32-bit seconds/ns S32_MIN S32_MAX 1
ceph unsigned 32-bit second/ns 0 U32_MAX 1000
sysv unsigned 32-bit seconds 0 U32_MAX alloc_super
affs u32 day, min, ticks 1978 u32_max days NSEC_PER_SEC
nfsv2 unsigned 32-bit seconds/ns 0 U32_MAX 1
nfsv3 unsigned 32-bit seconds/ns 0 U32_MAX 1000
nfsv4 u64 seconds/u32 ns S64_MIN S64_MAX 1000
isofs u8 year since 1900 (fixable) 1900 2155 alloc_super
hpfs unsigned 32-bit seconds 1970 2106 alloc_super
fat 7-bit years, 2s resolution 1980 2107
cifs (smb) 7-bit years 1980 2107
cifs (modern) 64-bit 100ns since 1601 1601 30828
9p (9P2000) unsigned 32-bit seconds 1970 2106
9p (9P2000.L) signed 64-bit seconds, ns 1970 S64_MAX
Granularity column filled in by the alloc_super() in the above table indicates that
the granularity is NSEC_PER_SEC.
Note that anything not mentioned above still has the default limits
S64_MIN..S64_MAX.
The patches in the series are as structured below:
1. Add vfs support to maintain the limits per filesystem.
2. Add a new timestamp_truncate() api for clamping timestamps
according to the filesystem limits.
3. Add a warning for mount syscall to indicate the impending
expiry of timestamps.
4. Modify utimes to clamp the timestamps.
5. Fill in limits for filesystems.
A test for checking file system timestamp limits has been posted
at https://www.spinics.net/lists/fstests/msg12262.html
Changes since v8:
* Dropped orangefs patch because of ongoing discussion
* Dropped adfs patch because a conflict with Russell's ADFS series
Changes since v7:
* Dropped fat modifications from timespec_truncate patch
* Leverage timestamp_truncate function in utimes
* Added a fix for pstore ramoops timestamps
* Added ext4 warning for inodes without room for extended timestamps.
* Made mount warning more human readable
Changes since v6:
* No change in mount behavior because of expiry of timestamps.
* Included limits for more filesystems.
Changes since v5:
* Dropped y2038-specific changes
Changes since v4:
* Added documentation for boot param
Changes since v3:
* Remove redundant initializations in libfs.c
* Change early_param to __setup similar to other root mount options.
* Fix documentation warning
Changes since v2:
* Introduce early boot param override for checks.
* Drop afs patch for timestamp limits.
Changes since v1:
* return EROFS on mount errors
* fix mtime copy/paste error in utimes
* 'limits' of https://github.com/deepa-hub/vfs:
isofs: Initialize filesystem timestamp ranges
pstore: fs superblock limits
fs: omfs: Initialize filesystem timestamp ranges
fs: hpfs: Initialize filesystem timestamp ranges
fs: ceph: Initialize filesystem timestamp ranges
fs: sysv: Initialize filesystem timestamp ranges
fs: affs: Initialize filesystem timestamp ranges
fs: fat: Initialize filesystem timestamp ranges
fs: cifs: Initialize filesystem timestamp ranges
fs: nfs: Initialize filesystem timestamp ranges
ext4: Initialize timestamps limits
9p: Fill min and max timestamps in sb
fs: Fill in max and min timestamps in superblock
utimes: Clamp the timestamps before update
mount: Add mount warning for impending timestamp expiry
timestamp_truncate: Replace users of timespec64_trunc
vfs: Add timestamp_truncate() api
vfs: Add file timestamp range support
Link: https://lore.kernel.org/lkml/20190730014924.2193-1-deepa.kernel@gmail.com/
Link: https://lore.kernel.org/lkml/20190830154744.4868-1-deepa.kernel@gmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Fill in the appropriate limits to avoid inconsistencies
in the vfs cached inode times when timestamps are
outside the permitted range.
Reference: http://www.ecma-international.org/publications/standards/Ecma-119.htm
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
|
|
Leaving granularity at 1ns because it is dependent on the specific
attached backing pstore module. ramoops has microsecond resolution.
Fix the readback of ramoops fractional timestamp microseconds,
which has incorrectly been reporting the value as nanoseconds.
Fixes: 3f8f80f0cfeb ("pstore/ram: Read and write to the 'compressed' flag of pstore").
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: anton@enomsg.org
Cc: ccross@android.com
Cc: keescook@chromium.org
Cc: tony.luck@intel.com
|
|
Fill in the appropriate limits to avoid inconsistencies
in the vfs cached inode times when timestamps are
outside the permitted range.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Bob Copeland <me@bobcopeland.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: me@bobcopeland.com
Cc: linux-karma-devel@lists.sourceforge.net
|
|
Fill in the appropriate limits to avoid inconsistencies
in the vfs cached inode times when timestamps are
outside the permitted range.
Also change the local_to_gmt() to use time64_t instead
of time32_t.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: mikulas@artax.karlin.mff.cuni.cz
|
|
Fill in the appropriate limits to avoid inconsistencies
in the vfs cached inode times when timestamps are
outside the permitted range.
According to the disscussion in
https://patchwork.kernel.org/patch/8308691/ we agreed to use
unsigned 32 bit timestamps on ceph.
Update the limits accordingly.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: zyan@redhat.com
Cc: sage@redhat.com
Cc: idryomov@gmail.com
Cc: ceph-devel@vger.kernel.org
|
|
Fill in the appropriate limits to avoid inconsistencies
in the vfs cached inode times when timestamps are
outside the permitted range.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: hch@infradead.org
|
|
Fill in the appropriate limits to avoid inconsistencies
in the vfs cached inode times when timestamps are
outside the permitted range.
Also fix timestamp calculation to avoid overflow
while converting from days to seconds.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: dsterba@suse.com
|
|
Fill in the appropriate limits to avoid inconsistencies
in the vfs cached inode times when timestamps are
outside the permitted range.
Some FAT variants indicate that the years after 2099 are not supported.
Since commit 7decd1cb0305 ("fat: Fix and cleanup timestamp conversion")
we support the full range of years that can be represented, up to 2107.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: hirofumi@mail.parknet.co.jp
|
|
Fill in the appropriate limits to avoid inconsistencies
in the vfs cached inode times when timestamps are
outside the permitted range.
Also fixed cnvrtDosUnixTm calculations to avoid int overflow
while computing maximum date.
References:
http://cifs.com/
https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-cifs/d416ff7c-c536-406e-a951-4f04b2fd1d2b
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: sfrench@samba.org
Cc: linux-cifs@vger.kernel.org
|
|
Fill in the appropriate limits to avoid inconsistencies
in the vfs cached inode times when timestamps are
outside the permitted range.
The time formats for various verious is detailed in the
RFCs as below:
https://tools.ietf.org/html/rfc7862(time metadata)
https://tools.ietf.org/html/rfc7530:
nfstime4
struct nfstime4 {
int64_t seconds;
uint32_t nseconds;
};
https://tools.ietf.org/html/rfc1094
struct timeval {
unsigned int seconds;
unsigned int useconds;
};
https://tools.ietf.org/html/rfc1813
struct nfstime3 {
uint32 seconds;
uint32 nseconds;
};
Use the limits as per the RFC.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: trond.myklebust@hammerspace.com
Cc: anna.schumaker@netapp.com
Cc: linux-nfs@vger.kernel.org
|
|
ext4 has different overflow limits for max filesystem
timestamps based on the extra bytes available.
The timestamp limits are calculated according to the
encoding table in
a4dad1ae24f85i(ext4: Fix handling of extended tv_sec):
* extra msb of adjust for signed
* epoch 32-bit 32-bit tv_sec to
* bits time decoded 64-bit tv_sec 64-bit tv_sec valid time range
* 0 0 1 -0x80000000..-0x00000001 0x000000000 1901-12-13..1969-12-31
* 0 0 0 0x000000000..0x07fffffff 0x000000000 1970-01-01..2038-01-19
* 0 1 1 0x080000000..0x0ffffffff 0x100000000 2038-01-19..2106-02-07
* 0 1 0 0x100000000..0x17fffffff 0x100000000 2106-02-07..2174-02-25
* 1 0 1 0x180000000..0x1ffffffff 0x200000000 2174-02-25..2242-03-16
* 1 0 0 0x200000000..0x27fffffff 0x200000000 2242-03-16..2310-04-04
* 1 1 1 0x280000000..0x2ffffffff 0x300000000 2310-04-04..2378-04-22
* 1 1 0 0x300000000..0x37fffffff 0x300000000 2378-04-22..2446-05-10
Note that the time limits are not correct for deletion times.
Added a warn when an inode cannot be extended to incorporate an
extended timestamp.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: tytso@mit.edu
Cc: adilger.kernel@dilger.ca
Cc: linux-ext4@vger.kernel.org
|
|
struct p9_wstat and struct p9_stat_dotl indicate that the
wire transport uses u32 and u64 fields for timestamps.
Fill in the appropriate limits to avoid inconsistencies in
the vfs cached inode times when timestamps are outside the
permitted range.
Note that the upper bound for V9FS_PROTO_2000L is retained as S64_MAX.
This is because that is the upper bound supported by vfs.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: ericvh@gmail.com
Cc: lucho@ionkov.net
Cc: asmadeus@codewreck.org
Cc: v9fs-developer@lists.sourceforge.net
|
|
Fill in the appropriate limits to avoid inconsistencies
in the vfs cached inode times when timestamps are
outside the permitted range.
Even though some filesystems are read-only, fill in the
timestamps to reflect the on-disk representation.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-By: Tigran Aivazian <aivazian.tigran@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: aivazian.tigran@gmail.com
Cc: al@alarsen.net
Cc: coda@cs.cmu.edu
Cc: darrick.wong@oracle.com
Cc: dushistov@mail.ru
Cc: dwmw2@infradead.org
Cc: hch@infradead.org
Cc: jack@suse.com
Cc: jaharkes@cs.cmu.edu
Cc: luisbg@kernel.org
Cc: nico@fluxnic.net
Cc: phillip@squashfs.org.uk
Cc: richard@nod.at
Cc: salah.triki@gmail.com
Cc: shaggy@kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: reiserfs-devel@vger.kernel.org
|
|
POSIX is ambiguous on the behavior of timestamps for
futimens, utimensat and utimes. Whether to return an
error or silently clamp a timestamp beyond the range
supported by the underlying filesystems is not clear.
POSIX.1 section for futimens, utimensat and utimes says:
(http://pubs.opengroup.org/onlinepubs/9699919799/functions/futimens.html)
The file's relevant timestamp shall be set to the greatest
value supported by the file system that is not greater
than the specified time.
If the tv_nsec field of a timespec structure has the special
value UTIME_NOW, the file's relevant timestamp shall be set
to the greatest value supported by the file system that is
not greater than the current time.
[EINVAL]
A new file timestamp would be a value whose tv_sec
component is not a value supported by the file system.
The patch chooses to clamp the timestamps according to the
filesystem timestamp ranges and does not return an error.
This is in line with the behavior of utime syscall also
since the POSIX page(http://pubs.opengroup.org/onlinepubs/009695399/functions/utime.html)
for utime does not mention returning an error or clamping like above.
Same for utimes http://pubs.opengroup.org/onlinepubs/009695399/functions/utimes.html
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
|
|
The warning reuses the uptime max of 30 years used by
settimeofday().
Note that the warning is only emitted for writable filesystem mounts
through the mount syscall. Automounts do not have the same warning.
Print out the warning in human readable format using the struct tm.
After discussion with Arnd Bergmann, we chose to print only the year number.
The raw s_time_max is also displayed, and the user can easily decode
it e.g. "date -u -d @$((0x7fffffff))". We did not want to consolidate
struct rtc_tm and struct tm just to print the date using a format specifier
as part of this series.
Given that the rtc_tm is not compiled on all architectures, this is not a
trivial patch. This can be added in the future.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
|
|
Update the inode timestamp updates to use timestamp_truncate()
instead of timespec64_trunc().
The change was mostly generated by the following coccinelle
script.
virtual context
virtual patch
@r1 depends on patch forall@
struct inode *inode;
identifier i_xtime =~ "^i_[acm]time$";
expression e;
@@
inode->i_xtime =
- timespec64_trunc(
+ timestamp_truncate(
...,
- e);
+ inode);
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: adrian.hunter@intel.com
Cc: dedekind1@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: hch@lst.de
Cc: jaegeuk@kernel.org
Cc: jlbec@evilplan.org
Cc: richard@nod.at
Cc: tj@kernel.org
Cc: yuchao0@huawei.com
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: linux-mtd@lists.infradead.org
|
|
timespec_trunc() function is used to truncate a
filesystem timestamp to the right granularity.
But, the function does not clamp tv_sec part of the
timestamps according to the filesystem timestamp limits.
The replacement api: timestamp_truncate() also alters the
signature of the function to accommodate filesystem
timestamp clamping according to flesystem limits.
Note that the tv_nsec part is set to 0 if tv_sec is not within
the range supported for the filesystem.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
|
|
Add fields to the superblock to track the min and max
timestamps supported by filesystems.
Initially, when a superblock is allocated, initialize
it to the max and min values the fields can hold.
Individual filesystems override these to match their
actual limits.
Pseudo filesystems are assumed to always support the
min and max allowable values for the fields.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
|
|
|
|
Pull auxdisplay cleanup from Miguel Ojeda:
"Make ht16k33_fb_fix and ht16k33_fb_var constant (Nishka Dasgupta)"
* tag 'auxdisplay-for-linus-v5.3-rc7' of git://github.com/ojeda/linux:
auxdisplay: ht16k33: Make ht16k33_fb_fix and ht16k33_fb_var constant
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML fix from Richard Weinberger:
"Fix time travel mode"
* tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: fix time travel mode
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBIFS and JFFS2 fixes from Richard Weinberger:
"UBIFS:
- Don't block too long in writeback_inodes_sb()
- Fix for a possible overrun of the log head
- Fix double unlock in orphan_delete()
JFFS2:
- Remove C++ style from UAPI header and unbreak picky toolchains"
* tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: Limit the number of pages in shrink_liability
ubifs: Correctly initialize c->min_log_bytes
ubifs: Fix double unlock around orphan_delete()
jffs2: Remove C++ style comments from uapi header
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A few fixes for x86:
- Fix a boot regression caused by the recent bootparam sanitizing
change, which escaped the attention of all people who reviewed that
code.
- Address a boot problem on machines with broken E820 tables caused
by an underflow which ended up placing the trampoline start at
physical address 0.
- Handle machines which do not advertise a legacy timer of any form,
but need calibration of the local APIC timer gracefully by making
the calibration routine independent from the tick interrupt. Marked
for stable as well as there seems to be quite some new laptops
rolled out which expose this.
- Clear the RDRAND CPUID bit on AMD family 15h and 16h CPUs which are
affected by broken firmware which does not initialize RDRAND
correctly after resume. Add a command line parameter to override
this for machine which either do not use suspend/resume or have a
fixed BIOS. Unfortunately there is no way to detect this on boot,
so the only safe decision is to turn it off by default.
- Prevent RFLAGS from being clobbers in CALL_NOSPEC on 32bit which
caused fast KVM instruction emulation to break.
- Explain the Intel CPU model naming convention so that the repeating
discussions come to an end"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/retpoline: Don't clobber RFLAGS during CALL_NOSPEC on i386
x86/boot: Fix boot regression caused by bootparam sanitizing
x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h
x86/boot/compressed/64: Fix boot on machines with broken E820 table
x86/apic: Handle missing global clockevent gracefully
x86/cpu: Explain Intel model naming convention
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping fix from Thomas Gleixner:
"A single fix for a regression caused by the generic VDSO
implementation where a math overflow causes CLOCK_BOOTTIME to become a
random number generator"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping/vsyscall: Prevent math overflow in BOOTTIME update
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
"Handle the worker management in situations where a task is scheduled
out on a PI lock contention correctly and schedule a new worker if
possible"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Schedule new worker even if PI-blocked
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"Two small fixes for kprobes and perf:
- Prevent a deadlock in kprobe_optimizer() causes by reverse lock
ordering
- Fix a comment typo"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kprobes: Fix potential deadlock in kprobe_optimizer()
perf/x86: Fix typo in comment
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"A single fix for a imbalanced kobject operation in the irq decriptor
code which was unearthed by the new warnings in the kobject code"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Properly pair kobject_del() with kobject_add()
|
|
Mergr misc fixes from Andrew Morton:
"11 fixes"
Mostly VM fixes, one psi polling fix, and one parisc build fix.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/kasan: fix false positive invalid-free reports with CONFIG_KASAN_SW_TAGS=y
mm/zsmalloc.c: fix race condition in zs_destroy_pool
mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitely
mm, page_owner: handle THP splits correctly
userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx
psi: get poll_work to run when calling poll syscall next time
mm: memcontrol: flush percpu vmevents before releasing memcg
mm: memcontrol: flush percpu vmstats before releasing memcg
parisc: fix compilation errrors
mm, page_alloc: move_freepages should not examine struct page of reserved memory
mm/z3fold.c: fix race between migration and destruction
|
|
Pull dma-mapping fixes from Christoph Hellwig:
"Two fixes for regressions in this merge window:
- select the Kconfig symbols for the noncoherent dma arch helpers on
arm if swiotlb is selected, not just for LPAE to not break then Xen
build, that uses swiotlb indirectly through swiotlb-xen
- fix the page allocator fallback in dma_alloc_contiguous if the CMA
allocation fails"
* tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mapping:
dma-direct: fix zone selection after an unaddressable CMA allocation
arm: select the dma-noncoherent symbols for all swiotlb builds
|
|
The code like this:
ptr = kmalloc(size, GFP_KERNEL);
page = virt_to_page(ptr);
offset = offset_in_page(ptr);
kfree(page_address(page) + offset);
may produce false-positive invalid-free reports on the kernel with
CONFIG_KASAN_SW_TAGS=y.
In the example above we lose the original tag assigned to 'ptr', so
kfree() gets the pointer with 0xFF tag. In kfree() we check that 0xFF
tag is different from the tag in shadow hence print false report.
Instead of just comparing tags, do the following:
1) Check that shadow doesn't contain KASAN_TAG_INVALID. Otherwise it's
double-free and it doesn't matter what tag the pointer have.
2) If pointer tag is different from 0xFF, make sure that tag in the
shadow is the same as in the pointer.
Link: http://lkml.kernel.org/r/20190819172540.19581-1-aryabinin@virtuozzo.com
Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Walter Wu <walter-zh.wu@mediatek.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In zs_destroy_pool() we call flush_work(&pool->free_work). However, we
have no guarantee that migration isn't happening in the background at
that time.
Since migration can't directly free pages, it relies on free_work being
scheduled to free the pages. But there's nothing preventing an
in-progress migrate from queuing the work *after*
zs_unregister_migration() has called flush_work(). Which would mean
pages still pointing at the inode when we free it.
Since we know at destroy time all objects should be free, no new
migrations can come in (since zs_page_isolate() fails for fully-free
zspages). This means it is sufficient to track a "# isolated zspages"
count by class, and have the destroy logic ensure all such pages have
drained before proceeding. Keeping that state under the class spinlock
keeps the logic straightforward.
In this case a memory leak could lead to an eventual crash if compaction
hits the leaked page. This crash would only occur if people are
changing their zswap backend at runtime (which eventually starts
destruction).
Link: http://lkml.kernel.org/r/20190809181751.219326-2-henryburns@google.com
Fixes: 48b4800a1c6a ("zsmalloc: page migration support")
Signed-off-by: Henry Burns <henryburns@google.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In zs_page_migrate() we call putback_zspage() after we have finished
migrating all pages in this zspage. However, the return value is
ignored. If a zs_free() races in between zs_page_isolate() and
zs_page_migrate(), freeing the last object in the zspage,
putback_zspage() will leave the page in ZS_EMPTY for potentially an
unbounded amount of time.
To fix this, we need to do the same thing as zs_page_putback() does:
schedule free_work to occur.
To avoid duplicated code, move the sequence to a new
putback_zspage_deferred() function which both zs_page_migrate() and
zs_page_putback() call.
Link: http://lkml.kernel.org/r/20190809181751.219326-1-henryburns@google.com
Fixes: 48b4800a1c6a ("zsmalloc: page migration support")
Signed-off-by: Henry Burns <henryburns@google.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
THP splitting path is missing the split_page_owner() call that
split_page() has.
As a result, split THP pages are wrongly reported in the page_owner file
as order-9 pages. Furthermore when the former head page is freed, the
remaining former tail pages are not listed in the page_owner file at
all. This patch fixes that by adding the split_page_owner() call into
__split_huge_page().
Link: http://lkml.kernel.org/r/20190820131828.22684-2-vbabka@suse.cz
Fixes: a9627bc5e34e ("mm/page_owner: introduce split_page_owner and replace manual handling")
Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
userfaultfd_release() should clear vm_flags/vm_userfaultfd_ctx even if
mm->core_state != NULL.
Otherwise a page fault can see userfaultfd_missing() == T and use an
already freed userfaultfd_ctx.
Link: http://lkml.kernel.org/r/20190820160237.GB4983@redhat.com
Fixes: 04f5866e41fb ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Only when calling the poll syscall the first time can user receive
POLLPRI correctly. After that, user always fails to acquire the event
signal.
Reproduce case:
1. Get the monitor code in Documentation/accounting/psi.txt
2. Run it, and wait for the event triggered.
3. Kill and restart the process.
The question is why we can end up with poll_scheduled = 1 but the work
not running (which would reset it to 0). And the answer is because the
scheduling side sees group->poll_kworker under RCU protection and then
schedules it, but here we cancel the work and destroy the worker. The
cancel needs to pair with resetting the poll_scheduled flag.
Link: http://lkml.kernel.org/r/1566357985-97781-1-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Jason Xing <kerneljasonxing@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Similar to vmstats, percpu caching of local vmevents leads to an
accumulation of errors on non-leaf levels. This happens because some
leftovers may remain in percpu caches, so that they are never propagated
up by the cgroup tree and just disappear into nonexistence with on
releasing of the memory cgroup.
To fix this issue let's accumulate and propagate percpu vmevents values
before releasing the memory cgroup similar to what we're doing with
vmstats.
Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
only over online cpus.
Link: http://lkml.kernel.org/r/20190819202338.363363-4-guro@fb.com
Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Percpu caching of local vmstats with the conditional propagation by the
cgroup tree leads to an accumulation of errors on non-leaf levels.
Let's imagine two nested memory cgroups A and A/B. Say, a process
belonging to A/B allocates 100 pagecache pages on the CPU 0. The percpu
cache will spill 3 times, so that 32*3=96 pages will be accounted to A/B
and A atomic vmstat counters, 4 pages will remain in the percpu cache.
Imagine A/B is nearby memory.max, so that every following allocation
triggers a direct reclaim on the local CPU. Say, each such attempt will
free 16 pages on a new cpu. That means every percpu cache will have -16
pages, except the first one, which will have 4 - 16 = -12. A/B and A
atomic counters will not be touched at all.
Now a user removes A/B. All percpu caches are freed and corresponding
vmstat numbers are forgotten. A has 96 pages more than expected.
As memory cgroups are created and destroyed, errors do accumulate. Even
1-2 pages differences can accumulate into large numbers.
To fix this issue let's accumulate and propagate percpu vmstat values
before releasing the memory cgroup. At this point these numbers are
stable and cannot be changed.
Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
only over online cpus.
Link: http://lkml.kernel.org/r/20190819202338.363363-2-guro@fb.com
Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 0cfaee2af3a0 ("include/asm-generic/5level-fixup.h: fix variable
'p4d' set but not used") converted a few functions from macros to static
inline, which causes parisc to complain,
In file included from include/asm-generic/4level-fixup.h:38:0,
from arch/parisc/include/asm/pgtable.h:5,
from arch/parisc/include/asm/io.h:6,
from include/linux/io.h:13,
from sound/core/memory.c:9:
include/asm-generic/5level-fixup.h:14:18: error: unknown type name 'pgd_t'; did you mean 'pid_t'?
#define p4d_t pgd_t
^
include/asm-generic/5level-fixup.h:24:28: note: in expansion of macro 'p4d_t'
static inline int p4d_none(p4d_t p4d)
^~~~~
It is because "4level-fixup.h" is included before "asm/page.h" where
"pgd_t" is defined.
Link: http://lkml.kernel.org/r/20190815205305.1382-1-cai@lca.pw
Fixes: 0cfaee2af3a0 ("include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used")
Signed-off-by: Qian Cai <cai@lca.pw>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
After commit 907ec5fca3dc ("mm: zero remaining unavailable struct
pages"), struct page of reserved memory is zeroed. This causes
page->flags to be 0 and fixes issues related to reading
/proc/kpageflags, for example, of reserved memory.
The VM_BUG_ON() in move_freepages_block(), however, assumes that
page_zone() is meaningful even for reserved memory. That assumption is
no longer true after the aforementioned commit.
There's no reason why move_freepages_block() should be testing the
legitimacy of page_zone() for reserved memory; its scope is limited only
to pages on the zone's freelist.
Note that pfn_valid() can be true for reserved memory: there is a
backing struct page. The check for page_to_nid(page) is also buggy but
reserved memory normally only appears on node 0 so the zeroing doesn't
affect this.
Move the debug checks to after verifying PageBuddy is true. This
isolates the scope of the checks to only be for buddy pages which are on
the zone's freelist which move_freepages_block() is operating on. In
this case, an incorrect node or zone is a bug worthy of being warned
about (and the examination of struct page is acceptable bcause this
memory is not reserved).
Why does move_freepages_block() gets called on reserved memory? It's
simply math after finding a valid free page from the per-zone free area
to use as fallback. We find the beginning and end of the pageblock of
the valid page and that can bring us into memory that was reserved per
the e820. pfn_valid() is still true (it's backed by a struct page), but
since it's zero'd we shouldn't make any inferences here about comparing
its node or zone. The current node check just happens to succeed most
of the time by luck because reserved memory typically appears on node 0.
The fix here is to validate that we actually have buddy pages before
testing if there's any type of zone or node strangeness going on.
We noticed it almost immediately after bringing 907ec5fca3dc in on
CONFIG_DEBUG_VM builds. It depends on finding specific free pages in
the per-zone free area where the math in move_freepages() will bring the
start or end pfn into reserved memory and wanting to claim that entire
pageblock as a new migratetype. So the path will be rare, require
CONFIG_DEBUG_VM, and require fallback to a different migratetype.
Some struct pages were already zeroed from reserve pages before
907ec5fca3c so it theoretically could trigger before this commit. I
think it's rare enough under a config option that most people don't run
that others may not have noticed. I wouldn't argue against a stable tag
and the backport should be easy enough, but probably wouldn't single out
a commit that this is fixing.
Mel said:
: The overhead of the debugging check is higher with this patch although
: it'll only affect debug builds and the path is not particularly hot.
: If this was a concern, I think it would be reasonable to simply remove
: the debugging check as the zone boundaries are checked in
: move_freepages_block and we never expect a zone/node to be smaller than
: a pageblock and stuck in the middle of another zone.
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In z3fold_destroy_pool() we call destroy_workqueue(&pool->compact_wq).
However, we have no guarantee that migration isn't happening in the
background at that time.
Migration directly calls queue_work_on(pool->compact_wq), if destruction
wins that race we are using a destroyed workqueue.
Link: http://lkml.kernel.org/r/20190809213828.202833-1-henryburns@google.com
Signed-off-by: Henry Burns <henryburns@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"Here is a (hopefully last) set of GPIO fixes for the v5.3 kernel
cycle. Two are pretty core:
- Fix not reporting open drain/source lines to userspace as "input"
- Fix a minor build error found in randconfigs
- Fix a chip select quirk on the Freescale SPI
- Fix the irqchip initialization semantic order to reflect what it
was using the old API"
* tag 'gpio-v5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: Fix irqchip initialization order
gpio: of: fix Freescale SPI CS quirk handling
gpio: Fix build error of function redefinition
gpiolib: never report open-drain/source lines as 'input' to user-space
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull Hyper-V fixes from Sasha Levin:
- Fix for panics and network failures on PAE guests by Dexuan Cui.
- Fix of a memory leak (and related cleanups) in the hyper-v keyboard
driver by Dexuan Cui.
- Code cleanups for hyper-v clocksource driver during the merge window
by Dexuan Cui.
- Fix for a false positive warning in the userspace hyper-v KVP store
by Vitaly Kuznetsov.
* tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
Drivers: hv: vmbus: Fix virt_to_hvpfn() for X86_PAE
Tools: hv: kvp: eliminate 'may be used uninitialized' warning
Input: hyperv-keyboard: Use in-place iterator API in the channel callback
Drivers: hv: vmbus: Remove the unused "tsc_page" from struct hv_context
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Two KVM/arm fixes for MMIO emulation and UBSAN.
Unusually, we're routing them via the arm64 tree as per Paolo's
request on the list:
https://lore.kernel.org/kvm/21ae69a2-2546-29d0-bff6-2ea825e3d968@redhat.com/
We don't actually have any other arm64 fixes pending at the moment
(touch wood), so I've pulled from Marc, written a merge commit, tagged
the result and run it through my build/boot/bisect scripts"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
KVM: arm/arm64: VGIC: Properly initialise private IRQ affinity
KVM: arm/arm64: Only skip MMIO insn once
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Four fixes, three for edge conditions which don't occur very often.
The lpfc fix mitigates memory exhaustion for some high CPU systems"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: lpfc: Mitigate high memory pre-allocation by SCSI-MQ
scsi: ufs: Fix NULL pointer dereference in ufshcd_config_vreg_hpm()
scsi: target: tcmu: avoid use-after-free after command timeout
scsi: qla2xxx: Fix gnl.l memory leak on adapter init failure
|
|
Pull xfs fix from Darrick Wong:
"A single patch that fixes a xfs lockup problem when a chown/chgrp
operation fails due to running out of quota. It has survived the usual
xfstests runs and merges cleanly with this morning's master:
- Fix a forgotten inode unlock when chown/chgrp fail due to quota"
* tag 'xfs-5.3-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOT
|
|
Pull more drm fixes from Dave Airlie:
"Although the tree built for me fine on arm here, it appears either
header cleanups in next or some kconfig combo it breaks, so this
contains a fix to mediatek to include dma-mapping.h explicitly.
There was also one nouveau fix that came in late that I was going to
leave until next week, but since I was sending this I thought it may
as well be in here:
mediatek:
- fix build in some cases
nouveau:
- fix hang with i2c and mst docks"
* tag 'drm-fixes-2019-08-24' of git://anongit.freedesktop.org/drm/drm:
drm/mediatek: include dma-mapping header
drm/nouveau: Don't retry infinitely when receiving no data on i2c over AUX
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm/fixes
Pull KVM/arm fixes from Marc Zyngier as per Paulo's request at:
https://lkml.kernel.org/r/21ae69a2-2546-29d0-bff6-2ea825e3d968@redhat.com
"One (hopefully last) set of fixes for KVM/arm for 5.3: an embarassing
MMIO emulation regression, and a UBSAN splat. Oh well...
- Don't overskip instructions on MMIO emulation
- Fix UBSAN splat when initializing PPI priorities"
* tag 'kvmarm-fixes-for-5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm:
KVM: arm/arm64: VGIC: Properly initialise private IRQ affinity
KVM: arm/arm64: Only skip MMIO insn once
|
|
Although it builds fine here in my arm cross compile, it seems
either via some other patches in -next or some Kconfig combination,
this fails to build for everyone.
Include linux/dma-mapping.h should fix it.
Signed-off-by: Dave Airlie <airlied@redhat.com>
|