Age | Commit message (Collapse) | Author |
|
Michael Ellerman reported that emulate_loadstore() was trying to
access element 32 of regs->gpr[], which doesn't exist, when
emulating a string store instruction. This is because the string
load and store instructions (lswi, lswx, stswi and stswx) are
defined to wrap around from register 31 to register 0 if the number
of bytes being loaded or stored is sufficiently large. This wrapping
was not implemented in the emulation code. To fix it, we mask the
register number after incrementing it.
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Fixes: c9f6f4ed95d4 ("powerpc: Implement emulation of string loads and stores")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This adds emulation for the lfiwax, lfiwzx and stfiwx instructions.
This necessitated adding a new flag to indicate whether a floating
point or an integer conversion was needed for LOAD_FP and STORE_FP,
so this moves the size field in op->type up 4 bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This replaces almost all of the instruction emulation code in
fix_alignment() with calls to analyse_instr(), emulate_loadstore()
and emulate_dcbz(). The only emulation code left is the SPE
emulation code; analyse_instr() etc. do not handle SPE instructions
at present.
One result of this is that we can now handle alignment faults on
all the new VSX load and store instructions that were added in POWER9.
VSX loads/stores will take alignment faults for unaligned accesses
to cache-inhibited memory.
Another effect is that we no longer rely on the DAR and DSISR values
set by the processor.
With this, we now need to include the instruction emulation code
unconditionally.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This moves the parts of emulate_step() that deal with emulating
load and store instructions into a new function called
emulate_loadstore(). This is to make it possible to reuse this
code in the alignment handler.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This adds code to the load and store emulation code to byte-swap
the data appropriately when the process being emulated is set to
the opposite endianness to that of the kernel.
This also enables the emulation for the multiple-register loads
and stores (lmw, stmw, lswi, stswi, lswx, stswx) to work for
little-endian. In little-endian mode, the partial word at the
end of a transfer for lsw*/stsw* (when the byte count is not a
multiple of 4) is loaded/stored at the least-significant end of
the register. Additionally, this fixes a bug in the previous
code in that it could call read_mem/write_mem with a byte count
that was not 1, 2, 4 or 8.
Note that this only works correctly on processors with "true"
little-endian mode, such as IBM POWER processors from POWER6 on, not
the so-called "PowerPC" little-endian mode that uses address swizzling
as implemented on the old 32-bit 603, 604, 740/750, 74xx CPUs.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This adds code to the instruction emulation code to set regs->dar
to the address of any memory access that fails. This address is
not necessarily the same as the effective address of the instruction,
because if the memory access is unaligned, it might cross a page
boundary and fault on the second page.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This adds code to analyse_instr() and emulate_step() to understand the
dcbz (data cache block zero) instruction. The emulate_dcbz() function
is made public so it can be used by the alignment handler in future.
(The apparently unnecessary cropping of the address to 32 bits is
there because it will be needed in that situation.)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This adds lfdp[x] and stfdp[x] to the set of instructions that
analyse_instr() and emulate_step() understand.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This adds code to analyse_instr() and emulate_step() to handle the
vector element loads and stores:
lvebx, lvehx, lvewx, stvebx, stvehx, stvewx.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
At present, the analyse_instr/emulate_step code checks for the
relevant MSR_FP/VEC/VSX bit being set when a FP/VMX/VSX load
or store is decoded, but doesn't recheck the bit before reading or
writing the relevant FP/VMX/VSX register in emulate_step().
Since we don't have preemption disabled, it is possible that we get
preempted between checking the MSR bit and doing the register access.
If that happened, then the registers would have been saved to the
thread_struct for the current process. Accesses to the CPU registers
would then potentially read stale values, or write values that would
never be seen by the user process.
Another way that the registers can become non-live is if a page
fault occurs when accessing user memory, and the page fault code
calls a copy routine that wants to use the VMX or VSX registers.
To fix this, the code for all the FP/VMX/VSX loads gets restructured
so that it forms an image in a local variable of the desired register
contents, then disables preemption, checks the MSR bit and either
sets the CPU register or writes the value to the thread struct.
Similarly, the code for stores checks the MSR bit, copies either the
CPU register or the thread struct to a local variable, then reenables
preemption and then copies the register image to memory.
If the instruction being emulated is in the kernel, then we must not
use the register values in the thread_struct. In this case, if the
relevant MSR enable bit is not set, then emulate_step refuses to
emulate the instruction.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
At the moment, emulation of loads and stores of up to 8 bytes to
unaligned addresses on a little-endian system uses a sequence of
single-byte loads or stores to memory. This is rather inefficient,
and the code is hard to follow because it has many ifdefs.
In addition, the Power ISA has requirements on how unaligned accesses
are performed, which are not met by doing all accesses as
sequences of single-byte accesses.
Emulation of VSX loads and stores uses __copy_{to,from}_user,
which means the emulation code has no control on the size of
accesses.
To simplify this, we add new copy_mem_in() and copy_mem_out()
functions for accessing memory. These use a sequence of the largest
possible aligned accesses, up to 8 bytes (or 4 on 32-bit systems),
to copy memory between a local buffer and user memory. We then
rewrite {read,write}_mem_unaligned and the VSX load/store
emulation using these new functions.
These new functions also simplify the code in do_fp_load() and
do_fp_store() for the unaligned cases.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The addpcis instruction puts the sum of the next instruction address
plus a constant into a register. Since the result depends on the
address of the instruction, it will give an incorrect result if it
is single-stepped out of line, which is what the *probes subsystem
will currently do if a probe is placed on an addpcis instruction.
This fixes the problem by adding emulation of it to analyse_instr().
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The architecture shows the least-significant bit of the instruction
word as reserved for the popcnt[bwd], prty[wd] and bpermd
instructions, that is, these instructions never update CR0.
Therefore this changes the emulation of these instructions to
skip the CR0 update.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The case added for the isel instruction was added inside a switch
statement which uses the 10-bit minor opcode field in the 0x7fe
bits of the instruction word. However, for the isel instruction,
the minor opcode field is only the 0x3e bits, and the 0x7c0 bits
are used for the "BC" field, which indicates which CR bit to use
to select the result.
Therefore, for the isel emulation to work correctly when BC != 0,
we need to match on ((instr >> 1) & 0x1f) == 15). To do this, we
pull the isel case out of the switch statement and put it in an
if statement of its own.
Fixes: e27f71e5ff3c ("powerpc/lib/sstep: Add isel instruction emulation")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
When a 64-bit processor is executing in 32-bit mode, the update forms
of load and store instructions are required by the architecture to
write the full 64-bit effective address into the RA register, though
only the bottom 32 bits are used to address memory. Currently,
the instruction emulation code writes the truncated address to the
RA register. This fixes it by keeping the full 64-bit EA in the
instruction_op structure, truncating the address in emulate_step()
where it is used to address memory, rather than in the address
computations in analyse_instr().
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This extends the instruction emulation infrastructure in sstep.c to
handle all the load and store instructions defined in the Power ISA
v3.0, except for the atomic memory operations, ldmx (which was never
implemented), lfdp/stfdp, and the vector element load/stores.
The instructions added are:
Integer loads and stores: lbarx, lharx, lqarx, stbcx., sthcx., stqcx.,
lq, stq.
VSX loads and stores: lxsiwzx, lxsiwax, stxsiwx, lxvx, lxvl, lxvll,
lxvdsx, lxvwsx, stxvx, stxvl, stxvll, lxsspx, lxsdx, stxsspx, stxsdx,
lxvw4x, lxsibzx, lxvh8x, lxsihzx, lxvb16x, stxvw4x, stxsibx, stxvh8x,
stxsihx, stxvb16x, lxsd, lxssp, lxv, stxsd, stxssp, stxv.
These instructions are handled both in the analyse_instr phase and in
the emulate_step phase.
The code for lxvd2ux and stxvd2ux has been taken out, as those
instructions were never implemented in any processor and have been
taken out of the architecture, and their opcodes have been reused for
other instructions in POWER9 (lxvb16x and stxvb16x).
The emulation for the VSX loads and stores uses helper functions
which don't access registers or memory directly, which can hopefully
be reused by KVM later.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This removes the checks for the FP/VMX/VSX enable bits in the MSR
from analyse_instr() and adds them to emulate_step() instead.
The reason for this is that we may want to use analyse_instr() in
a situation where the FP/VMX/VSX register values are stored in the
current thread_struct and the FP/VMX/VSX enable bits in the MSR
image in the pt_regs are zero. Since analyse_instr() doesn't make
any changes to register state, it is reasonable for it to indicate
what the effect of an instruction would be even though the relevant
enable bit is off.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The analyse_instr function currently doesn't just work out what an
instruction does, it also executes those instructions whose effect
is only to update CPU registers that are stored in struct pt_regs.
This is undesirable because optprobes uses analyse_instr to work out
if an instruction could be successfully emulated in future.
This changes analyse_instr so it doesn't modify *regs; instead it
stores information in the instruction_op structure to indicate what
registers (GPRs, CR, XER, LR) would be set and what value they would
be set to. A companion function called emulate_update_regs() can
then use that information to update a pt_regs struct appropriately.
As a minor cleanup, this replaces inline asm using the cntlzw and
cntlzd instructions with calls to __builtin_clz() and __builtin_clzl().
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
bitmap_resize() does not work for file-backed bitmaps.
The buffer_heads are allocated and initialized when
the bitmap is read from the file, but resize doesn't
read from the file, it loads from the internal bitmap.
When it comes time to write the new bitmap, the bh is
non-existent and we crash.
The common case when growing an array involves making the array larger,
and that normally means making the bitmap larger. Doing
that inside the kernel is possible, but would need more code.
It is probably easier to require people who use file-backed
bitmaps to remove them and re-add after a reshape.
So this patch disables the resizing of arrays which have
file-backed bitmaps. This is better than crashing.
Reported-by: Zhilong Liu <zlliu@suse.com>
Fixes: d60b479d177a ("md/bitmap: add bitmap_resize function to allow bitmap resizing.")
Cc: stable@vger.kernel.org (v3.5+).
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
|
When mounting to older servers, such as Windows XP (or even Windows 7),
the limited error messages that can be passed back to user space can
get confusing since the default dialect has changed from SMB1 (CIFS) to
more secure SMB3 dialect. Log additional information when the user chooses
to use the default dialects and when the server does not support the
dialect requested.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
David Ahern says:
====================
bpf: Add option to set mark and priority in cgroup sock programs
Add option to set mark and priority in addition to bound device for newly
created sockets. Also, allow the bpf programs to use the get_current_uid_gid
helper meaning socket marks, priority and device can be set based on the
uid/gid of the running process.
Sample programs are updated to demonstrate the new options.
v3
- no changes to Patches 1 and 2 which Alexei acked in previous versions
- dropped change related to recursive programs in a cgroup
- updated tests per dropped patch
v2
- added flag to control recursive behavior as requested by Alexei
- added comment to sock_filter_func_proto regarding use of
get_current_uid_gid helper
- updated test programs for recursive option
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Update cgrp2 bpf sock tests to check that device, mark and priority
can all be set on a socket via bpf programs attached to a cgroup.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add option to dump socket settings. Will be used in the next patch
to verify bpf programs are correctly setting mark, priority and
device based on the cgroup attachment for the program run.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add option to detach programs from a cgroup.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Update sock test to set mark and priority on socket create.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow BPF programs run on sock create to use the get_current_uid_gid
helper. IPv4 and IPv6 sockets are created in a process context so
there is always a valid uid/gid
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add socket mark and priority to fields that can be set by
ebpf program when a socket is created.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Two cifs bug fixes for stable"
* tag 'cifs-fixes-for-4.13-rc7-and-stable' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: remove endian related sparse warning
CIFS: Fix maximum SMB2 header size
|
|
Pull block fixes from Jens Axboe:
"Unfortunately a few issues that warrant sending another pull request,
even if I had hoped to avoid it. This contains:
- A fix for multiqueue xen-blkback, on tear down / disconnect.
- A few fixups for NVMe, including a wrong bit definition, fix for
host memory buffers, and an nvme rdma page size fix"
* 'for-linus' of git://git.kernel.dk/linux-block:
nvme: fix the definition of the doorbell buffer config support bit
nvme-pci: use dma memory for the host memory buffer descriptors
nvme-rdma: default MR page size to 4k
xen-blkback: stop blkback thread of every queue in xen_blkif_disconnect
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- A couple fixes for bugs introduced as part of the blk_status_t block
layer changes during the 4.13 merge window
- A printk throttling fix to use discrete rate limiting state for each
DM log level
- A stable@ fix for DM multipath that delays request requeueing to
avoid CPU lockup if/when the request queue is "dying"
* tag 'for-4.13/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm mpath: do not lock up a CPU with requeuing activity
dm: fix printk() rate limiting code
dm mpath: retry BLK_STS_RESOURCE errors
dm: fix the second dec_pending() argument in __split_and_process_bio()
|
|
Merge more fixes from Andrew Morton:
"6 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
scripts/dtc: fix '%zx' warning
include/linux/compiler.h: don't perform compiletime_assert with -O0
mm, madvise: ensure poisoned pages are removed from per-cpu lists
mm, uprobes: fix multiple free of ->uprobes_state.xol_area
kernel/kthread.c: kthread_worker: don't hog the cpu
mm,page_alloc: don't call __node_reclaim() with oom_lock held.
|
|
Merge mmu_notifier fixes from Jérôme Glisse:
"The invalidate_page callback suffered from 2 pitfalls. First it used
to happen after page table lock was release and thus a new page might
have been setup for the virtual address before the call to
invalidate_page().
This is in a weird way fixed by commit c7ab0d2fdc84 ("mm: convert
try_to_unmap_one() to use page_vma_mapped_walk()") which moved the
callback under the page table lock. Which also broke several existing
user of the mmu_notifier API that assumed they could sleep inside this
callback.
The second pitfall was invalidate_page being the only callback not
taking a range of address in respect to invalidation but was giving an
address and a page. Lot of the callback implementer assumed this could
never be THP and thus failed to invalidate the appropriate range for
THP pages.
By killing this callback we unify the mmu_notifier callback API to
always take a virtual address range as input.
There is now two clear API (I am not mentioning the youngess API which
is seldomly used):
- invalidate_range_start()/end() callback (which allow you to sleep)
- invalidate_range() where you can not sleep but happen right after
page table update under page table lock
Note that a lot of existing user feels broken in respect to
range_start/ range_end. Many user only have range_start() callback but
there is nothing preventing them to undo what was invalidated in their
range_start() callback after it returns but before any CPU page table
update take place.
The code pattern use in kvm or umem odp is an example on how to
properly avoid such race. In a nutshell use some kind of sequence
number and active range invalidation counter to block anything that
might undo what the range_start() callback did.
If you do not care about keeping fully in sync with CPU page table (ie
you can live with CPU page table pointing to new different page for a
given virtual address) then you can take a reference on the pages
inside the range_start callback and drop it in range_end or when your
driver is done with those pages.
Last alternative is to use invalidate_range() if you can do
invalidation without sleeping as invalidate_range() callback happens
under the CPU page table spinlock right after the page table is
updated.
The first two patches convert existing mmu_notifier_invalidate_page()
calls to mmu_notifier_invalidate_range() and bracket those call with
call to mmu_notifier_invalidate_range_start()/end().
The next ten patches remove existing invalidate_page() callback as it
can no longer happen.
Finally the last page remove the invalidate_page() callback completely
so it can RIP.
Changes since v1:
- remove more dead code in kvm (no testing impact)
- more accurate end address computation (patch 2) in page_mkclean_one
and try_to_unmap_one
- added tested-by/reviewed-by gotten so far"
* emailed patches from Jérôme Glisse <jglisse@redhat.com>:
mm/mmu_notifier: kill invalidate_page
KVM: update to new mmu_notifier semantic v2
xen/gntdev: update to new mmu_notifier semantic
sgi-gru: update to new mmu_notifier semantic
misc/mic/scif: update to new mmu_notifier semantic
iommu/intel: update to new mmu_notifier semantic
iommu/amd: update to new mmu_notifier semantic
IB/hfi1: update to new mmu_notifier semantic
IB/umem: update to new mmu_notifier semantic
drm/amdgpu: update to new mmu_notifier semantic
powerpc/powernv: update to new mmu_notifier semantic
mm/rmap: update to new mmu_notifier semantic v2
dax: update to new mmu_notifier semantic
|
|
jfs had previously avoided the use of MAX_LFS_FILESIZE because it hadn't
accounted for the whole 32-bit index range on 32-bit systems. That has
been fixed by commit 0cc3b0ec23ce ("Clarify (and fix) MAX_LFS_FILESIZE
macros"), so we can simplify the code now.
Suggested by Andreas Dilger.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: jfs-discussion@lists.sourceforge.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
dtc uses an incorrect format specifier for printing a uint64_t value.
uint64_t may be either 'unsigned long' or 'unsigned long long' depending
on the host architecture.
Fix this by using %llx and casting to unsigned long long, which ensures
that we always have a wide enough variable to print 64 bits of hex.
HOSTCC scripts/dtc/checks.o
scripts/dtc/checks.c: In function 'check_simple_bus_reg':
scripts/dtc/checks.c:876:2: warning: format '%zx' expects argument of type 'size_t', but argument 4 has type 'uint64_t' [-Wformat=]
snprintf(unit_addr, sizeof(unit_addr), "%zx", reg);
^
scripts/dtc/checks.c:876:2: warning: format '%zx' expects argument of type 'size_t', but argument 4 has type 'uint64_t' [-Wformat=]
Link: http://lkml.kernel.org/r/20170829222034.GJ20805@n2100.armlinux.org.uk
Fixes: 828d4cdd012c ("dtc: check.c fix compile error")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michal Marek <mmarek@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit c7acec713d14 ("kernel.h: handle pointers to arrays better in
container_of()") made use of __compiletime_assert() from container_of()
thus increasing the usage of this macro, allowing developers to notice
type conflicts in usage of container_of() at compile time.
However, the implementation of __compiletime_assert relies on compiler
optimizations to report an error. This means that if a developer uses
"-O0" with any code that performs container_of(), the compiler will always
report an error regardless of whether there is an actual problem in the
code.
This patch disables compile_time_assert when optimizations are disabled to
allow such code to compile with CFLAGS="-O0".
Example compilation failure:
./include/linux/compiler.h:547:38: error: call to `__compiletime_assert_94' declared with attribute error: pointer type mismatch in container_of()
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^
./include/linux/compiler.h:530:4: note: in definition of macro `__compiletime_assert'
prefix ## suffix(); \
^~~~~~
./include/linux/compiler.h:547:2: note: in expansion of macro `_compiletime_assert'
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^~~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:46:37: note: in expansion of macro `compiletime_assert'
#define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
^~~~~~~~~~~~~~~~~~
./include/linux/kernel.h:860:2: note: in expansion of macro `BUILD_BUG_ON_MSG'
BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
^~~~~~~~~~~~~~~~
[akpm@linux-foundation.org: use do{}while(0), per Michal]
Link: http://lkml.kernel.org/r/20170829230114.11662-1-joe@ovn.org
Fixes: c7acec713d14c6c ("kernel.h: handle pointers to arrays better in container_of()")
Signed-off-by: Joe Stringer <joe@ovn.org>
Cc: Ian Abbott <abbotti@mev.co.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Wendy Wang reported off-list that a RAS HWPOISON-SOFT test case failed
and bisected it to the commit 479f854a207c ("mm, page_alloc: defer
debugging checks of pages allocated from the PCP").
The problem is that a page that was poisoned with madvise() is reused.
The commit removed a check that would trigger if DEBUG_VM was enabled
but re-enabling the check only fixes the problem as a side-effect by
printing a bad_page warning and recovering.
The root of the problem is that an madvise() can leave a poisoned page
on the per-cpu list. This patch drains all per-cpu lists after pages
are poisoned so that they will not be reused. Wendy reports that the
test case in question passes with this patch applied. While this could
be done in a targeted fashion, it is over-complicated for such a rare
operation.
Link: http://lkml.kernel.org/r/20170828133414.7qro57jbepdcyz5x@techsingularity.net
Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Wang, Wendy <wendy.wang@intel.com>
Tested-by: Wang, Wendy <wendy.wang@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Hansen, Dave" <dave.hansen@intel.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for
write killable") made it possible to kill a forking task while it is
waiting to acquire its ->mmap_sem for write, in dup_mmap().
However, it was overlooked that this introduced an new error path before
the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
being copied from the old mm_struct by the memcpy in dup_mm(). For a
task that has previously hit a uprobe tracepoint, this resulted in the
'struct xol_area' being freed multiple times if the task was killed at
just the right time while forking.
Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
than in uprobe_dup_mmap().
With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
program given by commit 2b7e8665b4ff ("fork: fix incorrect fput of
->exe_file causing use-after-free"), provided that a uprobe tracepoint
has been set on the fork_thread() function. For example:
$ gcc reproducer.c -o reproducer -lpthread
$ nm reproducer | grep fork_thread
0000000000400719 t fork_thread
$ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
$ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
$ ./reproducer
Here is the use-after-free reported by KASAN:
BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
Read of size 8 at addr ffff8800320a8b88 by task reproducer/198
CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f3fb5 #255
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
Call Trace:
dump_stack+0xdb/0x185
print_address_description+0x7e/0x290
kasan_report+0x23b/0x350
__asan_report_load8_noabort+0x19/0x20
uprobe_clear_state+0x1c4/0x200
mmput+0xd6/0x360
do_exit+0x740/0x1670
do_group_exit+0x13f/0x380
get_signal+0x597/0x17d0
do_signal+0x99/0x1df0
exit_to_usermode_loop+0x166/0x1e0
syscall_return_slowpath+0x258/0x2c0
entry_SYSCALL_64_fastpath+0xbc/0xbe
...
Allocated by task 199:
save_stack_trace+0x1b/0x20
kasan_kmalloc+0xfc/0x180
kmem_cache_alloc_trace+0xf3/0x330
__create_xol_area+0x10f/0x780
uprobe_notify_resume+0x1674/0x2210
exit_to_usermode_loop+0x150/0x1e0
prepare_exit_to_usermode+0x14b/0x180
retint_user+0x8/0x20
Freed by task 199:
save_stack_trace+0x1b/0x20
kasan_slab_free+0xa8/0x1a0
kfree+0xba/0x210
uprobe_clear_state+0x151/0x200
mmput+0xd6/0x360
copy_process.part.8+0x605f/0x65d0
_do_fork+0x1a5/0xbd0
SyS_clone+0x19/0x20
do_syscall_64+0x22f/0x660
return_from_SYSCALL_64+0x0/0x7a
Note: without KASAN, you may instead see a "Bad page state" message, or
simply a general protection fault.
Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com
Fixes: 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If the worker thread continues getting work, it will hog the cpu and rcu
stall complains. Make it a good citizen. This is triggered in a loop
block device test.
Link: http://lkml.kernel.org/r/5de0a179b3184e1a2183fc503448b0269f24d75b.1503697127.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We are doing a last second memory allocation attempt before calling
out_of_memory(). But since slab shrinker functions might indirectly
wait for other thread's __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
allocations via sleeping locks, calling slab shrinker functions from
node_reclaim() from get_page_from_freelist() with oom_lock held has
possibility of deadlock. Therefore, make sure that last second memory
allocation attempt does not call slab shrinker functions.
Link: http://lkml.kernel.org/r/1503577106-9196-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The invalidate_page callback suffered from two pitfalls. First it used
to happen after the page table lock was release and thus a new page
might have setup before the call to invalidate_page() happened.
This is in a weird way fixed by commit c7ab0d2fdc84 ("mm: convert
try_to_unmap_one() to use page_vma_mapped_walk()") that moved the
callback under the page table lock but this also broke several existing
users of the mmu_notifier API that assumed they could sleep inside this
callback.
The second pitfall was invalidate_page() being the only callback not
taking a range of address in respect to invalidation but was giving an
address and a page. Lots of the callback implementers assumed this
could never be THP and thus failed to invalidate the appropriate range
for THP.
By killing this callback we unify the mmu_notifier callback API to
always take a virtual address range as input.
Finally this also simplifies the end user life as there is now two clear
choices:
- invalidate_range_start()/end() callback (which allow you to sleep)
- invalidate_range() where you can not sleep but happen right after
page table update under page table lock
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: axie <axie@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()
Remove now useless invalidate_page callback.
Changed since v1 (Linus Torvalds)
- remove now useless kvm_arch_mmu_notifier_invalidate_page()
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Tested-by: Adam Borowski <kilobyte@angband.pl>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()
Remove now useless invalidate_page callback.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: xen-devel@lists.xenproject.org (moderated for non-subscribers)
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()
Remove now useless invalidate_page callback.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()
Remove now useless invalidate_page callback.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()
Remove now useless invalidate_page callback.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: iommu@lists.linux-foundation.org
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()
Remove now useless invalidate_page callback.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()
Remove now useless invalidate_page callback.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: linux-rdma@vger.kernel.org
Cc: Dean Luick <dean.luick@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()
Remove now useless invalidate_page callback.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Leon Romanovsky <leonro@mellanox.com>
Cc: linux-rdma@vger.kernel.org
Cc: Artemy Kovalyov <artemyko@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()
Remove now useless invalidate_page callback.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|