|
If a bnx2x card is passed a GSO packet with a gso_size larger than
~9700 bytes, it will cause a firmware error that will bring the card
down:
bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f0)]MC assert!
bnx2x: [bnx2x_mc_assert:720(enP24p1s0f0)]XSTORM_ASSERT_LIST_INDEX 0x2
bnx2x: [bnx2x_mc_assert:736(enP24p1s0f0)]XSTORM_ASSERT_INDEX 0x0 = 0x00000000 0x25e43e47 0x00463e01 0x00010052
bnx2x: [bnx2x_mc_assert:750(enP24p1s0f0)]Chip Revision: everest3, FW Version: 7_13_1
... (dump of values continues) ...
Detect when the MAC length of a GSO packet is greater than the maximum
packet size (9700 bytes) and disable GSO in that case.
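A minimal sketch of that check, assuming a .ndo_features_check-style hook
and the skb_gso_validate_mac_len() helper introduced alongside this change
(illustrative, not the exact bnx2x hunk):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static netdev_features_t example_features_check(struct sk_buff *skb,
                                                struct net_device *dev,
                                                netdev_features_t features)
{
        /* if any resulting MAC-level segment would exceed the 9700-byte
         * firmware limit, clear the GSO bits so the stack segments the
         * packet in software instead
         */
        if (skb_is_gso(skb) && !skb_gso_validate_mac_len(skb, 9700))
                features &= ~NETIF_F_GSO_MASK;
        return features;
}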
Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If you take a GSO skb, and split it into packets, will the MAC
length (L2 + L3 + L4 headers + payload) of those packets be small
enough to fit within a given length?
Move skb_gso_mac_seglen() to skbuff.h with other related functions
like skb_gso_network_seglen() so we can use it, and then create
skb_gso_validate_mac_len to do the full calculation.
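Roughly, the per-segment MAC length works out as follows (a sketch; see
the patch for the authoritative helper):

static inline unsigned int skb_gso_mac_seglen(const struct sk_buff *skb)
{
        /* L2 + L3 header bytes, i.e. everything up to the transport header */
        unsigned int hdr_len = skb_transport_header(skb) - skb_mac_header(skb);

        /* plus the L4 header and payload of a single segment */
        return hdr_len + skb_gso_transport_seglen(skb);
}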
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We already hold a reference to the eventfd_ctx, which is sufficient;
there's no need to hold a reference to the struct file as well. So get
rid of vhost_dev->log_file.
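For context, a minimal sketch of the ownership model this relies on
(illustrative, not the vhost diff itself): eventfd_ctx_fdget() takes its
own reference on the eventfd context, so nothing else needs to pin the
underlying struct file.

#include <linux/eventfd.h>
#include <linux/err.h>

static struct eventfd_ctx *example_set_log_eventfd(int fd)
{
        struct eventfd_ctx *ctx;

        ctx = eventfd_ctx_fdget(fd);            /* grabs its own reference */
        if (IS_ERR(ctx))
                return ctx;

        /* ... later, signal readers with eventfd_signal(ctx, 1) and
         * release the reference with eventfd_ctx_put(ctx) ...
         */
        return ctx;
}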
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
|
|
We already hold a reference to the eventfd_ctx, which is sufficient;
there's no need to hold a reference to the struct file as well. So get
rid of vhost_virtqueue->error.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
|
|
We already hold a reference to the eventfd_ctx, which is sufficient;
there's no need to hold a reference to the struct file as well. So get
rid of vhost_virtqueue->call.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
|
|
Code cleanup change - moving from malloc & memset to calloc.
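For context, the shape of the cleanup (illustrative code, not the exact
hunk):

#include <stdlib.h>
#include <string.h>

/* before: allocate, then zero in a second step */
static int *alloc_counters_old(size_t n)
{
        int *buf = malloc(n * sizeof(*buf));

        if (buf)
                memset(buf, 0, n * sizeof(*buf));
        return buf;
}

/* after: calloc() zeroes the memory and also checks the n * size
 * multiplication for overflow
 */
static int *alloc_counters_new(size_t n)
{
        return calloc(n, sizeof(int));
}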
Signed-off-by: Peter Malone <peter.malone@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
As mentioned in drivers/base/core.c:
/*
* NOTE: _Never_ directly free @dev after calling this function, even
* if it returned an error! Always use put_device() to give up the
* reference initialized in this function instead.
*/
so we must not free vdev until vdev->vdev.dev.release is called.
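The resulting error-handling pattern looks roughly like this (a sketch
with illustrative names, not the exact virtio hunk):

#include <linux/device.h>

/* dev is embedded in a driver structure whose ->release() callback
 * frees the whole thing
 */
static int example_register(struct device *dev)
{
        int err;

        err = device_register(dev);
        if (err) {
                /* never kfree() the container here; ->release() does it
                 * once the last reference is gone
                 */
                put_device(dev);
                return err;
        }
        return 0;
}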
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
As mentioned in drivers/base/core.c:
/*
* NOTE: _Never_ directly free @dev after calling this function, even
* if it returned an error! Always use put_device() to give up the
* reference initialized in this function instead.
*/
so we must not free vp_dev until vp_dev->vdev.dev.release is called.
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
In order to let the caller do a simple cleanup, split device_register
into device_initialize and device_add. device_initialize always succeeds,
so the caller can always use put_device when register_virtio_device fails.
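A sketch of the split (illustrative; the real register_virtio_device()
differs in detail):

#include <linux/device.h>

static int example_register_split(struct device *dev)
{
        int err;

        device_initialize(dev);         /* cannot fail; takes the initial ref */

        /* ... driver-specific setup that may fail goes here ... */

        err = device_add(dev);
        if (err)
                put_device(dev);        /* valid at any point after initialize */
        return err;
}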
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Suggested-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
|
|
In commit ea5d404655ba ("vhost: fix release path lockdep checks"),
Michael added a flag to check whether we should hold a lock in
vhost_dev_cleanup(). However, in commit 47283bef7ed3 ("vhost: move
memory pointer to VQs") the RCU operations were replaced by a mutex,
so the no-longer-used `locked' parameter can be removed now.
Signed-off-by: Caspar Zhang <jinli.zjl@alibaba-inc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Commit 7235acdb1 changed the way work is flushed; the queued seq,
done seq, and flushing fields are no longer used. Remove them now.
Fixes: 7235acdb1 ("vhost: simplify work flushing")
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Print the capacity of the block device when the driver is probed. Many
users expect this since SCSI disks (sd) do it. Moreover, kernel dmesg
output is the primary source of troubleshooting information so it's
helpful to include the disk size there.
The capacity is already printed by virtio_blk when a resize event
occurs. Extract the code and reuse it from virtblk_probe().
This patch also adds the block device name to the message so it can be
correlated with a specific device:
virtio_blk virtio0: [vda] 20971520 512-byte logical blocks (10.7 GB/10.0 GiB)
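A sketch of what such a shared helper might look like (illustrative
names and details; the actual virtio_blk code reads the capacity from
the device config space and works on the disk's request queue):

#include <linux/blkdev.h>
#include <linux/genhd.h>
#include <linux/string_helpers.h>

static void example_report_capacity(struct gendisk *disk, u64 nblocks)
{
        unsigned int blk_size = queue_logical_block_size(disk->queue);
        char cap2[10], cap10[10];

        /* decimal (GB) and binary (GiB) renderings of the same size */
        string_get_size(nblocks, blk_size, STRING_UNITS_2, cap2, sizeof(cap2));
        string_get_size(nblocks, blk_size, STRING_UNITS_10, cap10, sizeof(cap10));

        pr_info("%s: %llu %u-byte logical blocks (%s/%s)\n",
                disk->disk_name, (unsigned long long)nblocks, blk_size,
                cap10, cap2);
}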
Cc: Rodrigo A B Freire <rfreire@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
No need to get into the submenu to disable all VIRTIO-related
config entries.
This makes it easier to disable all VIRTIO config options
without entering the submenu, and it also makes the enabled/disabled
state visible from the outer menu.
This is only intended to change menuconfig UI, not change
the config dependencies.
Signed-off-by: Vincent Legoll <vincent.legoll@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Topic branch for stable KVM clocksource under Hyper-V.
Thanks to Christoffer Dall for resolving the ARM conflict.
|
|
Replace the license short header with an SPDX identifier.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
|
|
Some headers are not needed since the driver can be built as a module.
Remove them.
While here, sort the headers alphabetically.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
|
|
On some laptops, such as the Dell Inspiron 7000 series, the tablet mode
switch is implemented in Intel ACPI, and the events to enter and exit
tablet mode are 0xCC and 0xCD.
Initialize the tablet/laptop mode to the correct value if the system
booted in tablet mode (or if the intel-vbtn module was loaded with the
device in tablet mode).
Cc: platform-driver-x86@vger.kernel.org
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: "Pali Rohár" <pali.rohar@gmail.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Stefan Brüns <stefan.bruens@rwth-aachen.de>
Signed-off-by: Marco Martin <notmart@gmail.com>
Reviewed-by: Mario Limonciello <mario.limonciello@dell.com>
Acked-by: Pali Rohár <pali.rohar@gmail.com>
[andy: fixed style of comments, indentation, and massaged commit message]
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
|
|
There is no longer a need for the buffer to be defined in the first
4GB of physical address space.
Furthermore, there may be race conditions with multiple functions
working on a module-wide buffer, causing incorrect results.
Fixes: 549b4930f057658dc50d8010e66219233119a4d8
Suggested-by: Pali Rohar <pali.rohar@gmail.com>
Signed-off-by: Mario Limonciello <mario.limonciello@dell.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
|
|
The recently sent patch 'platform/x86: intel_pmc_core: Remove unused
EXPORTED API' missed removing the header file
'arch/x86/include/asm/pmc_core.h', which was solely used to declare the
exported API 'intel_pmc_slp_s0_counter_read'. This patch removes that
leftover header.
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
|
|
The function inode_cmp_iversion{+raw} is counter-intuitive because it
returns true when the counters are different and false when they are
equal.
Rename it to inode_eq_iversion{+raw}, which returns true when the
counters are equal and false otherwise.
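Illustrative use of the renamed helper, showing that the truth value now
matches the name:

#include <linux/fs.h>
#include <linux/iversion.h>

static bool example_inode_unchanged(const struct inode *inode, u64 snapshot)
{
        /* previously spelled !inode_cmp_iversion(inode, snapshot) */
        return inode_eq_iversion(inode, snapshot);
}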
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
Fix a couple of issues in tools/testing/selftests/bpf/Makefile so that
the following command
make -C tools/testing/selftests/bpf OUTPUT=/home/yhs/tmp
can put the built results into a different directory.
Also add the built binary test_tcpbpf_user to the .gitignore file.
Fixes: 6882804c916b ("selftests/bpf: add a test for overlapping packet range checks")
Fixes: 9d1f15941967 ("bpf: move cgroup_helpers from samples/bpf/ to tools/testing/selftesting/bpf/")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
The undo loop condition on the error path would cause the i counter to
go below zero if an allocation failure happened with the first
(i.e. 0th) element of the array.
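The general unwind shape involved, as a sketch with hypothetical
alloc/free helpers (not the netdevsim code itself):

#include <linux/errno.h>

struct entry;
struct entry *alloc_entry(void);        /* hypothetical allocator */
void free_entry(struct entry *e);       /* hypothetical destructor */

static int example_alloc_all(struct entry **arr, int n)
{
        int i;

        for (i = 0; i < n; i++) {
                arr[i] = alloc_entry();
                if (!arr[i])
                        goto err_undo;
        }
        return 0;

err_undo:
        while (i--)             /* stops cleanly even when i == 0 */
                free_entry(arr[i]);
        return -ENOMEM;
}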
Fixes: 395cacb5f1a0 ("netdevsim: bpf: support fake map offload")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Meelis reported the following warning on a quad P3 HP NetServer museum piece:
WARNING: CPU: 3 PID: 258 at kernel/irq/chip.c:244 __irq_startup+0x80/0x100
EIP: __irq_startup+0x80/0x100
irq_startup+0x7e/0x170
probe_irq_on+0x128/0x2b0
parport_irq_probe.constprop.18+0x8d/0x1af [parport_pc]
parport_pc_probe_port+0xf11/0x1260 [parport_pc]
parport_pc_init+0x78a/0xf10 [parport_pc]
parport_parse_param.constprop.16+0xf0/0xf0 [parport_pc]
do_one_initcall+0x45/0x1e0
This is caused by the rewrite of the irq activation/startup sequence,
which missed converting a call site in the irq legacy auto-probing code.
To fix this, irq_activate_and_startup() needs to gain a return value so
the pending logic can work properly.
Fixes: c942cee46bba ("genirq: Separate activation and startup")
Reported-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Meelis Roos <mroos@linux.ee>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801301935410.1797@nanos
|
|
Commit 75f139aaf896 "KVM: x86: Add memory barrier on vmcs field lookup"
added a raw 'asm("lfence");' to prevent a bounds check bypass of
'vmcs_field_to_offset_table'.
The lfence can be avoided in this path by using the array_index_nospec()
helper designed for these types of fixes.
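The pattern, sketched with an illustrative lookup table (the real hunk
indexes vmcs_field_to_offset_table):

#include <linux/errno.h>
#include <linux/nospec.h>

static inline short example_field_to_offset(unsigned int field,
                                            const short *table, size_t size)
{
        if (field >= size)
                return -ENOENT;

        /* clamp the index under speculation instead of a raw lfence */
        field = array_index_nospec(field, size);
        return table[field];
}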
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Honig <ahonig@google.com>
Cc: kvm@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Link: https://lkml.kernel.org/r/151744959670.6342.3001723920950249067.stgit@dwillia2-desk3.amr.corp.intel.com
|
|
Avoids data connection stalls when the client toggles powersave mode
Fixes: aee5b8cf2477 ("mt76: implement A-MPDU rx reordering in the driver code")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
|
|
Fixes: aee5b8cf2477 ("mt76: implement A-MPDU rx reordering in the driver code")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
|
|
Avoids timeouts on reordered A-MPDU rx frames
Fixes: aee5b8cf2477 ("mt76: implement A-MPDU rx reordering in the driver code")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
|
|
With software A-MPDU reordering in place, frames that notify mac80211 of
powersave changes are reordered as well, which can cause connection
stalls. Fix this by implementing powersave state processing in the
driver.
Fixes: aee5b8cf2477 ("mt76: implement A-MPDU rx reordering in the driver code")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
|
|
The last pull request didn't make it to v4.15 (I was too late) so pull it to
wireless-drivers-next.git instead so that it can go to v4.16.
|
|
Prepare input updates for 4.16 merge window.
|
|
Don't use u32; use uint32_t, because u32 won't work in xfsprogs.
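Illustrative example of the rule (u32 is a kernel-only typedef, so shared
structures should use the C99 fixed-width types):

#include <stdint.h>

struct example_ondisk_record {
        uint32_t magic;         /* not u32 */
        uint32_t length;
};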
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
|
|
Pull documentation updates from Jonathan Corbet:
"Documentation updates for 4.16.
New stuff includes refcount_t documentation, errseq documentation,
kernel-doc support for nested structure definitions, the removal of
lots of crufty kernel-doc support for unused formats, SPDX tag
documentation, the beginnings of a manual for subsystem maintainers,
and lots of fixes and updates.
As usual, some of the changesets reach outside of Documentation/ to
effect kerneldoc comment fixes. It also adds the new LICENSES
directory, of which Thomas promises I do not need to be the
maintainer"
* tag 'docs-4.16' of git://git.lwn.net/linux: (65 commits)
linux-next: docs-rst: Fix typos in kfigure.py
linux-next: DOC: HWPOISON: Fix path to debugfs in hwpoison.txt
Documentation: Fix misconversion of #if
docs: add index entry for networking/msg_zerocopy
Documentation: security/credentials.rst: explain need to sort group_list
LICENSES: Add MPL-1.1 license
LICENSES: Add the GPL 1.0 license
LICENSES: Add Linux syscall note exception
LICENSES: Add the MIT license
LICENSES: Add the BSD-3-clause "Clear" license
LICENSES: Add the BSD 3-clause "New" or "Revised" License
LICENSES: Add the BSD 2-clause "Simplified" license
LICENSES: Add the LGPL-2.1 license
LICENSES: Add the LGPL 2.0 license
LICENSES: Add the GPL 2.0 license
Documentation: Add license-rules.rst to describe how to properly identify file licenses
scripts: kernel_doc: better handle show warnings logic
fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at
doc: md: Fix a file name to md-fault.c in fault-injection.txt
errseq: Add to documentation tree
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vmci iov_iter updates from Al Viro:
"Get rid of "is it an iovec or an entire array?" flags in vmxi - just
use iov_iter. Simplifies the living hell out of that code..."
* 'work.vmci' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vmci: the same on the send side...
vmci: simplify qp_dequeue_locked()
vmci: get rid of qp_memcpy_from_queue()
vmci: fix buf_size in case of iovec-based accesses
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull asm/uaccess.h whack-a-mole from Al Viro:
"It's linux/uaccess.h, damnit... Oh, well - eventually they'll stop
cropping up..."
* 'work.whack-a-mole' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
asm-prototypes.h: use linux/uaccess.h, not asm/uaccess.h
riscv: use linux/uaccess.h, not asm/uaccess.h...
ppc: for put_user() pull linux/uaccess.h, not asm/uaccess.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache updates from Al Viro:
"Neil Brown's d_move()/d_path() race fix"
* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: close race between getcwd() and d_move()
|
|
Merge updates from Andrew Morton:
- misc fixes
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
mm: remove PG_highmem description
tools, vm: new option to specify kpageflags file
mm/swap.c: make functions and their kernel-doc agree
mm, memory_hotplug: fix memmap initialization
mm: correct comments regarding do_fault_around()
mm: numa: do not trap faults on shared data section pages.
hugetlb, mbind: fall back to default policy if vma is NULL
hugetlb, mempolicy: fix the mbind hugetlb migration
mm, hugetlb: further simplify hugetlb allocation API
mm, hugetlb: get rid of surplus page accounting tricks
mm, hugetlb: do not rely on overcommit limit during migration
mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
mm, hugetlb: unify core page allocation accounting and initialization
mm/memcontrol.c: try harder to decrease [memory,memsw].limit_in_bytes
mm/memcontrol.c: make local symbol static
mm/hmm: fix uninitialized use of 'entry' in hmm_vma_walk_pmd()
include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointer
mm/compaction.c: fix comment for try_to_compact_pages()
mm/page_ext.c: make page_ext_init a noop when CONFIG_PAGE_EXTENSION but nothing uses it
zsmalloc: use U suffix for negative literals being shifted
...
|
|
When copying between the vcpu and svcpu, we may get scheduled away onto
a different host CPU which in turn means our svcpu pointer may change.
That means we need to atomically copy to and from the svcpu with preemption
disabled, so that all code around it always sees a coherent state.
Reported-by: Simon Guo <wei.guo.simon@gmail.com>
Fixes: 3d3319b45eea ("KVM: PPC: Book3S: PR: Enable interrupts earlier")
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
Running with CONFIG_DEBUG_ATOMIC_SLEEP reveals that HV KVM tries to
read guest memory, in order to emulate guest instructions, while
preempt is disabled and a vcore lock is held. This occurs in
kvmppc_handle_exit_hv(), called from post_guest_process(), when
emulating guest doorbell instructions on POWER9 systems, and also
when checking whether we have hit a hypervisor breakpoint.
Reading guest memory can cause a page fault and thus cause the
task to sleep, so we need to avoid reading guest memory while
holding a spinlock or when preempt is disabled.
To fix this, we move the preempt_enable() in kvmppc_run_core() to
before the loop that calls post_guest_process() for each vcore that
has just run, and we drop and re-take the vcore lock around the calls
to kvmppc_emulate_debug_inst() and kvmppc_emulate_doorbell_instr().
Dropping the lock is safe with respect to the iteration over the
runnable vcpus in post_guest_process(); for_each_runnable_thread
is actually safe to use locklessly. It is possible for a vcpu
to become runnable and add itself to the runnable_threads array
(code near the beginning of kvmppc_run_vcpu()) and then get included
in the iteration in post_guest_process despite the fact that it
has not just run. This is benign because vcpu->arch.trap and
vcpu->arch.ceded will be zero.
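The shape of the change around the sleeping emulation calls, as a sketch
(the stand-in helper below is hypothetical; the real code also tracks the
vcore state while the lock is dropped):

#include <linux/spinlock.h>
#include <linux/kvm_host.h>

int example_emulate_one(struct kvm_vcpu *vcpu);  /* hypothetical, may sleep */

static int example_emulate_with_guest_access(struct kvmppc_vcore *vc,
                                             struct kvm_vcpu *vcpu)
{
        int r;

        /* reading guest memory can fault and sleep, so do it without
         * holding the vcore spinlock
         */
        spin_unlock(&vc->lock);
        r = example_emulate_one(vcpu);
        spin_lock(&vc->lock);
        return r;
}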
Cc: stable@vger.kernel.org # v4.13+
Fixes: 579006944e0d ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
In the past the ast driver relied upon the fbdev emulation helpers to
call ->load_lut at boot-up. But since
commit b8e2b0199cc377617dc238f5106352c06dcd3fa2
Author: Peter Rosin <peda@axentia.se>
Date: Tue Jul 4 12:36:57 2017 +0200
drm/fb-helper: factor out pseudo-palette
that's been cleaned up and drivers are expected to boot into a consistent
LUT state on their own. This patch fixes the ast driver accordingly.
Fixes: b8e2b0199cc3 ("drm/fb-helper: factor out pseudo-palette")
Cc: Peter Rosin <peda@axenita.se>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: <stable@vger.kernel.org> # v4.14+
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=198123
Cc: Bill Fraser <bill.fraser@gmail.com>
Reported-and-Tested-by: Bill Fraser <bill.fraser@gmail.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
git://anongit.freedesktop.org/drm/drm-misc into drm-next
This contains a fix to restrict what lessee can do with masters and
another one when waiting for timeouts on reservation objects.
* tag 'drm-misc-next-fixes-2018-01-31' of git://anongit.freedesktop.org/drm/drm-misc:
drm: Check for lessee in DROP_MASTER ioctl
dma-buf: fix reservation_object_wait_timeout_rcu once more v2
|
|
Commit cbe37d093707 ("[PATCH] mm: remove PG_highmem") removed PG_highmem
to save a page flag. So the description of PG_highmem is no longer
needed.
Link: http://lkml.kernel.org/r/1517391212-2950-1-git-send-email-miles.chen@mediatek.com
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
page-types currently hardcodes /proc/kpageflags as the file to parse.
This works when using the tool to examine the state of pageflags on the
same system, but it does not allow storing a snapshot of pageflags at a
given time to debug issues, nor parsing a snapshot taken on a different
system.
This allows the user to specify a saved version of kpageflags with a new
page-types -F option.
[akpm@linux-foundation.org: add "filename" to fix usage() string]
[rientjes@google.com: fix layout]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801301840050.140969@chino.kir.corp.google.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801301458180.153857@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix some basic kernel-doc notation in mm/swap.c:
- for function lru_cache_add_anon(), make its kernel-doc function name
match its function name and change colon to hyphen following the
function name
- for function pagevec_lookup_entries(), change the function parameter
name from nr_pages to nr_entries since that is more descriptive of
what the parameter actually is and then it matches the kernel-doc
comments also
Fix function kernel-doc to match the change in commit 67fd707f4681:
- drop the kernel-doc notation for @nr_pages from
pagevec_lookup_range() and correct the function description for that
change
Link: http://lkml.kernel.org/r/3b42ee3e-04a9-a6ca-6be4-f00752a114fe@infradead.org
Fixes: 67fd707f4681 ("mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Bharata has noticed that onlining newly added memory doesn't increase
the total memory, pointing to commit f7f99100d8d9 ("mm: stop zeroing
memory during allocation in vmemmap") as a culprit. This commit has
changed the way how the memory for memmaps is initialized and moves it
from the allocation time to the initialization time. This works
properly for the early memmap init path.
It doesn't work for memory hotplug, though, because there we need to
mark pages as reserved when the sparsemem section is created and later
initialize them completely during onlining. memmap_init_zone is called
in the early stage of onlining. With the current code it calls
__init_single_page, which clears the whole struct page (including the
reserved flag), and therefore online_pages_range skips those pages.
Fix this by skipping mm_zero_struct_page in __init_single_page for the
memory hotplug path. This is quite ugly, but unifying the early init
and memory hotplug init paths is a larger project. Make sure we plug
the regression at least.
Link: http://lkml.kernel.org/r/20180130101141.GW21609@dhcp22.suse.cz
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There are multiple comments surrounding do_fault_around that mention
fault_around_pages() and fault_around_mask(), two routines that no
longer exist. Reword these comments to reference fault_around_bytes,
the value which is used to determine how much do_fault_around() will
attempt to read when processing a fault.
These comments should have been updated when fault_around_pages() and
fault_around_mask() were removed in commit aecd6f44266c ("mm: close race
between do_fault_around() and fault_around_bytes_set()").
Fixes: aecd6f44266c1 ("mm: close race between do_fault_around() and fault_around_bytes_set()")
Link: http://lkml.kernel.org/r/302D0B14-C7E9-44C6-8BED-033F9ACBD030@oracle.com
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Workloads consisting of a large number of processes running the same
program with a very large shared data segment may experience performance
problems when numa balancing attempts to migrate the shared cow pages.
This manifests itself with many processes or tasks in
TASK_UNINTERRUPTIBLE state waiting for the shared pages to be migrated.
The program listed below simulates the conditions with these results
when run with 288 processes on a 144 core/8 socket machine.
  Average throughput       Average throughput       Average throughput
  with numa_balancing=0    with numa_balancing=1    with numa_balancing=1
                           without the patch        with the patch
  ---------------------    ---------------------    ---------------------
        2118782                  2021534                  2107979
Complex production environments show less variability and fewer poorly
performing outliers accompanied with a smaller number of processes
waiting on NUMA page migration with this patch applied. In some cases,
%iowait drops from 16%-26% to 0.
// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
 */
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <wait.h>
#include <sys/mman.h>

int a[1000000] = {13};

int main(int argc, const char **argv)
{
        int n = 0;
        int i;
        pid_t pid;
        int stat;
        int *count_array;
        int cpu_count = 288;
        long total = 0;
        struct timeval t1, t2 = {(argc > 1 ? atoi(argv[1]) : 10), 0};

        if (argc > 2)
                cpu_count = atoi(argv[2]);

        count_array = mmap(NULL, cpu_count * sizeof(int),
                           (PROT_READ|PROT_WRITE),
                           (MAP_SHARED|MAP_ANONYMOUS), 0, 0);
        if (count_array == MAP_FAILED) {
                perror("mmap:");
                return 0;
        }

        for (i = 0; i < cpu_count; ++i) {
                pid = fork();
                if (pid <= 0)
                        break;
                if ((i & 0xf) == 0)
                        usleep(2);
        }

        if (pid != 0) {
                if (i == 0) {
                        perror("fork:");
                        return 0;
                }
                for (;;) {
                        pid = wait(&stat);
                        if (pid < 0)
                                break;
                }
                for (i = 0; i < cpu_count; ++i)
                        total += count_array[i];
                printf("Total %ld\n", total);
                munmap(count_array, cpu_count * sizeof(int));
                return 0;
        }

        gettimeofday(&t1, 0);
        timeradd(&t1, &t2, &t1);
        while (timercmp(&t2, &t1, <)) {
                int b = 0;
                int j;

                for (j = 0; j < 1000000; j++)
                        b += a[j];
                gettimeofday(&t2, 0);
                n++;
        }
        count_array[i] = n;
        return 0;
}
This patch changes change_pte_range() to skip shared copy-on-write pages
when called from change_prot_numa().
NOTE: change_prot_numa() is nominally called from task_numa_work() and
queue_pages_test_walk(). task_numa_work() is the auto NUMA balancing
path, and queue_pages_test_walk() is part of explicit NUMA policy
management. However, queue_pages_test_walk() only calls
change_prot_numa() when MPOL_MF_LAZY is specified and currently that is
not allowed, so change_prot_numa() is only called from auto NUMA
balancing.
In the case of explicit NUMA policy management, shared pages are not
migrated unless MPOL_MF_MOVE_ALL is specified, and MPOL_MF_MOVE_ALL
depends on CAP_SYS_NICE. Currently, there is no way to pass information
about MPOL_MF_MOVE_ALL to change_pte_range. This will have to be fixed
if MPOL_MF_LAZY is enabled and MPOL_MF_MOVE_ALL is to be honored in lazy
migration mode.
task_numa_work() skips the read-only VMAs of programs and shared
libraries.
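The kind of check this adds, expressed as a sketch (the real change is an
inline test in change_pte_range(); the helper names here are illustrative):

#include <linux/mm.h>

/* same condition as the kernel's is_cow_mapping() helper */
static bool example_is_cow_mapping(unsigned long flags)
{
        return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

static bool example_skip_numa_hinting(struct vm_area_struct *vma,
                                      struct page *page)
{
        if (!page || PageKsm(page))
                return true;

        /* a shared copy-on-write page: trapping NUMA hinting faults on it
         * would stall every process mapping it
         */
        return example_is_cow_mapping(vma->vm_flags) &&
               page_mapcount(page) != 1;
}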
Link: http://lkml.kernel.org/r/1516751617-7369-1-git-send-email-henry.willard@oracle.com
Signed-off-by: Henry Willard <henry.willard@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Dan Carpenter has noticed that the mbind migration callback (new_page)
can get a NULL vma pointer and choke on it inside alloc_huge_page_vma,
which relies on the VMA to get the hstate. We used to BUG_ON this case,
but the BUG_ON has been removed recently by "hugetlb, mempolicy: fix the
mbind hugetlb migration".
The proper way to handle this is to get the hstate from the migrated
page and rely on huge_node (resp. get_vma_policy) to do the right thing
with a NULL VMA. We currently fall back to the default mempolicy in
that case, which is in line with what the THP path does here.
Link: http://lkml.kernel.org/r/20180110104712.GR1732@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The do_mbind migration code relies on alloc_huge_page_noerr for hugetlb
pages. alloc_huge_page_noerr uses alloc_huge_page, which is a high-level
allocation function that has to take care of reserves, overcommit and
hugetlb cgroup accounting. None of that is really required for page
migration because the new page is only temporary and will either replace
the original page or be dropped. This is essentially the same as for the
other migration call paths, and there shouldn't be any reason to handle
mbind in a special way.
The current implementation is even suboptimal because the migration
might fail just because the hugetlb cgroup limit is reached or the
overcommit is saturated.
Fix this by making mbind behave like the other hugetlb migration paths.
Add a new migration helper, alloc_huge_page_vma, as a wrapper around
alloc_huge_page_nodemask with additional mempolicy handling.
alloc_huge_page_noerr has no more users, so it can go.
Link: http://lkml.kernel.org/r/20180103093213.26329-7-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The hugetlb allocator has several layers of allocation functions
depending on the purpose of the allocation. There are two allocators,
depending on whether the page can be allocated from the page allocator
or whether we need a contiguous allocator. This is currently opencoded
in alloc_fresh_huge_page, which is the only path that might allocate
giga pages, which require the latter allocator. Create
alloc_fresh_huge_page which hides this implementation detail and use it
in all callers which hardcoded the buddy allocator path
(__hugetlb_alloc_buddy_huge_page).
This shouldn't introduce any functional change because both the
migration and surplus allocators exclude giga pages explicitly.
While we are at it, let's do some renaming. The current scheme is not
consistent and overly painful to read and understand. Get rid of prefix
underscores from most functions. There is no real reason to make names
longer.
* alloc_fresh_huge_page is the new layer to abstract underlying
allocator
* __hugetlb_alloc_buddy_huge_page becomes shorter and neater
alloc_buddy_huge_page.
* Former alloc_fresh_huge_page becomes alloc_pool_huge_page because we put
the new page directly to the pool
* alloc_surplus_huge_page can drop the opencoded prep_new_huge_page code
as it uses alloc_fresh_huge_page now
* others lose their excessive prefix underscores to make names shorter
[dan.carpenter@oracle.com: fix double unlock bug in alloc_surplus_huge_page()]
Link: http://lkml.kernel.org/r/20180109200559.g3iz5kvbdrz7yydp@mwanda
Link: http://lkml.kernel.org/r/20180103093213.26329-6-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
alloc_surplus_huge_page increases the pool size and the number of
surplus pages opportunistically to prevent races with pool size
changes. See commit d1c3fb1f8f29 ("hugetlb: introduce
nr_overcommit_hugepages sysctl") for more details.
The resulting code is unnecessarily hairy, causes code duplication, and
doesn't allow sharing the allocation paths. Moreover, pool size changes
tend to be very rare, so optimizing for them is not really reasonable.
Simplify the code and allow allocating a fresh surplus page as long as
we are under the overcommit limit, then recheck the condition after the
allocation and drop the new page if the situation has changed. This
should provide a reasonable guarantee that abrupt allocation requests
will not go way over the limit.
If we consider races with the pool shrinking and enlarging, then we
should be reasonably safe as well. In the first case we are off by one
in the worst case, and the second case should work OK because the page
is not yet visible. We can waste CPU cycles for the allocation, but
that should be acceptable for a relatively rare condition.
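Sketched, the allocate-then-recheck flow looks something like this (field
and helper names follow the hugetlb code, but this is illustrative rather
than the exact patch; the real code also maintains per-node counters):

#include <linux/hugetlb.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

/* the new allocation layer from earlier in this series; declared here
 * only for the sketch
 */
struct page *alloc_fresh_huge_page(struct hstate *h, gfp_t gfp_mask,
                                   int nid, nodemask_t *nmask);

static struct page *example_alloc_surplus(struct hstate *h, gfp_t gfp_mask,
                                          int nid, nodemask_t *nmask)
{
        struct page *page = NULL;

        spin_lock(&hugetlb_lock);
        if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
                goto out_unlock;
        spin_unlock(&hugetlb_lock);

        page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask); /* may sleep */
        if (!page)
                return NULL;

        spin_lock(&hugetlb_lock);
        if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
                /* the limit was hit while we were allocating: drop the page */
                put_page(page);
                page = NULL;
        } else {
                h->surplus_huge_pages++;
                h->nr_huge_pages++;
        }
out_unlock:
        spin_unlock(&hugetlb_lock);
        return page;
}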
Link: http://lkml.kernel.org/r/20180103093213.26329-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|