Age | Commit message (Collapse) | Author |
|
Move completion related items (like the call single data) near the
end of the struct, instead of mixing them in with the initial
queueing related fields.
Move queuelist below the bio structures. Then we have all
queueing related bits in the first cache line.
This yields a 1.5-2% increase in IOPS for a null_blk test, both for
sync and for high thread count access. Sync test goes form 975K to
992K, 32-thread case from 20.8M to 21.2M IOPS.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We only have one atomic flag left. Instead of using an entire
unsigned long for that, steal the bottom bit of the deadline
field that we already reserved.
Remove ->atomic_flags, since it's now unused.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We reduce the resolution of request expiry, but since we're already
using jiffies for this where resolution depends on the kernel
configuration and since the timeout resolution is coarse anyway,
that should be fine.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We don't need this to be an atomic flag, it can be a regular
flag. We either end up on the same CPU for the polling, in which
case the state is sane, or we did the sleep which would imply
the needed barrier to ensure we see the right state.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch provides a common decoder for block status path related errors
that may be retried so various entities wishing to consult this do not
have to duplicate this decision.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The RCU protection has been expanded to cover both queueing and
completion paths making ->queue_rq_srcu a misnomer. Rename it to
->srcu as suggested by Bart.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
After the recent updates to use generation number and state based
synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
to avoid firing the same timeout multiple times.
Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
times. This removes atomic bitops from hot paths too.
v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().
v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Many kernel drivers contain code that allocates and frees both a
scatterlist and the pages that populate that scatterlist.
Introduce functions in lib/scatterlist.c that perform these tasks
instead of duplicating this functionality in multiple drivers.
Only include these functions in the build if CONFIG_SGL_ALLOC=y
to avoid that the kernel size increases if this functionality is
not used.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bcache is the only user of bio_alloc_pages(), so move this function into
bcache, and avoid it being misused in the future.
Also rename it to bch_bio_allo_pages() since it is bcache only.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Firstly this patch introduces BVEC_ITER_ALL_INIT for iterating one bio
from start to end.
As we need to support multipage bvecs, don't access bio->bi_io_vec
in copy_to_high_bio_irq(), and just use the standard iterator for that.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The following helpers are introduced for converting current users of
direct access to bvec table, and prepares for supporting multipage bvec:
bio_pages_all()
bio_first_bvec_all()
bio_first_page_all()
bio_last_bvec_all()
All are named as bio_*_all() to following bio_for_each_segment_all(),
they can only be used on bio of !bio_flagged(bio, BIO_CLONED), that means
the whole bvec table is covered.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Components relying only on the request_queue structure for accessing
block devices (e.g. I/O schedulers) have a limited knowledged of the
device characteristics. In particular, the device capacity cannot be
easily discovered, which for a zoned block device also result in the
inability to easily know the number of zones of the device (the zone
size is indicated by the chunk_sectors field of the queue limits).
Introduce the nr_zones field to the request_queue structure to simplify
access to this information. Also, add the bitmap seq_zone_bitmap which
indicates which zones of the device are sequential zones (write
preferred or write required) and the bitmap seq_zones_wlock which
indicates if a zone is write locked, that is, if a write request
targeting a zone was dispatched to the device. These fields are
initialized by the low level block device driver (sd.c for ZBC/ZAC
disks). They are not initialized by stacking drivers (device mappers)
handling zoned block devices (e.g. dm-linear).
Using this, I/O schedulers can introduce zone write locking to control
request dispatching to a zoned block device and avoid write request
reordering by limiting to at most a single write request per zone
outside of the scheduler at any time.
Based on previous patches from Damien Le Moal.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Damien]
* Fixed comments and identation in blkdev.h
* Changed helper functions
* Fixed this commit message
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Allow to set the over-provision percentage on target creation. In case
that the value is not provided, fall back to the default value set by
the target.
In pblk, set the default OP to 11% of the total size of the device
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Prepare for the 2.0 revision by adapting the geometry
structures to coexist with the 1.2 revision.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The lower page table is unused. All page tables reported by 1.2
devices are all reporting a sequential 1:1 page mapping. This is
also not used going forward with the 2.0 revision.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove the wait filed in nvm_rq. It is not used anymore, as targets rely
on the functionality provided by the LightNVM subsystem when sending
sync I/O.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now that rrpc have been removed. Also remove the hybrid 1.2 support
from the core.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now that rrpc has been removed, the only users of the ppa helpers
is pblk. However, pblk already defines similar functions.
Switch pblk to use the internal ones, and remove the generic ppa
helpers.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull Page Table Isolation (PTI) preparatory tree from Ingo Molnar:
"This does a rename to free up linux/pti.h to be used by the upcoming
page table isolation feature"
* 'WIP.x86-pti.base.prep-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
drivers/misc/intel/pti: Rename the header file to free up the namespace
|
|
We'd like to use the 'PTI' acronym for 'Page Table Isolation' - free up the
namespace by renaming the <linux/pti.h> driver header to <linux/intel-pti.h>.
(Also standardize the header guard name while at it.)
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: J Freyensee <james_p_freyensee@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Pull networking fixes from David Miller:
1) Clamp timeouts to INT_MAX in conntrack, from Jay Elliot.
2) Fix broken UAPI for BPF_PROG_TYPE_PERF_EVENT, from Hendrik
Brueckner.
3) Fix locking in ieee80211_sta_tear_down_BA_sessions, from Johannes
Berg.
4) Add missing barriers to ptr_ring, from Michael S. Tsirkin.
5) Don't advertise gigabit in sh_eth when not available, from Thomas
Petazzoni.
6) Check network namespace when delivering to netlink taps, from Kevin
Cernekee.
7) Kill a race in raw_sendmsg(), from Mohamed Ghannam.
8) Use correct address in TCP md5 lookups when replying to an incoming
segment, from Christoph Paasch.
9) Add schedule points to BPF map alloc/free, from Eric Dumazet.
10) Don't allow silly mtu values to be used in ipv4/ipv6 multicast, also
from Eric Dumazet.
11) Fix SKB leak in tipc, from Jon Maloy.
12) Disable MAC learning on OVS ports of mlxsw, from Yuval Mintz.
13) SKB leak fix in skB_complete_tx_timestamp(), from Willem de Bruijn.
14) Add some new qmi_wwan device IDs, from Daniele Palmas.
15) Fix static key imbalance in ingress qdisc, from Jiri Pirko.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
net: qcom/emac: Reduce timeout for mdio read/write
net: sched: fix static key imbalance in case of ingress/clsact_init error
net: sched: fix clsact init error path
ip_gre: fix wrong return value of erspan_rcv
net: usb: qmi_wwan: add Telit ME910 PID 0x1101 support
pkt_sched: Remove TC_RED_OFFLOADED from uapi
net: sched: Move to new offload indication in RED
net: sched: Add TCA_HW_OFFLOAD
net: aquantia: Increment driver version
net: aquantia: Fix typo in ethtool statistics names
net: aquantia: Update hw counters on hw init
net: aquantia: Improve link state and statistics check interval callback
net: aquantia: Fill in multicast counter in ndev stats from hardware
net: aquantia: Fill ndev stat couters from hardware
net: aquantia: Extend stat counters to 64bit values
net: aquantia: Fix hardware DMA stream overload on large MRRS
net: aquantia: Fix actual speed capabilities reporting
sock: free skb in skb_complete_tx_timestamp on error
s390/qeth: update takeover IPs after configuration change
s390/qeth: lock IP table while applying takeover changes
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
"Misc fixes:
- Fix a S390 boot hang that was caused by the lock-break logic.
Remove lock-break to begin with, as review suggested it was
unreasonably fragile and our confidence in its continued good
health is lower than our confidence in its removal.
- Remove the lockdep cross-release checking code for now, because of
unresolved false positive warnings. This should make lockdep work
well everywhere again.
- Get rid of the final (and single) ACCESS_ONCE() straggler and
remove the API from v4.15.
- Fix a liblockdep build warning"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools/lib/lockdep: Add missing declaration of 'pr_cont()'
checkpatch: Remove ACCESS_ONCE() warning
compiler.h: Remove ACCESS_ONCE()
tools/include: Remove ACCESS_ONCE()
tools/perf: Convert ACCESS_ONCE() to READ_ONCE()
locking/lockdep: Remove the cross-release locking checks
locking/core: Remove break_lock field when CONFIG_GENERIC_LOCKBREAK=y
locking/core: Fix deadlock during boot on systems with GENERIC_LOCKBREAK
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"This fixes an issue in two recent commits that may cause
pm_runtime_enable() to be called for too many times for some devices
during the "thaw" transition belonging to hibernation"
* tag 'pm-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / sleep: Avoid excess pm_runtime_enable() calls in device_resume()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various fix-ups:
- comment fixes
- build fix
- better memory alloction (don't use NR_CPUS)
- configuration fix
- build warning fix
- enhanced callback parameter (to simplify users of trace hooks)
- give up on stack tracing when RCU isn't watching (it's a lost
cause)"
* tag 'trace-v4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have stack trace not record if RCU is not watching
tracing: Pass export pointer as argument to ->write()
ring-buffer: Remove unused function __rb_data_page_index()
tracing: make PREEMPTIRQ_EVENTS depend on TRACING
tracing: Allocate mask_str buffer dynamically
tracing: always define trace_{irq,preempt}_{enable_disable}
tracing: Fix code comments in trace.c
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
- add a pci_get_domain_bus_and_slot() stub for the CONFIG_PCI=n case to
avoid build breakage in the v4.16 merge window if a
pci_get_bus_and_slot() -> pci_get_domain_bus_and_slot() patch gets
merged before the PCI tree (Randy Dunlap)
- fix an AMD boot regression in the 64bit BAR support added in v4.15
(Christian König)
- fix an R-Car use-after-free that causes a crash if no PCIe card is
present (Geert Uytterhoeven)
* tag 'pci-v4.15-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: rcar: Fix use-after-free in probe error path
x86/PCI: Only enable a 64bit BAR on single-socket AMD Family 15h
x86/PCI: Fix infinite loop in search for 64bit BAR placement
PCI: Add pci_get_domain_bus_and_slot() stub
|
|
David Rientjes has reported the following memory corruption while the
oom reaper tries to unmap the victims address space
BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000
addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d
file: (null) fault: (null) mmap: (null) readpage: (null)
CPU: 2 PID: 1001 Comm: oom_reaper
Call Trace:
unmap_page_range+0x1068/0x1130
__oom_reap_task_mm+0xd5/0x16b
oom_reaper+0xff/0x14c
kthread+0xc1/0xe0
Tetsuo Handa has noticed that the synchronization inside exit_mmap is
insufficient. We only synchronize with the oom reaper if
tsk_is_oom_victim which is not true if the final __mmput is called from
a different context than the oom victim exit path. This can trivially
happen from context of any task which has grabbed mm reference (e.g. to
read /proc/<pid>/ file which requires mm etc.).
The race would look like this
oom_reaper oom_victim task
mmget_not_zero
do_exit
mmput
__oom_reap_task_mm mmput
__mmput
exit_mmap
remove_vma
unmap_page_range
Fix this issue by providing a new mm_is_oom_victim() helper which
operates on the mm struct rather than a task. Any context which
operates on a remote mm struct should use this helper in place of
tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared
so it is stable in the exit_mmap path.
Debugged by Tetsuo Handa.
Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org
Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: <stable@vger.kernel.org> [4.14]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In testing, we found that nfsd threads may call set_groups in parallel
for the same entry cached in auth.unix.gid, racing in the call of
groups_sort, corrupting the groups for that entry and leading to
permission denials for the client.
This patch:
- Make groups_sort globally visible.
- Move the call to groups_sort to the modifiers of group_info
- Remove the call to groups_sort from set_groups
Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com
Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
gcc-8 warns about using strncpy() with the source size as the limit:
fs/exec.c:1223:32: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess]
This is indeed slightly suspicious, as it protects us from source
arguments without NUL-termination, but does not guarantee that the
destination is terminated.
This keeps the strncpy() to ensure we have properly padded target
buffer, but ensures that we use the correct length, by passing the
actual length of the destination buffer as well as adding a build-time
check to ensure it is exactly TASK_COMM_LEN.
There are only 23 callsites which I all reviewed to ensure this is
currently the case. We could get away with doing only the check or
passing the right length, but it doesn't hurt to do both.
Link: http://lkml.kernel.org/r/20171205151724.1764896-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Kees Cook <keescook@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Aleksa Sarai <asarai@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The hardened strlen() function causes rather large stack usage in at
least one file in the kernel, in particular when CONFIG_KASAN is
enabled:
drivers/media/usb/em28xx/em28xx-dvb.c: In function 'em28xx_dvb_init':
drivers/media/usb/em28xx/em28xx-dvb.c:2062:1: error: the frame size of 3256 bytes is larger than 204 bytes [-Werror=frame-larger-than=]
Analyzing this problem led to the discovery that gcc fails to merge the
stack slots for the i2c_board_info[] structures after we strlcpy() into
them, due to the 'noreturn' attribute on the source string length check.
I reported this as a gcc bug, but it is unlikely to get fixed for gcc-8,
since it is relatively easy to work around, and it gets triggered
rarely. An earlier workaround I did added an empty inline assembly
statement before the call to fortify_panic(), which works surprisingly
well, but is really ugly and unintuitive.
This is a new approach to the same problem, this time addressing it by
not calling the 'extern __real_strnlen()' function for string constants
where __builtin_strlen() is a compile-time constant and therefore known
to be safe.
We do this by checking if the last character in the string is a
compile-time constant '\0'. If it is, we can assume that strlen() of
the string is also constant.
As a side-effect, this should also improve the object code output for
any other call of strlen() on a string constant.
[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/20171205215143.3085755-1-arnd@arndb.de
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82365
Link: https://patchwork.kernel.org/patch/9980413/
Link: https://patchwork.kernel.org/patch/9974047/
Fixes: 6974f0c4555 ("include/linux/string.h: add the option of fortified string.h functions")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Martin Wilck <mwilck@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add a variant of rbtree_replace_node() that maintains the leftmost cache
of struct rbtree_root_cached when replacing nodes within the rbtree.
As drm_mm is the only rb_replace_node() being used on an interval tree,
the mistake looks fairly self-contained. Furthermore the only user of
drm_mm_replace_node() is its testsuite...
Testcase: igt/drm_mm/replace
Link: http://lkml.kernel.org/r/20171122100729.3742-1-chris@chris-wilson.co.uk
Link: https://patchwork.freedesktop.org/patch/msgid/20171109212435.9265-1-chris@chris-wilson.co.uk
Fixes: f808c13fd373 ("lib/interval_tree: fast overlap detection")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The <linux/bug.h> was removed from radix-tree.h by commit f5bba9d11a25
("include/linux/radix-tree.h: remove unneeded #include <linux/bug.h>").
Since that commit, tools/testing/radix-tree/ couldn't pass compilation
due to tools/testing/radix-tree/idr.c:17: undefined reference to
WARN_ON_ONCE. This patch adds the bug.h header to idr.h to solve the
issue.
Link: http://lkml.kernel.org/r/1511963726-34070-2-git-send-email-wei.w.wang@intel.com
Fixes: f5bba9d11a2 ("include/linux/radix-tree.h: remove unneeded #include <linux/bug.h>")
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There are no longer any kernelspace uses of ACCESS_ONCE(), so we can
remove the definition from <linux/compiler.h>.
This patch removes the ACCESS_ONCE() definition, and updates comments
which referred to it. At the same time, some inconsistent and redundant
whitespace is removed from comments.
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: apw@canonical.com
Link: http://lkml.kernel.org/r/20171127103824.36526-4-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
while it found a number of old bugs initially, was also causing too many
false positives that caused people to disable lockdep - which is arguably
a worse overall outcome.
If we disable cross-release by default but keep the code upstream then
in practice the most likely outcome is that we'll allow the situation
to degrade gradually, by allowing entropy to introduce more and more
false positives, until it overwhelms maintenance capacity.
Another bad side effect was that people were trying to work around
the false positives by uglifying/complicating unrelated code. There's
a marked difference between annotating locking operations and
uglifying good code just due to bad lock debugging code ...
This gradual decrease in quality happened to a number of debugging
facilities in the kernel, and lockdep is pretty complex already,
so we cannot risk this outcome.
Either cross-release checking can be done right with no false positives,
or it should not be included in the upstream kernel.
( Note that it might make sense to maintain it out of tree and go through
the false positives every now and then and see whether new bugs were
introduced. )
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When CONFIG_GENERIC_LOCKBEAK=y, locking structures grow an extra int ->break_lock
field which is used to implement raw_spin_is_contended() by setting the field
to 1 when waiting on a lock and clearing it to zero when holding a lock.
However, there are a few problems with this approach:
- There is a write-write race between a CPU successfully taking the lock
(and subsequently writing break_lock = 0) and a waiter waiting on
the lock (and subsequently writing break_lock = 1). This could result
in a contended lock being reported as uncontended and vice-versa.
- On machines with store buffers, nothing guarantees that the writes
to break_lock are visible to other CPUs at any particular time.
- READ_ONCE/WRITE_ONCE are not used, so the field is potentially
susceptible to harmful compiler optimisations,
Consequently, the usefulness of this field is unclear and we'd be better off
removing it and allowing architectures to implement raw_spin_is_contended() by
providing a definition of arch_spin_is_contended(), as they can when
CONFIG_GENERIC_LOCKBREAK=n.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1511894539-7988-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Users of ptr_ring expect that it's safe to give the
data structure a pointer and have it be available
to consumers, but that actually requires an smb_wmb
or a stronger barrier.
In absence of such barriers and on architectures that reorder writes,
consumer might read an un=initialized value from an skb pointer stored
in the skb array. This was observed causing crashes.
To fix, add memory barriers. The barrier we use is a wmb, the
assumption being that producers do not need to read the value so we do
not need to order these reads.
Reported-by: George Cherian <george.cherian@cavium.com>
Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Middle-layer code doing suspend-time optimizations for devices with
the DPM_FLAG_SMART_SUSPEND flag set (currently, the PCI bus type and
the ACPI PM domain) needs to make the core skip ->thaw_early and
->thaw callbacks for those devices in some cases and it sets the
power.direct_complete flag for them for this purpose.
However, it turns out that setting power.direct_complete outside of
the PM core is a bad idea as it triggers an excess invocation of
pm_runtime_enable() in device_resume().
For this reason, provide a helper to clear power.is_late_suspended
and power.is_suspended to be invoked by the middle-layer code in
question instead of setting power.direct_complete and make that code
call the new helper.
Fixes: c4b65157aeef (PCI / PM: Take SMART_SUSPEND driver flag into account)
Fixes: 05087360fd7a (ACPI / PM: Take SMART_SUSPEND driver flag into account)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
Pull KVM fixes from Radim Krčmář:
"ARM:
- A number of issues in the vgic discovered using SMATCH
- A bit one-off calculation in out stage base address mask (32-bit
and 64-bit)
- Fixes to single-step debugging instructions that trap for other
reasons such as MMMIO aborts
- Printing unavailable hyp mode as error
- Potential spinlock deadlock in the vgic
- Avoid calling vgic vcpu free more than once
- Broken bit calculation for big endian systems
s390:
- SPDX tags
- Fence storage key accesses from problem state
- Make sure that irq_state.flags is not used in the future
x86:
- Intercept port 0x80 accesses to prevent host instability (CVE)
- Use userspace FPU context for guest FPU (mainly an optimization
that fixes a double use of kernel FPU)
- Do not leak one page per module load
- Flush APIC page address cache from MMU invalidation notifiers"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits)
KVM: x86: fix APIC page invalidation
KVM: s390: Fix skey emulation permission check
KVM: s390: mark irq_state.flags as non-usable
KVM: s390: Remove redundant license text
KVM: s390: add SPDX identifiers to the remaining files
KVM: VMX: fix page leak in hardware_setup()
KVM: VMX: remove I/O port 0x80 bypass on Intel hosts
x86,kvm: remove KVM emulator get_fpu / put_fpu
x86,kvm: move qemu/guest FPU switching out to vcpu_run
KVM: arm/arm64: Fix broken GICH_ELRSR big endian conversion
KVM: arm/arm64: kvm_arch_destroy_vm cleanups
KVM: arm/arm64: Fix spinlock acquisition in vgic_set_owner
kvm: arm: don't treat unavailable HYP mode as an error
KVM: arm/arm64: Avoid attempting to load timer vgic state without a vgic
kvm: arm64: handle single-step of hyp emulated mmio instructions
kvm: arm64: handle single-step during SError exceptions
kvm: arm64: handle single-step of userspace mmio instructions
kvm: arm64: handle single-stepping trapped instructions
KVM: arm/arm64: debug: Introduce helper for single-step
arm: KVM: Fix VTTBR_BADDR_MASK BUG_ON off-by-one
...
|
|
Commit 4675ff05de2d ("kmemcheck: rip it out") has removed the code but
for some reason SPDX header stayed in place. This looks like a rebase
mistake in the mmotm tree or the merge mistake. Let's drop those
leftovers as well.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull networking fixes from David Miller:
1) CAN fixes from Martin Kelly (cancel URBs properly in all the CAN usb
drivers).
2) Revert returning -EEXIST from __dev_alloc_name() as this propagates
to userspace and broke some apps. From Johannes Berg.
3) Fix conn memory leaks and crashes in TIPC, from Jon Malloc and Cong
Wang.
4) Gianfar MAC can't do EEE so don't advertise it by default, from
Claudiu Manoil.
5) Relax strict netlink attribute validation, but emit a warning. From
David Ahern.
6) Fix regression in checksum offload of thunderx driver, from Florian
Westphal.
7) Fix UAPI bpf issues on s390, from Hendrik Brueckner.
8) New card support in iwlwifi, from Ihab Zhaika.
9) BBR congestion control bug fixes from Neal Cardwell.
10) Fix port stats in nfp driver, from Pieter Jansen van Vuuren.
11) Fix leaks in qualcomm rmnet, from Subash Abhinov Kasiviswanathan.
12) Fix DMA API handling in sh_eth driver, from Thomas Petazzoni.
13) Fix spurious netpoll warnings in bnxt_en, from Calvin Owens.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
net: mvpp2: fix the RSS table entry offset
tcp: evaluate packet losses upon RTT change
tcp: fix off-by-one bug in RACK
tcp: always evaluate losses in RACK upon undo
tcp: correctly test congestion state in RACK
bnxt_en: Fix sources of spurious netpoll warnings
tcp_bbr: reset long-term bandwidth sampling on loss recovery undo
tcp_bbr: reset full pipe detection on loss recovery undo
tcp_bbr: record "full bw reached" decision in new full_bw_reached bit
sfc: pass valid pointers from efx_enqueue_unwind
gianfar: Disable EEE autoneg by default
tcp: invalidate rate samples during SACK reneging
can: peak/pcie_fd: fix potential bug in restarting tx queue
can: usb_8dev: cancel urb on -EPIPE and -EPROTO
can: kvaser_usb: cancel urb on -EPIPE and -EPROTO
can: esd_usb2: cancel urb on -EPIPE and -EPROTO
can: ems_usb: cancel urb on -EPIPE and -EPROTO
can: mcba_usb: cancel urb on -EPROTO
usbnet: fix alignment for frames with no ethernet header
tcp: use current time in tcp_rcv_space_adjust()
...
|
|
Mark tcp_sock during a SACK reneging event and invalidate rate samples
while marked. Such rate samples may overestimate bw by including packets
that were SACKed before reneging.
< ack 6001 win 10000 sack 7001:38001
< ack 7001 win 0 sack 8001:38001 // Reneg detected
> seq 7001:8001 // RTO, SACK cleared.
< ack 38001 win 10000
In above example the rate sample taken after the last ack will count
7001-38001 as delivered while the actual delivery rate likely could
be much lower i.e. 7001-8001.
This patch adds a new field tcp_sock.sack_reneg and marks it when we
declare SACK reneging and entering TCP_CA_Loss, and unmarks it after
the last rate sample was taken before moving back to TCP_CA_Open. This
patch also invalidates rate samples taken while tcp_sock.is_sack_reneg
is set.
Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection")
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Daniel Borkmann says:
====================
pull-request: bpf 2017-12-06
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fixing broken uapi for BPF tracing programs for s390 and arm64
architectures due to pt_regs being in-kernel only, and not part
of uapi right now. A wrapper is added that exports pt_regs in
an asm-generic way. For arm64 this maps to existing user_pt_regs
structure and for s390 a user_pt_regs structure exporting the
beginning of pt_regs is added and uapi-exported, thus fixing the
BPF issues seen in perf (and BPF selftests), all from Hendrik.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The qmi_wwan minidriver support a 'raw-ip' mode where frames are
received without any ethernet header. This causes alignment issues
because the skbs allocated by usbnet are "IP aligned".
Fix by allowing minidrivers to disable the additional alignment
offset. This is implemented using a per-device flag, since the same
minidriver also supports 'ethernet' mode.
Fixes: 32f7adf633b9 ("net: qmi_wwan: support "raw IP" mode")
Reported-and-tested-by: Jay Foster <jay@systech.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
"Two fixes: use bool type consistently, plus a irq_matrix_available()
bugfix"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqdesc: Use bool return type instead of int
genirq/matrix: Fix the precedence fix for real
|
|
The coretemp driver build fails when CONFIG_PCI is not enabled because it
uses a function that does not have a stub for that config case, so add the
function stub.
../drivers/hwmon/coretemp.c: In function 'adjust_tjmax':
../drivers/hwmon/coretemp.c:250:9: error: implicit declaration of function 'pci_get_domain_bus_and_slot' [-Werror=implicit-function-declaration]
struct pci_dev *host_bridge = pci_get_domain_bus_and_slot(0, 0, devfn);
../drivers/hwmon/coretemp.c:250:32: warning: initialization makes pointer from integer without a cast [enabled by default]
struct pci_dev *host_bridge = pci_get_domain_bus_and_slot(0, 0, devfn);
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
[bhelgaas: identical patch also by Arnd Bergmann <arnd@arndb.de>]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
|
|
Thanks to the scripts/leaking_addresses.pl script, it was found that
some EFI values should not be readable by non-root users.
So make them root-only, and to do that, add a __ATTR_RO_MODE() macro to
make this easier, and use it in other places at the same time.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Cc: stable <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20171206095010.24170-2-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Alexander Potapenko reported use of uninitialized memory [1]
This happens when inserting a request socket into TCP ehash,
in __sk_nulls_add_node_rcu(), since sk_reuseport is not initialized.
Bug was added by commit d894ba18d4e4 ("soreuseport: fix ordering for
mixed v4/v6 sockets")
Note that d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6
ordering fix") missed the opportunity to get rid of
hlist_nulls_add_tail_rcu() :
Both UDP sockets and TCP/DCCP listeners no longer use
__sk_nulls_add_node_rcu() for their hash insertion.
Since all other sockets have unique 4-tuple, the reuseport status
has no special meaning, so we can always use hlist_nulls_add_head_rcu()
for them and save few cycles/instructions.
[1]
==================================================================
BUG: KMSAN: use of uninitialized memory in inet_ehash_insert+0xd40/0x1050
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #3288
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:16
dump_stack+0x185/0x1d0 lib/dump_stack.c:52
kmsan_report+0x13f/0x1c0 mm/kmsan/kmsan.c:1016
__msan_warning_32+0x69/0xb0 mm/kmsan/kmsan_instr.c:766
__sk_nulls_add_node_rcu ./include/net/sock.h:684
inet_ehash_insert+0xd40/0x1050 net/ipv4/inet_hashtables.c:413
reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:754
inet_csk_reqsk_queue_hash_add+0x1cc/0x300 net/ipv4/inet_connection_sock.c:765
tcp_conn_request+0x31e7/0x36f0 net/ipv4/tcp_input.c:6414
tcp_v4_conn_request+0x16d/0x220 net/ipv4/tcp_ipv4.c:1314
tcp_rcv_state_process+0x42a/0x7210 net/ipv4/tcp_input.c:5917
tcp_v4_do_rcv+0xa6a/0xcd0 net/ipv4/tcp_ipv4.c:1483
tcp_v4_rcv+0x3de0/0x4ab0 net/ipv4/tcp_ipv4.c:1763
ip_local_deliver_finish+0x6bb/0xcb0 net/ipv4/ip_input.c:216
NF_HOOK ./include/linux/netfilter.h:248
ip_local_deliver+0x3fa/0x480 net/ipv4/ip_input.c:257
dst_input ./include/net/dst.h:477
ip_rcv_finish+0x6fb/0x1540 net/ipv4/ip_input.c:397
NF_HOOK ./include/linux/netfilter.h:248
ip_rcv+0x10f6/0x15c0 net/ipv4/ip_input.c:488
__netif_receive_skb_core+0x36f6/0x3f60 net/core/dev.c:4298
__netif_receive_skb net/core/dev.c:4336
netif_receive_skb_internal+0x63c/0x19c0 net/core/dev.c:4497
napi_skb_finish net/core/dev.c:4858
napi_gro_receive+0x629/0xa50 net/core/dev.c:4889
e1000_receive_skb drivers/net/ethernet/intel/e1000/e1000_main.c:4018
e1000_clean_rx_irq+0x1492/0x1d30
drivers/net/ethernet/intel/e1000/e1000_main.c:4474
e1000_clean+0x43aa/0x5970 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
napi_poll net/core/dev.c:5500
net_rx_action+0x73c/0x1820 net/core/dev.c:5566
__do_softirq+0x4b4/0x8dd kernel/softirq.c:284
invoke_softirq kernel/softirq.c:364
irq_exit+0x203/0x240 kernel/softirq.c:405
exiting_irq+0xe/0x10 ./arch/x86/include/asm/apic.h:638
do_IRQ+0x15e/0x1a0 arch/x86/kernel/irq.c:263
common_interrupt+0x86/0x86
Fixes: d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
Fixes: d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6 ordering fix")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexander Potapenko <glider@google.com>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, every time a VCPU is scheduled out, the host kernel will
first save the guest FPU/xstate context, then load the qemu userspace
FPU context, only to then immediately save the qemu userspace FPU
context back to memory. When scheduling in a VCPU, the same extraneous
FPU loads and saves are done.
This could be avoided by moving from a model where the guest FPU is
loaded and stored with preemption disabled, to a model where the
qemu userspace FPU is swapped out for the guest FPU context for
the duration of the KVM_RUN ioctl.
This is done under the VCPU mutex, which is also taken when other
tasks inspect the VCPU FPU context, so the code should already be
safe for this change. That should come as no surprise, given that
s390 already has this optimization.
This can fix a bug where KVM calls get_user_pages while owning the
FPU, and the file system ends up requesting the FPU again:
[258270.527947] __warn+0xcb/0xf0
[258270.527948] warn_slowpath_null+0x1d/0x20
[258270.527951] kernel_fpu_disable+0x3f/0x50
[258270.527953] __kernel_fpu_begin+0x49/0x100
[258270.527955] kernel_fpu_begin+0xe/0x10
[258270.527958] crc32c_pcl_intel_update+0x84/0xb0
[258270.527961] crypto_shash_update+0x3f/0x110
[258270.527968] crc32c+0x63/0x8a [libcrc32c]
[258270.527975] dm_bm_checksum+0x1b/0x20 [dm_persistent_data]
[258270.527978] node_prepare_for_write+0x44/0x70 [dm_persistent_data]
[258270.527985] dm_block_manager_write_callback+0x41/0x50 [dm_persistent_data]
[258270.527988] submit_io+0x170/0x1b0 [dm_bufio]
[258270.527992] __write_dirty_buffer+0x89/0x90 [dm_bufio]
[258270.527994] __make_buffer_clean+0x4f/0x80 [dm_bufio]
[258270.527996] __try_evict_buffer+0x42/0x60 [dm_bufio]
[258270.527998] dm_bufio_shrink_scan+0xc0/0x130 [dm_bufio]
[258270.528002] shrink_slab.part.40+0x1f5/0x420
[258270.528004] shrink_node+0x22c/0x320
[258270.528006] do_try_to_free_pages+0xf5/0x330
[258270.528008] try_to_free_pages+0xe9/0x190
[258270.528009] __alloc_pages_slowpath+0x40f/0xba0
[258270.528011] __alloc_pages_nodemask+0x209/0x260
[258270.528014] alloc_pages_vma+0x1f1/0x250
[258270.528017] do_huge_pmd_anonymous_page+0x123/0x660
[258270.528021] handle_mm_fault+0xfd3/0x1330
[258270.528025] __get_user_pages+0x113/0x640
[258270.528027] get_user_pages+0x4f/0x60
[258270.528063] __gfn_to_pfn_memslot+0x120/0x3f0 [kvm]
[258270.528108] try_async_pf+0x66/0x230 [kvm]
[258270.528135] tdp_page_fault+0x130/0x280 [kvm]
[258270.528149] kvm_mmu_page_fault+0x60/0x120 [kvm]
[258270.528158] handle_ept_violation+0x91/0x170 [kvm_intel]
[258270.528162] vmx_handle_exit+0x1ca/0x1400 [kvm_intel]
No performance changes were detected in quick ping-pong tests on
my 4 socket system, which is expected since an FPU+xstate load is
on the order of 0.1us, while ping-ponging between CPUs is on the
order of 20us, and somewhat noisy.
Cc: stable@vger.kernel.org
Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Fixed a bug where reset_vcpu called put_fpu without preceding load_fpu,
which happened inside from KVM_CREATE_VCPU ioctl. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"A bunch of fixes for aacraid, a set of coherency fixes that only
affect non-coherent platforms and one coccinelle detected null check
after use"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: libsas: align sata_device's rps_resp on a cacheline
scsi: use dma_get_cache_alignment() as minimum DMA alignment
scsi: dma-mapping: always provide dma_get_cache_alignment
scsi: ufs: ufshcd: fix potential NULL pointer dereference in ufshcd_config_vreg
scsi: aacraid: Prevent crash in case of free interrupt during scsi EH path
scsi: aacraid: Perform initialization reset only once
scsi: aacraid: Check for PCI state of device in a generic way
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc fixes from Greg KH:
"Here are some small misc driver fixes for 4.15-rc3 to resolve reported
issues. Specifically these are:
- binder fix for a memory leak
- vpd driver fixes for a number of reported problems
- hyperv driver fix for memory accesses where it shouldn't be.
All of these have been in linux-next for a while. There's also one
more MAINTAINERS file update that came in today to get the Android
developer's emails correct, which is also in this pull request, that
was not in linux-next, but should not be an issue"
* tag 'char-misc-4.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
MAINTAINERS: update Android driver maintainers.
firmware: vpd: Fix platform driver and device registration/unregistration
firmware: vpd: Tie firmware kobject to device lifetime
firmware: vpd: Destroy vpd sections in remove function
hv: kvp: Avoid reading past allocated blocks from KVP file
Drivers: hv: vmbus: Fix a rescind issue
ANDROID: binder: fix transaction leak.
|