Age | Commit message (Collapse) | Author |
|
We can't allow the user to change multipath settings at runtime, as this
will create naming conflicts due to the different naming schemes used
for each mode.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If the command a separate metadata buffer attached, the request needs
to have the integrity flag set so the driver knows to map it.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When specifying same string type option several times,
current option parsing may cause memory leak. Hence,
call kfree for previous one in this case.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
syzbot is reporting use after free bug in debugfs_remove() [1].
This is because fault injection made memory allocation for
debugfs_create_file() from bdi_debug_register() from bdi_register_va()
fail and continued with setting WB_registered. But when debugfs_remove()
is called from debugfs_remove(bdi->debug_dir) from bdi_debug_unregister()
from bdi_unregister() from release_bdi() because WB_registered was set
by bdi_register_va(), IS_ERR_OR_NULL(bdi->debug_dir) == false despite
debugfs_remove(bdi->debug_dir) was already called from bdi_register_va().
Fix this by making IS_ERR_OR_NULL(bdi->debug_dir) == true.
[1] https://syzkaller.appspot.com/bug?id=5ab4efd91a96dcea9b68104f159adf4af2a6dfc1
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+049cb4ae097049dac137@syzkaller.appspotmail.com>
Fixes: 97f07697932e6faf ("bdi: convert bdi_debug_register to int")
Cc: weiping zhang <zhangweiping@didichuxing.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When adding rb-tree for TCP retransmit queue, we inadvertently broke
TCP autocorking.
tcp_should_autocork() should really check if the rtx queue is not empty.
Tested:
Before the fix :
$ nstat -n;./netperf -H 10.246.7.152 -Cc -- -m 500;nstat | grep AutoCork
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
540000 262144 500 10.00 2682.85 2.47 1.59 3.618 2.329
TcpExtTCPAutoCorking 33 0.0
// Same test, but forcing TCP_NODELAY
$ nstat -n;./netperf -H 10.246.7.152 -Cc -- -D -m 500;nstat | grep AutoCork
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET : nodelay
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
540000 262144 500 10.00 1408.75 2.44 2.96 6.802 8.259
TcpExtTCPAutoCorking 1 0.0
After the fix :
$ nstat -n;./netperf -H 10.246.7.152 -Cc -- -m 500;nstat | grep AutoCork
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
540000 262144 500 10.00 5472.46 2.45 1.43 1.761 1.027
TcpExtTCPAutoCorking 361293 0.0
// With TCP_NODELAY option
$ nstat -n;./netperf -H 10.246.7.152 -Cc -- -D -m 500;nstat | grep AutoCork
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET : nodelay
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
540000 262144 500 10.00 5454.96 2.46 1.63 1.775 1.174
TcpExtTCPAutoCorking 315448 0.0
Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Michael Wenig <mwenig@vmware.com>
Tested-by: Michael Wenig <mwenig@vmware.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Michael Wenig <mwenig@vmware.com>
Tested-by: Michael Wenig <mwenig@vmware.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
syzbot/KMSAN reported an uninit-value in put_cmsg(), originating
from rds_cmsg_recv().
Simply clear the structure, since we have holes there, or since
rx_traces might be smaller than RDS_MSG_RX_DGRAM_TRACE_MAX.
BUG: KMSAN: uninit-value in copy_to_user include/linux/uaccess.h:184 [inline]
BUG: KMSAN: uninit-value in put_cmsg+0x600/0x870 net/core/scm.c:242
CPU: 0 PID: 4459 Comm: syz-executor582 Not tainted 4.16.0+ #87
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x185/0x1d0 lib/dump_stack.c:53
kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
kmsan_internal_check_memory+0x135/0x1e0 mm/kmsan/kmsan.c:1157
kmsan_copy_to_user+0x69/0x160 mm/kmsan/kmsan.c:1199
copy_to_user include/linux/uaccess.h:184 [inline]
put_cmsg+0x600/0x870 net/core/scm.c:242
rds_cmsg_recv net/rds/recv.c:570 [inline]
rds_recvmsg+0x2db5/0x3170 net/rds/recv.c:657
sock_recvmsg_nosec net/socket.c:803 [inline]
sock_recvmsg+0x1d0/0x230 net/socket.c:810
___sys_recvmsg+0x3fb/0x810 net/socket.c:2205
__sys_recvmsg net/socket.c:2250 [inline]
SYSC_recvmsg+0x298/0x3c0 net/socket.c:2262
SyS_recvmsg+0x54/0x80 net/socket.c:2257
do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Fixes: 3289025aedc0 ("RDS: add receive message trace used by application")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: linux-rdma <linux-rdma@vger.kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
syzbot is reporting hung tasks at wait_on_bit(WB_shutting_down) in
wb_shutdown() [1]. This seems to be because commit 5318ce7d46866e1d ("bdi:
Shutdown writeback on all cgwbs in cgwb_bdi_destroy()") forgot to call
wake_up_bit(WB_shutting_down) after clear_bit(WB_shutting_down).
Introduce a helper function clear_and_wake_up_bit() and use it, in order
to avoid similar errors in future.
[1] https://syzkaller.appspot.com/bug?id=b297474817af98d5796bc544e1bb806fc3da0e5e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+c0cf869505e03bdf1a24@syzkaller.appspotmail.com>
Fixes: 5318ce7d46866e1d ("bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()")
Cc: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The USB_DEVICE_INTERFACE_NUMBER matching macro assumes that
the { vendorid, productid, interfacenumber } set uniquely
identifies one specific function. This has proven to fail
for some configurable devices. One example is the Quectel
EM06/EP06 where the same interface number can be either
QMI or MBIM, without the device ID changing either.
Fix by requiring the vendor-specific class for interface number
based matching. Functions of other classes can and should use
class based matching instead.
Fixes: 03304bcb5ec4 ("net: qmi_wwan: use fixed interface number matching")
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We get a build error when compiling the iommu driver without CONFIG_OF:
drivers/iommu/rockchip-iommu.c: In function 'rk_iommu_of_xlate':
drivers/iommu/rockchip-iommu.c:1101:2: error: implicit declaration of function 'of_dev_put'; did you mean 'of_node_put'? [-Werror=implicit-function-declaration]
This replaces the of_dev_put() with the equivalent
platform_device_put().
Fixes: 5fd577c3eac3 ("iommu/rockchip: Use OF_IOMMU to attach devices automatically")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
It is possible that multiple I/O requests hits on failed cache device or
backing device, therefore it is quite common that CACHE_SET_IO_DISABLE is
set already when a task tries to set the bit from bch_cache_set_error().
Currently the message "CACHE_SET_IO_DISABLE already set" is printed by
pr_warn(), which might mislead users to think a serious fault happens in
source code.
This patch uses pr_info() to print the information in such situation,
avoid extra worries. This information is helpful to understand bcache
behavior in cache device failures, so I still keep them in source code.
Fixes: 771f393e8ffc9 ("bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 7e027ca4b534b ("bcache: add stop_when_cache_set_failed option to
backing device") adds stop_when_cache_set_failed option and stops bcache
device if stop_when_cache_set_failed is auto and there is dirty data on
broken cache device. There might exists a small time gap that the cache
set is released and set to NULL but bcache device is not released yet
(because they are released in parallel). During this time gap, dc->c is
NULL so CACHE_SET_IO_DISABLE won't be checked, and dc->io_disable is still
false, so new coming I/O requests will be accepted and directly go into
backing device as no cache set attached to. If there is dirty data on
cache device, this behavior may introduce potential inconsistent data.
This patch sets dc->io_disable to true before calling bcache_device_stop()
to make sure the backing device will reject new coming I/O request as
well, so even in the small time gap no I/O will directly go into backing
device to corrupt data consistency.
Fixes: 7e027ca4b534b ("bcache: add stop_when_cache_set_failed option to backing device")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When CACHE_SET_IO_DISABLE is set on cache set flags, bcache allocator
thread routine bch_allocator_thread() may stop the while-loops and
exit. Then it is possible to observe the following kernel oops message,
[ 631.068366] bcache: bch_btree_insert() error -5
[ 631.069115] bcache: cached_dev_detach_finish() Caching disabled for sdf
[ 631.070220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 631.070250] PGD 0 P4D 0
[ 631.070261] Oops: 0002 [#1] SMP PTI
[snipped]
[ 631.070578] Workqueue: events cache_set_flush [bcache]
[ 631.070597] RIP: 0010:exit_creds+0x1b/0x50
[ 631.070610] RSP: 0018:ffffc9000705fe08 EFLAGS: 00010246
[ 631.070626] RAX: 0000000000000001 RBX: ffff880a622ad300 RCX: 000000000000000b
[ 631.070645] RDX: 0000000000000601 RSI: 000000000000000c RDI: 0000000000000000
[ 631.070663] RBP: ffff880a622ad300 R08: ffffea00190c66e0 R09: 0000000000000200
[ 631.070682] R10: ffff880a48123000 R11: ffff880000000000 R12: 0000000000000000
[ 631.070700] R13: ffff880a4b160e40 R14: ffff880a4b160000 R15: 0ffff880667e2530
[ 631.070719] FS: 0000000000000000(0000) GS:ffff880667e00000(0000) knlGS:0000000000000000
[ 631.070740] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 631.070755] CR2: 0000000000000000 CR3: 000000000200a001 CR4: 00000000003606e0
[ 631.070774] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 631.070793] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 631.070811] Call Trace:
[ 631.070828] __put_task_struct+0x55/0x160
[ 631.070845] kthread_stop+0xee/0x100
[ 631.070863] cache_set_flush+0x11d/0x1a0 [bcache]
[ 631.070879] process_one_work+0x146/0x340
[ 631.070892] worker_thread+0x47/0x3e0
[ 631.070906] kthread+0xf5/0x130
[ 631.070917] ? max_active_store+0x60/0x60
[ 631.070930] ? kthread_bind+0x10/0x10
[ 631.070945] ret_from_fork+0x35/0x40
[snipped]
[ 631.071017] RIP: exit_creds+0x1b/0x50 RSP: ffffc9000705fe08
[ 631.071033] CR2: 0000000000000000
[ 631.071045] ---[ end trace 011c63a24b22c927 ]---
[ 631.071085] bcache: bcache_device_free() bcache0 stopped
The reason is when cache_set_flush() tries to call kthread_stop() to stop
allocator thread, but it exits already due to cache device I/O errors.
This patch adds wait_for_kthread_stop() at tail of bch_allocator_thread(),
to prevent the thread routine exiting directly. Then the allocator thread
can be blocked at wait_for_kthread_stop() and wait for cache_set_flush()
to stop it by calling kthread_stop().
changelog:
v3: add Reviewed-by from Hannnes.
v2: not directly return from allocator_wait(), move 'return 0' to tail of
bch_allocator_thread().
v1: initial version.
Fixes: 771f393e8ffc ("bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit c7b7bd07404c5 ("bcache: add io_disable to struct cached_dev")
counts backing device I/O requets and set dc->io_disable to true if error
counters exceeds dc->io_error_limit. But it only counts I/O errors for
regular I/O request, neglects errors of write back I/Os when backing device
is offline.
This patch counts the errors of writeback I/Os, in dirty_endio() if
bio->bi_status is not 0, it means error happens when writing dirty keys
to backing device, then bch_count_backing_io_errors() is called.
By this fix, even there is no reqular I/O request coming, if writeback I/O
errors exceed dc->io_error_limit, the bcache device may still be stopped
for the broken backing device.
Fixes: c7b7bd07404c5 ("bcache: add io_disable to struct cached_dev")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit c7b7bd07404c5 ("bcache: add io_disable to struct cached_dev") tries
to stop bcache device by calling bcache_device_stop() when too many I/O
errors happened on backing device. But if there is internal I/O happening
on cache device (writeback scan, garbage collection, etc), a regular I/O
request triggers the internal I/Os may still holds a refcount of dc->count,
and the refcount may only be dropped after the internal I/O stopped.
By this patch, bch_cached_dev_error() will check if the backing device is
attached to a cache set, if yes that CACHE_SET_IO_DISABLE will be set to
flags of this cache set. Then internal I/Os on cache device will be
rejected and stopped immediately, and the bcache device can be stopped.
For people who are not familiar with the interesting refcount dependance,
let me explain a bit more how the fix works. Example the writeback thread
will scan cache device for dirty data writeback purpose. Before it stopps,
it holds a refcount of dc->count. When CACHE_SET_IO_DISABLE bit is set,
the internal I/O will stopped and the while-loop in bch_writeback_thread()
quits and calls cached_dev_put() to drop dc->count. If this is the last
refcount to drop, then cached_dev_detach_finish() will be called. In this
call back function, in turn closure_put(dc->disk.cl) is called to drop a
refcount of closure dc->disk.cl. If this is the last refcount of this
closure to drop, then cached_dev_flush() will be called. Then the cached
device is freed. So if CACHE_SET_IO_DISABLE is not set, the bache device
can not be stopped until all inernal cache device I/O stopped. For large
size cache device, and writeback thread competes locks with gc thread,
there might be a quite long time to wait.
Fixes: c7b7bd07404c5 ("bcache: add io_disable to struct cached_dev")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Current code uses bdevname() or bio_devname() to reference gendisk
disk name when bcache needs to display the disk names in kernel message.
It was safe before bcache device failure handling patch set merged in,
because when devices are failed, there was deadlock to prevent bcache
printing error messages with gendisk disk name. But after the failure
handling patch set merged, the deadlock is fixed, so it is possible
that the gendisk structure bdev->hd_disk is released when bdevname() is
called to reference bdev->bd_disk->disk_name[]. This is why I receive
bug report of NULL pointers deference panic.
This patch stores gendisk disk name in a buffer inside struct cache and
struct cached_dev, then print out the offline device name won't reference
bdev->hd_disk anymore. And this patch also avoids extra function calls
of bdevname() and bio_devnmae().
Changelog:
v3, add Reviewed-by from Hannes.
v2, call bdevname() earlier in register_bdev()
v1, first version with segguestion from Junhui Tang.
Fixes: c7b7bd07404c5 ("bcache: add io_disable to struct cached_dev")
Fixes: 5138ac6748e38 ("bcache: fix misleading error message in bch_count_io_errors()")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A misaligned address is only worth a warning, and not
stopping the while execution path with a BUG_ON().
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
It allows to flush more than 4GB of device TLBs. So the mask should be
64bit wide. UBSAN captured this fault as below.
[ 3.760024] ================================================================================
[ 3.768440] UBSAN: Undefined behaviour in drivers/iommu/dmar.c:1348:3
[ 3.774864] shift exponent 64 is too large for 32-bit type 'int'
[ 3.780853] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G U 4.17.0-rc1+ #89
[ 3.788661] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.2.8 01/26/2016
[ 3.796034] Call Trace:
[ 3.798472] <IRQ>
[ 3.800479] dump_stack+0x90/0xfb
[ 3.803787] ubsan_epilogue+0x9/0x40
[ 3.807353] __ubsan_handle_shift_out_of_bounds+0x10e/0x170
[ 3.812916] ? qi_flush_dev_iotlb+0x124/0x180
[ 3.817261] qi_flush_dev_iotlb+0x124/0x180
[ 3.821437] iommu_flush_dev_iotlb+0x94/0xf0
[ 3.825698] iommu_flush_iova+0x10b/0x1c0
[ 3.829699] ? fq_ring_free+0x1d0/0x1d0
[ 3.833527] iova_domain_flush+0x25/0x40
[ 3.837448] fq_flush_timeout+0x55/0x160
[ 3.841368] ? fq_ring_free+0x1d0/0x1d0
[ 3.845200] ? fq_ring_free+0x1d0/0x1d0
[ 3.849034] call_timer_fn+0xbe/0x310
[ 3.852696] ? fq_ring_free+0x1d0/0x1d0
[ 3.856530] run_timer_softirq+0x223/0x6e0
[ 3.860625] ? sched_clock+0x5/0x10
[ 3.864108] ? sched_clock+0x5/0x10
[ 3.867594] __do_softirq+0x1b5/0x6f5
[ 3.871250] irq_exit+0xd4/0x130
[ 3.874470] smp_apic_timer_interrupt+0xb8/0x2f0
[ 3.879075] apic_timer_interrupt+0xf/0x20
[ 3.883159] </IRQ>
[ 3.885255] RIP: 0010:poll_idle+0x60/0xe7
[ 3.889252] RSP: 0018:ffffb1b201943e30 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[ 3.896802] RAX: 0000000080200000 RBX: 000000000000008e RCX: 000000000000001f
[ 3.903918] RDX: 0000000000000000 RSI: 000000002819aa06 RDI: 0000000000000000
[ 3.911031] RBP: ffff9e93c6b33280 R08: 00000010f717d567 R09: 000000000010d205
[ 3.918146] R10: ffffb1b201943df8 R11: 0000000000000001 R12: 00000000e01b169d
[ 3.925260] R13: 0000000000000000 R14: ffffffffb12aa400 R15: 0000000000000000
[ 3.932382] cpuidle_enter_state+0xb4/0x470
[ 3.936558] do_idle+0x222/0x310
[ 3.939779] cpu_startup_entry+0x78/0x90
[ 3.943693] start_secondary+0x205/0x2e0
[ 3.947607] secondary_startup_64+0xa5/0xb0
[ 3.951783] ================================================================================
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
This pretty much reverts commit 273df9635385 ("iommu/dma: Make PCI
window reservation generic") by moving the PCI window region
reservation back into the dma specific path so that these regions
doesn't get exposed via the IOMMU API interface. With this change,
the vfio interface will report only iommu specific reserved regions
to the user space.
Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Fixes: 273df9635385 ('iommu/dma: Make PCI window reservation generic')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
iommu clocks are optional, so the driver should not fail if they are not
present. Instead just set the number of clocks to 0, which the clk-blk APIs
can handle just fine.
Fixes: f2e3a5f557ad ("iommu/rockchip: Control clocks needed to access the IOMMU")
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
The newly introduced lock is only used when CONFIG_IRQ_REMAP is enabled:
drivers/iommu/amd_iommu.c:86:24: error: 'iommu_table_lock' defined but not used [-Werror=unused-variable]
static DEFINE_SPINLOCK(iommu_table_lock);
This moves the definition next to the user, within the #ifdef protected
section of the file.
Fixes: ea6166f4b83e ("iommu/amd: Split irq_lookup_table out of the amd_iommu_devtable_lock")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
It was noticed that the IRTE configured for guest OS kernel
was over-written while the guest was running. As a result,
vt-d Posted Interrupts configured for the guest are not being
delivered directly, and instead bounces off the host. Every
interrupt delivery takes a VM Exit.
It was noticed that the following stack is doing the over-write:
[ 147.463177] modify_irte+0x171/0x1f0
[ 147.463405] intel_ir_set_affinity+0x5c/0x80
[ 147.463641] msi_domain_set_affinity+0x32/0x90
[ 147.463881] irq_do_set_affinity+0x37/0xd0
[ 147.464125] irq_set_affinity_locked+0x9d/0xb0
[ 147.464374] __irq_set_affinity+0x42/0x70
[ 147.464627] write_irq_affinity.isra.5+0xe1/0x110
[ 147.464895] proc_reg_write+0x38/0x70
[ 147.465150] __vfs_write+0x36/0x180
[ 147.465408] ? handle_mm_fault+0xdf/0x200
[ 147.465671] ? _cond_resched+0x15/0x30
[ 147.465936] vfs_write+0xad/0x1a0
[ 147.466204] SyS_write+0x52/0xc0
[ 147.466472] do_syscall_64+0x74/0x1a0
[ 147.466744] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
reversing the sense of force check in intel_ir_reconfigure_irte()
restores proper posted interrupt functionality
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Fixes: d491bdff888e ('iommu/vt-d: Reevaluate vector configuration on activate()')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
When speculation flaw mitigations are opt-in (via prctl), using seccomp
will automatically opt-in to these protections, since using seccomp
indicates at least some level of sandboxing is desired.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
As done with seccomp and no_new_privs, also show speculation flaw
mitigation state in /proc/$pid/status.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than
current.
This is needed both for /proc/$pid/status queries and for seccomp (since
thread-syncing can trigger seccomp in non-current threads).
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Add prctl based control for Speculative Store Bypass mitigation and make it
the default mitigation for Intel and AMD.
Andi Kleen provided the following rationale (slightly redacted):
There are multiple levels of impact of Speculative Store Bypass:
1) JITed sandbox.
It cannot invoke system calls, but can do PRIME+PROBE and may have call
interfaces to other code
2) Native code process.
No protection inside the process at this level.
3) Kernel.
4) Between processes.
The prctl tries to protect against case (1) doing attacks.
If the untrusted code can do random system calls then control is already
lost in a much worse way. So there needs to be system call protection in
some way (using a JIT not allowing them or seccomp). Or rather if the
process can subvert its environment somehow to do the prctl it can already
execute arbitrary code, which is much worse than SSB.
To put it differently, the point of the prctl is to not allow JITed code
to read data it shouldn't read from its JITed sandbox. If it already has
escaped its sandbox then it can already read everything it wants in its
address space, and do much worse.
The ability to control Speculative Store Bypass allows to enable the
protection selectively without affecting overall system performance.
Based on an initial patch from Tim Chen. Completely rewritten.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
The Speculative Store Bypass vulnerability can be mitigated with the
Reduced Data Speculation (RDS) feature. To allow finer grained control of
this eventually expensive mitigation a per task mitigation control is
required.
Add a new TIF_RDS flag and put it into the group of TIF flags which are
evaluated for mismatch in switch_to(). If these bits differ in the previous
and the next task, then the slow path function __switch_to_xtra() is
invoked. Implement the TIF_RDS dependent mitigation control in the slow
path.
If the prctl for controlling Speculative Store Bypass is disabled or no
task uses the prctl then there is no overhead in the switch_to() fast
path.
Update the KVM related speculation control functions to take TID_RDS into
account as well.
Based on a patch from Tim Chen. Completely rewritten.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
Add two new prctls to control aspects of speculation related vulnerabilites
and their mitigations to provide finer grained control over performance
impacting mitigations.
PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature
which is selected with arg2 of prctl(2). The return value uses bit 0-2 with
the following meaning:
Bit Define Description
0 PR_SPEC_PRCTL Mitigation can be controlled per task by
PR_SET_SPECULATION_CTRL
1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is
disabled
2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is
enabled
If all bits are 0 the CPU is not affected by the speculation misfeature.
If PR_SPEC_PRCTL is set, then the per task control of the mitigation is
available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation
misfeature will fail.
PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which
is selected by arg2 of prctl(2) per task. arg3 is used to hand in the
control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE.
The common return values are:
EINVAL prctl is not implemented by the architecture or the unused prctl()
arguments are not 0
ENODEV arg2 is selecting a not supported speculation misfeature
PR_SET_SPECULATION_CTRL has these additional return values:
ERANGE arg3 is incorrect, i.e. it's not either PR_SPEC_ENABLE or PR_SPEC_DISABLE
ENXIO prctl control of the selected speculation misfeature is disabled
The first supported controlable speculation misfeature is
PR_SPEC_STORE_BYPASS. Add the define so this can be shared between
architectures.
Based on an initial patch from Tim Chen and mostly rewritten.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
Having everything in nospec-branch.h creates a hell of dependencies when
adding the prctl based switching mechanism. Move everything which is not
required in nospec-branch.h to spec-ctrl.h and fix up the includes in the
relevant files.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
|
|
Expose the CPUID.7.EDX[31] bit to the guest, and also guard against various
combinations of SPEC_CTRL MSR values.
The handling of the MSR (to take into account the host value of SPEC_CTRL
Bit(2)) is taken care of in patch:
KVM/SVM/VMX/x86/spectre_v2: Support the combination of guest and host IBRS
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
|
|
AMD does not need the Speculative Store Bypass mitigation to be enabled.
The parameters for this are already available and can be done via MSR
C001_1020. Each family uses a different bit in that MSR for this.
[ tglx: Expose the bit mask via a variable and move the actual MSR fiddling
into the bugs code as that's the right thing to do and also required
to prepare for dynamic enable/disable ]
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
|
|
Intel and AMD SPEC_CTRL (0x48) MSR semantics may differ in the
future (or in fact use different MSRs for the same functionality).
As such a run-time mechanism is required to whitelist the appropriate MSR
values.
[ tglx: Made the variable __ro_after_init ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
|
|
Intel CPUs expose methods to:
- Detect whether RDS capability is available via CPUID.7.0.EDX[31],
- The SPEC_CTRL MSR(0x48), bit 2 set to enable RDS.
- MSR_IA32_ARCH_CAPABILITIES, Bit(4) no need to enable RRS.
With that in mind if spec_store_bypass_disable=[auto,on] is selected set at
boot-time the SPEC_CTRL MSR to enable RDS if the platform requires it.
Note that this does not fix the KVM case where the SPEC_CTRL is exposed to
guests which can muck with it, see patch titled :
KVM/SVM/VMX/x86/spectre_v2: Support the combination of guest and host IBRS.
And for the firmware (IBRS to be set), see patch titled:
x86/spectre_v2: Read SPEC_CTRL MSR during boot and re-use reserved bits
[ tglx: Distangled it from the intel implementation and kept the call order ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
|
|
Contemporary high performance processors use a common industry-wide
optimization known as "Speculative Store Bypass" in which loads from
addresses to which a recent store has occurred may (speculatively) see an
older value. Intel refers to this feature as "Memory Disambiguation" which
is part of their "Smart Memory Access" capability.
Memory Disambiguation can expose a cache side-channel attack against such
speculatively read values. An attacker can create exploit code that allows
them to read memory outside of a sandbox environment (for example,
malicious JavaScript in a web page), or to perform more complex attacks
against code running within the same privilege level, e.g. via the stack.
As a first step to mitigate against such attacks, provide two boot command
line control knobs:
nospec_store_bypass_disable
spec_store_bypass_disable=[off,auto,on]
By default affected x86 processors will power on with Speculative
Store Bypass enabled. Hence the provided kernel parameters are written
from the point of view of whether to enable a mitigation or not.
The parameters are as follows:
- auto - Kernel detects whether your CPU model contains an implementation
of Speculative Store Bypass and picks the most appropriate
mitigation.
- on - disable Speculative Store Bypass
- off - enable Speculative Store Bypass
[ tglx: Reordered the checks so that the whole evaluation is not done
when the CPU does not support RDS ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
|
|
Add the CPU feature bit CPUID.7.0.EDX[31] which indicates whether the CPU
supports Reduced Data Speculation.
[ tglx: Split it out from a later patch ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
|
|
Add the sysfs file for the new vulerability. It does not do much except
show the words 'Vulnerable' for recent x86 cores.
Intel cores prior to family 6 are known not to be vulnerable, and so are
some Atoms and some Xeon Phi.
It assumes that older Cyrix, Centaur, etc. cores are immune.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
|
|
A guest may modify the SPEC_CTRL MSR from the value used by the
kernel. Since the kernel doesn't use IBRS, this means a value of zero is
what is needed in the host.
But the 336996-Speculative-Execution-Side-Channel-Mitigations.pdf refers to
the other bits as reserved so the kernel should respect the boot time
SPEC_CTRL value and use that.
This allows to deal with future extensions to the SPEC_CTRL interface if
any at all.
Note: This uses wrmsrl() instead of native_wrmsl(). I does not make any
difference as paravirt will over-write the callq *0xfff.. with the wrmsrl
assembler code.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
|
|
The 336996-Speculative-Execution-Side-Channel-Mitigations.pdf refers to all
the other bits as reserved. The Intel SDM glossary defines reserved as
implementation specific - aka unknown.
As such at bootup this must be taken it into account and proper masking for
the bits in use applied.
A copy of this document is available at
https://bugzilla.kernel.org/show_bug.cgi?id=199511
[ tglx: Made x86_spec_ctrl_base __ro_after_init ]
Suggested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
|
|
Those SysFS functions have a similar preamble, as such make common
code to handle them.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
|
|
Combine the various logic which goes through all those
x86_cpu_id matching structures in one function.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
|
|
The macro is not type safe and I did look for why that "g" constraint for
the asm doesn't work: it's because the asm is more fundamentally wrong.
It does
movl %[val], %%eax
but "val" isn't a 32-bit value, so then gcc will pass it in a register,
and generate code like
movl %rsi, %eax
and gas will complain about a nonsensical 'mov' instruction (it's moving a
64-bit register to a 32-bit one).
Passing it through memory will just hide the real bug - gcc still thinks
the memory location is 64-bit, but the "movl" will only load the first 32
bits and it all happens to work because x86 is little-endian.
Convert it to a type safe inline function with a little trick which hands
the feature into the ALTERNATIVE macro.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
|
|
Need to configure PHY interrupt as active low for P3310 Tegra186
platform otherwise it results in spurious interrupts.
This issue wasn't seen before because the generic PHY driver without
interrupt support was used.
Signed-off-by: Bhadram Varka <vbhadram@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
|
|
Even with the wait-loop fixed, there is a further issue with
kthread_parkme(). Upon hotplug, when we do takedown_cpu(),
smpboot_park_threads() can return before all those threads are in fact
blocked, due to the placement of the complete() in __kthread_parkme().
When that happens, sched_cpu_dying() -> migrate_tasks() can end up
migrating such a still runnable task onto another CPU.
Normally the task will have hit schedule() and gone to sleep by the
time we do kthread_unpark(), which will then do __kthread_bind() to
re-bind the task to the correct CPU.
However, when we loose the initial TASK_PARKED store to the concurrent
wakeup issue described previously, do the complete(), get migrated, it
is possible to either:
- observe kthread_unpark()'s clearing of SHOULD_PARK and terminate
the park and set TASK_RUNNING, or
- __kthread_bind()'s wait_task_inactive() to observe the competing
TASK_RUNNING store.
Either way the WARN() in __kthread_bind() will trigger and fail to
correctly set the CPU affinity.
Fix this by only issuing the complete() when the kthread has scheduled
out. This does away with all the icky 'still running' nonsense.
The alternative is to promote TASK_PARKED to a special state, this
guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
and we'll end up doing the right thing, but this preserves the whole
icky business of potentially migating the still runnable thing.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Gaurav reported a problem with __kthread_parkme() where a concurrent
try_to_wake_up() could result in competing stores to ->state which,
when the TASK_PARKED store got lost bad things would happen.
The comment near set_current_state() actually mentions this competing
store, but only mentions the case against TASK_RUNNING. This same
store, with different timing, can happen against a subsequent !RUNNING
store.
This normally is not a problem, because as per that same comment, the
!RUNNING state store is inside a condition based wait-loop:
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (!need_sleep)
break;
schedule();
}
__set_current_state(TASK_RUNNING);
If we loose the (first) TASK_UNINTERRUPTIBLE store to a previous
(concurrent) wakeup, the schedule() will NO-OP and we'll go around the
loop once more.
The problem here is that the TASK_PARKED store is not inside the
KTHREAD_SHOULD_PARK condition wait-loop.
There is a genuine issue with sleeps that do not have a condition;
this is addressed in a subsequent patch.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
With commit:
31e77c93e432 ("sched/fair: Update blocked load when newly idle")
... we release the rq->lock when updating blocked load of idle CPUs.
This opens a time window during which another CPU can add a task to this
CPU's cfs_rq.
The check for newly added task of idle_balance() is not in the common path.
Move the out label to include this check.
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
Link: http://lkml.kernel.org/r/20180426103133.GA6953@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Matt reported the following deadlock:
CPU0 CPU1
schedule(.prev=migrate/0) <fault>
pick_next_task() ...
idle_balance() migrate_swap()
active_balance() stop_two_cpus()
spin_lock(stopper0->lock)
spin_lock(stopper1->lock)
ttwu(migrate/0)
smp_cond_load_acquire() -- waits for schedule()
stop_one_cpu(1)
spin_lock(stopper1->lock) -- waits for stopper lock
Fix this deadlock by taking the wakeups out from under stopper->lock.
This allows the active_balance() to queue the stop work and finish the
context switch, which in turn allows the wakeup from migrate_swap() to
observe the context and complete the wakeup.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180420095005.GH4064@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various fixes in tracing:
- Tracepoints should not give warning on OOM failures
- Use special field for function pointer in trace event
- Fix igrab issues in uprobes
- Fixes to the new histogram triggers"
* tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracepoint: Do not warn on ENOMEM
tracing: Add field modifier parsing hist error for hist triggers
tracing: Add field parsing hist error for hist triggers
tracing: Restore proper field flag printing when displaying triggers
tracing: initcall: Ordered comparison of function pointers
tracing: Remove igrab() iput() call from uprobes.c
tracing: Fix bad use of igrab in trace_uprobe.c
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
"Just a few driver fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: atmel_mxt_ts - add missing compatible strings to OF device table
Input: atmel_mxt_ts - fix the firmware update
Input: atmel_mxt_ts - add touchpad button mapping for Samsung Chromebook Pro
MAINTAINERS: Rakesh Iyer can't be reached anymore
Input: hideep_ts - fix a typo in Kconfig
Input: alps - fix reporting pressure of v3 trackstick
Input: leds - fix out of bound access
Input: synaptics-rmi4 - fix an unchecked out of memory error path
|
|
Allow some non-cached routes to use non-expired fnhe:
1. ip_del_fnhe: moved above and now called by find_exception.
The 4.5+ commit deed49df7390 expires fnhe only when caching
routes. Change that to:
1.1. use fnhe for non-cached local output routes, with the help
from (2)
1.2. allow __mkroute_input to detect expired fnhe (outdated
fnhe_gw, for example) when do_cache is false, eg. when itag!=0
for unicast destinations.
2. __mkroute_output: keep fi to allow local routes with orig_oif != 0
to use fnhe info even when the new route will not be cached into fnhe.
After commit 839da4d98960 ("net: ipv4: set orig_oif based on fib
result for local traffic") it means all local routes will be affected
because they are not cached. This change is used to solve a PMTU
problem with IPVS (and probably Netfilter DNAT) setups that redirect
local clients from target local IP (local route to Virtual IP)
to new remote IP target, eg. IPVS TUN real server. Loopback has
64K MTU and we need to create fnhe on the local route that will
keep the reduced PMTU for the Virtual IP. Without this change
fnhe_pmtu is updated from ICMP but never exposed to non-cached
local routes. This includes routes with flowi4_oif!=0 for 4.6+ and
with flowi4_oif=any for 4.14+).
3. update_or_create_fnhe: make sure fnhe_expires is not 0 for
new entries
Fixes: 839da4d98960 ("net: ipv4: set orig_oif based on fib result for local traffic")
Fixes: d6d5e999e5df ("route: do not cache fib route info on local routes with oif")
Fixes: deed49df7390 ("route: check and remove route cache when we get route")
Cc: David Ahern <dsahern@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Three small bug fixes: an illegally overlapping memcmp in target code,
a potential infinite loop in isci under certain rare phy conditions
and an ATA queue depth (performance) correction for storvsc"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target: Fix fortify_panic kernel exception
scsi: isci: Fix infinite loop in while loop
scsi: storvsc: Set up correct queue depth values for IDE devices
|
|
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
vc4: Fix bo refcounts during async commits (Boris)
vga-dac: Fix edid memory leak (Sean)
Cc: Boris Brezillon <boris.brezillon@bootlin.com>
Cc: Sean Paul <seanpaul@chromium.org>
* tag 'drm-misc-fixes-2018-05-02' of git://anongit.freedesktop.org/drm/drm-misc:
drm/bridge: vga-dac: Fix edid memory leak
drm/vc4: Make sure vc4_bo_{inc,dec}_usecnt() calls are balanced
|