linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2018-08-07	blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait	Anchal Agarwal
	I am currently running a large bare metal instance (i3.metal) on EC2 with 72 cores, 512GB of RAM and NVME drives, with a 4.18 kernel. I have a workload that simulates a database workload and I am running into lockup issues when writeback throttling is enabled,with the hung task detector also kicking in. Crash dumps show that most CPUs (up to 50 of them) are all trying to get the wbt wait queue lock while trying to add themselves to it in __wbt_wait (see stack traces below). [ 0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1 [ 0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017 [ 0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000 [ 0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0 [ 0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046 [ 0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00 [ 0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68 [ 0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000 [ 0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002 [ 0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 [ 0.948132] FS: 0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000 [ 0.948134] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0 [ 0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 0.948138] Call Trace: [ 0.948139] <IRQ> [ 0.948142] do_raw_spin_lock+0xad/0xc0 [ 0.948145] _raw_spin_lock_irqsave+0x44/0x4b [ 0.948149] ? __wake_up_common_lock+0x53/0x90 [ 0.948150] __wake_up_common_lock+0x53/0x90 [ 0.948155] wbt_done+0x7b/0xa0 [ 0.948158] blk_mq_free_request+0xb7/0x110 [ 0.948161] __blk_mq_complete_request+0xcb/0x140 [ 0.948166] nvme_process_cq+0xce/0x1a0 [nvme] [ 0.948169] nvme_irq+0x23/0x50 [nvme] [ 0.948173] __handle_irq_event_percpu+0x46/0x300 [ 0.948176] handle_irq_event_percpu+0x20/0x50 [ 0.948179] handle_irq_event+0x34/0x60 [ 0.948181] handle_edge_irq+0x77/0x190 [ 0.948185] handle_irq+0xaf/0x120 [ 0.948188] do_IRQ+0x53/0x110 [ 0.948191] common_interrupt+0x87/0x87 [ 0.948192] </IRQ> .... [ 0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1 [ 0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017 [ 0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000 [ 0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0 [ 0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046 [ 0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00 [ 0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68 [ 0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000 [ 0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68 [ 0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00 [ 0.311149] FS: 000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000 [ 0.311150] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0 [ 0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 0.311154] Call Trace: [ 0.311157] do_raw_spin_lock+0xad/0xc0 [ 0.311160] _raw_spin_lock_irqsave+0x44/0x4b [ 0.311162] ? prepare_to_wait_exclusive+0x28/0xb0 [ 0.311164] prepare_to_wait_exclusive+0x28/0xb0 [ 0.311167] wbt_wait+0x127/0x330 [ 0.311169] ? finish_wait+0x80/0x80 [ 0.311172] ? generic_make_request+0xda/0x3b0 [ 0.311174] blk_mq_make_request+0xd6/0x7b0 [ 0.311176] ? blk_queue_enter+0x24/0x260 [ 0.311178] ? generic_make_request+0xda/0x3b0 [ 0.311181] generic_make_request+0x10c/0x3b0 [ 0.311183] ? submit_bio+0x5c/0x110 [ 0.311185] submit_bio+0x5c/0x110 [ 0.311197] ? __ext4_journal_stop+0x36/0xa0 [ext4] [ 0.311210] ext4_io_submit+0x48/0x60 [ext4] [ 0.311222] ext4_writepages+0x810/0x11f0 [ext4] [ 0.311229] ? do_writepages+0x3c/0xd0 [ 0.311239] ? ext4_mark_inode_dirty+0x260/0x260 [ext4] [ 0.311240] do_writepages+0x3c/0xd0 [ 0.311243] ? _raw_spin_unlock+0x24/0x30 [ 0.311245] ? wbc_attach_and_unlock_inode+0x165/0x280 [ 0.311248] ? __filemap_fdatawrite_range+0xa3/0xe0 [ 0.311250] __filemap_fdatawrite_range+0xa3/0xe0 [ 0.311253] file_write_and_wait_range+0x34/0x90 [ 0.311264] ext4_sync_file+0x151/0x500 [ext4] [ 0.311267] do_fsync+0x38/0x60 [ 0.311270] SyS_fsync+0xc/0x10 [ 0.311272] do_syscall_64+0x6f/0x170 [ 0.311274] entry_SYSCALL_64_after_hwframe+0x42/0xb7 In the original patch, wbt_done is waking up all the exclusive processes in the wait queue, which can cause a thundering herd if there is a large number of writer threads in the queue. The original intention of the code seems to be to wake up one thread only however, it uses wake_up_all() in __wbt_done(), and then uses the following check in __wbt_wait to have only one thread actually get out of the wait loop: if (waitqueue_active(&rqw->wait) && rqw->wait.head.next != &wait->entry) return false; The problem with this is that the wait entry in wbt_wait is define with DEFINE_WAIT, which uses the autoremove wakeup function. That means that the above check is invalid - the wait entry will have been removed from the queue already by the time we hit the check in the loop. Secondly, auto-removing the wait entries also means that the wait queue essentially gets reordered "randomly" (e.g. threads re-add themselves in the order they got to run after being woken up). Additionally, new requests entering wbt_wait might overtake requests that were queued earlier, because the wait queue will be (temporarily) empty after the wake_up_all, so the waitqueue_active check will not stop them. This can cause certain threads to starve under high load. The fix is to leave the woken up requests in the queue and remove them in finish_wait() once the current thread breaks out of the wait loop in __wbt_wait. This will ensure new requests always end up at the back of the queue, and they won't overtake requests that are already in the wait queue. With that change, the loop in wbt_wait is also in line with many other wait loops in the kernel. Waking up just one thread drastically reduces lock contention, as does moving the wait queue add/remove out of the loop. A significant drop in lockdep's lock contention numbers is seen when running the test application on the patched kernel. Signed-off-by: Anchal Agarwal <anchalag@amazon.com> Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-07	Merge branch 'qed-Add-Multi-TC-RoCE-support'	David S. Miller
	Denis Bolotin says: ==================== qed: Add Multi-TC RoCE support This patch series adds support for multiple concurrent traffic classes for RoCE. The first three patches enable the required parts of the driver to learn the TC configuration, and the last one makes use of it to enable the feature. Please consider applying this to net-next. V1->V2: ------- Avoid allocation in qed_dcbx_get_priority_tc(). Move qed_dcbx_get_priority_tc() out of CONFIG_DCB section since it doesn't call qed_dcbx_query_params() anymore. v2->V3: ------- patch 1/3: qed_dcbx_get_priority_tc() always returns a valid TC by value. In error cases, it returns QED_DCBX_DEFAULT_TC (currently defined 0). patch 3/3: Cosmetic changes in qed_dev.c. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	qed: Add Multi-TC RoCE support	Denis Bolotin
	RoCE qps use a pair of physical queues (pq) received from the Queue Manager (QM) - an offload queue (OFLD) and a low latency queue (LLT). The QM block creates a pq for each TC, and allows RoCE qps to ask for a pq with a specific TC. As a result, qps with different VLAN priorities can be mapped to different TCs, and employ features such as PFC and ETS. Signed-off-by: Michal Kalderon <michal.kalderon@cavium.com> Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	qed: Add a flag which indicates if offload TC is set	Denis Bolotin
	Distinguish not set offload_tc from offload_tc 0 and add getters and setters. Signed-off-by: Michal Kalderon <michal.kalderon@cavium.com> Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	qed: Add DCBX API - qed_dcbx_get_priority_tc()	Denis Bolotin
	The API receives a priority and looks for the TC it is mapped to in the operational DCBX configuration. The API returns QED_DCBX_DEFAULT_TC (0) when DCBX is disabled. Signed-off-by: Michal Kalderon <michal.kalderon@cavium.com> Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	RDS: IB: fix 'passing zero to ERR_PTR()' warning	YueHaibing
	Fix a static code checker warning: net/rds/ib_frmr.c:82 rds_ib_alloc_frmr() warn: passing zero to 'ERR_PTR' The error path for ib_alloc_mr failure should set err to PTR_ERR. Fixes: 1659185fb4d0 ("RDS: IB: Support Fastreg MR (FRMR) memory registration mode") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	Merge branch 'macb-add-pad-and-fcs-support'	David S. Miller
	Claudiu Beznea says: ==================== net: macb: add pad and fcs support In [1] it was reported that UDP checksum is offloaded to hardware no mather it was previously computed in software or not. The proposal on [1] was to disable TX checksum offload. This series (mostly patch 3/3) address the issue described at [1] by setting NOCRC bit to TX buffer descriptor for SKBs that arrived from networking stack with checksum computed. For these packets padding and FCS need to be added (hardware doesn't compute them if NOCRC bit is set). The minimum packet size that hardware expects is 64 bytes (including FCS). This feature could not be used in case of GSO, so, it was used only for no GSO SKBs. For SKBs wich requires padding and FCS computation macb_pad_and_fcs() checks if there is enough headroom and tailroom in SKB to avoid copying SKB structure. Since macb_pad_and_fcs() may change SKB the macb_pad_and_fcs() was places in macb_start_xmit() b/w macb_csum_clear() and skb_headlen() calls. This patch was tested with pktgen in kernel tool in a script like this: (pktgen_sample01_simple.sh is at [2]): minSize=1 maxSize=1500 for i in `seq $minSize $maxSize` ; do copy="$(shuf -i 1-2000 -n 1)" ./pktgen_sample01_simple.sh -i eth0 \ -m <dst-mac-addr> -d <dst-ip-addr> -x -s $i -c $copy done minStep=1 maxStep=200 for i in `seq $minStep $maxStep` ; do copy="$(shuf -i 1-2000 -n 1)" size="$(shuf -i 1-1500 -n 1)" ./pktgen_sample01_simple.sh -i eth0 \ -m <dst-mac-addr> -d <dst-ip-addr> -x -s $size -c $copy done Changes since RFC: - in patch 3/3 order local variables by their lenght (reverse christmas tree format) [1] https://www.spinics.net/lists/netdev/msg505065.html [2] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen_sample01_simple.sh ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	net: macb: add support for padding and fcs computation	Claudiu Beznea
	For packets with computed IP/TCP/UDP checksum there is no need to tell hardware to recompute it. For such kind of packets hardware expects the packet to be at least 64 bytes and FCS to be computed. Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	net: macb: move checksum clearing outside of spinlock	Claudiu Beznea
	Move checksum clearing outside of spinlock. The SKB is protected by networking lock (HARD_TX_LOCK()). Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	net: macb: use netdev_tx_t return type for ndo_start_xmit functions	Claudiu Beznea
	Use netdev_tx_t return type for ndo_start_xmit function of macb driver. Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	tipc: fix an interrupt unsafe locking scenario	Ying Xue
	Commit 9faa89d4ed9d ("tipc: make function tipc_net_finalize() thread safe") tries to make it thread safe to set node address, so it uses node_list_lock lock to serialize the whole process of setting node address in tipc_net_finalize(). But it causes the following interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- rht_deferred_worker() rhashtable_rehash_table() lock(&(&ht->lock)->rlock) tipc_nl_compat_doit() tipc_net_finalize() local_irq_disable(); lock(&(&tn->node_list_lock)->rlock); tipc_sk_reinit() rhashtable_walk_enter() lock(&(&ht->lock)->rlock); <Interrupt> tipc_disc_rcv() tipc_node_check_dest() tipc_node_create() lock(&(&tn->node_list_lock)->rlock); * DEADLOCK * When rhashtable_rehash_table() holds ht->lock on CPU0, it doesn't disable BH. So if an interrupt happens after the lock, it can create an inverse lock ordering between ht->lock and tn->node_list_lock. As a consequence, deadlock might happen. The reason causing the inverse lock ordering scenario above is because the initial purpose of node_list_lock is not designed to do the serialization of node address setting. As cmpxchg() can guarantee CAS (compare-and-swap) process is atomic, we use it to replace node_list_lock to ensure setting node address can be atomically finished. It turns out the potential deadlock can be avoided as well. Fixes: 9faa89d4ed9d ("tipc: make function tipc_net_finalize() thread safe") Signed-off-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <maloy@donjonn.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	x86/paravirt: Fix spectre-v2 mitigations for paravirt guests	Peter Zijlstra
	Nadav reported that on guests we're failing to rewrite the indirect calls to CALLEE_SAVE paravirt functions. In particular the pv_queued_spin_unlock() call is left unpatched and that is all over the place. This obviously wrecks Spectre-v2 mitigation (for paravirt guests) which relies on not actually having indirect calls around. The reason is an incorrect clobber test in paravirt_patch_call(); this function rewrites an indirect call with a direct call to the _SAME_ function, there is no possible way the clobbers can be different because of this. Therefore remove this clobber check. Also put WARNs on the other patch failure case (not enough room for the instruction) which I've not seen trigger in my (limited) testing. Three live kernel image disassemblies for lock_sock_nested (as a small function that illustrates the problem nicely). PRE is the current situation for guests, POST is with this patch applied and NATIVE is with or without the patch for !guests. PRE: (gdb) disassemble lock_sock_nested Dump of assembler code for function lock_sock_nested: 0xffffffff817be970 <+0>: push %rbp 0xffffffff817be971 <+1>: mov %rdi,%rbp 0xffffffff817be974 <+4>: push %rbx 0xffffffff817be975 <+5>: lea 0x88(%rbp),%rbx 0xffffffff817be97c <+12>: callq 0xffffffff819f7160 <_cond_resched> 0xffffffff817be981 <+17>: mov %rbx,%rdi 0xffffffff817be984 <+20>: callq 0xffffffff819fbb00 <_raw_spin_lock_bh> 0xffffffff817be989 <+25>: mov 0x8c(%rbp),%eax 0xffffffff817be98f <+31>: test %eax,%eax 0xffffffff817be991 <+33>: jne 0xffffffff817be9ba <lock_sock_nested+74> 0xffffffff817be993 <+35>: movl $0x1,0x8c(%rbp) 0xffffffff817be99d <+45>: mov %rbx,%rdi 0xffffffff817be9a0 <+48>: callq *0xffffffff822299e8 0xffffffff817be9a7 <+55>: pop %rbx 0xffffffff817be9a8 <+56>: pop %rbp 0xffffffff817be9a9 <+57>: mov $0x200,%esi 0xffffffff817be9ae <+62>: mov $0xffffffff817be993,%rdi 0xffffffff817be9b5 <+69>: jmpq 0xffffffff81063ae0 <__local_bh_enable_ip> 0xffffffff817be9ba <+74>: mov %rbp,%rdi 0xffffffff817be9bd <+77>: callq 0xffffffff817be8c0 <__lock_sock> 0xffffffff817be9c2 <+82>: jmp 0xffffffff817be993 <lock_sock_nested+35> End of assembler dump. POST: (gdb) disassemble lock_sock_nested Dump of assembler code for function lock_sock_nested: 0xffffffff817be970 <+0>: push %rbp 0xffffffff817be971 <+1>: mov %rdi,%rbp 0xffffffff817be974 <+4>: push %rbx 0xffffffff817be975 <+5>: lea 0x88(%rbp),%rbx 0xffffffff817be97c <+12>: callq 0xffffffff819f7160 <_cond_resched> 0xffffffff817be981 <+17>: mov %rbx,%rdi 0xffffffff817be984 <+20>: callq 0xffffffff819fbb00 <_raw_spin_lock_bh> 0xffffffff817be989 <+25>: mov 0x8c(%rbp),%eax 0xffffffff817be98f <+31>: test %eax,%eax 0xffffffff817be991 <+33>: jne 0xffffffff817be9ba <lock_sock_nested+74> 0xffffffff817be993 <+35>: movl $0x1,0x8c(%rbp) 0xffffffff817be99d <+45>: mov %rbx,%rdi 0xffffffff817be9a0 <+48>: callq 0xffffffff810a0c20 <__raw_callee_save___pv_queued_spin_unlock> 0xffffffff817be9a5 <+53>: xchg %ax,%ax 0xffffffff817be9a7 <+55>: pop %rbx 0xffffffff817be9a8 <+56>: pop %rbp 0xffffffff817be9a9 <+57>: mov $0x200,%esi 0xffffffff817be9ae <+62>: mov $0xffffffff817be993,%rdi 0xffffffff817be9b5 <+69>: jmpq 0xffffffff81063aa0 <__local_bh_enable_ip> 0xffffffff817be9ba <+74>: mov %rbp,%rdi 0xffffffff817be9bd <+77>: callq 0xffffffff817be8c0 <__lock_sock> 0xffffffff817be9c2 <+82>: jmp 0xffffffff817be993 <lock_sock_nested+35> End of assembler dump. NATIVE: (gdb) disassemble lock_sock_nested Dump of assembler code for function lock_sock_nested: 0xffffffff817be970 <+0>: push %rbp 0xffffffff817be971 <+1>: mov %rdi,%rbp 0xffffffff817be974 <+4>: push %rbx 0xffffffff817be975 <+5>: lea 0x88(%rbp),%rbx 0xffffffff817be97c <+12>: callq 0xffffffff819f7160 <_cond_resched> 0xffffffff817be981 <+17>: mov %rbx,%rdi 0xffffffff817be984 <+20>: callq 0xffffffff819fbb00 <_raw_spin_lock_bh> 0xffffffff817be989 <+25>: mov 0x8c(%rbp),%eax 0xffffffff817be98f <+31>: test %eax,%eax 0xffffffff817be991 <+33>: jne 0xffffffff817be9ba <lock_sock_nested+74> 0xffffffff817be993 <+35>: movl $0x1,0x8c(%rbp) 0xffffffff817be99d <+45>: mov %rbx,%rdi 0xffffffff817be9a0 <+48>: movb $0x0,(%rdi) 0xffffffff817be9a3 <+51>: nopl 0x0(%rax) 0xffffffff817be9a7 <+55>: pop %rbx 0xffffffff817be9a8 <+56>: pop %rbp 0xffffffff817be9a9 <+57>: mov $0x200,%esi 0xffffffff817be9ae <+62>: mov $0xffffffff817be993,%rdi 0xffffffff817be9b5 <+69>: jmpq 0xffffffff81063ae0 <__local_bh_enable_ip> 0xffffffff817be9ba <+74>: mov %rbp,%rdi 0xffffffff817be9bd <+77>: callq 0xffffffff817be8c0 <__lock_sock> 0xffffffff817be9c2 <+82>: jmp 0xffffffff817be993 <lock_sock_nested+35> End of assembler dump. Fixes: 63f70270ccd9 ("[PATCH] i386: PARAVIRT: add common patching machinery") Fixes: 3010a0663fd9 ("x86/paravirt, objtool: Annotate indirect calls") Reported-by: Nadav Amit <namit@vmware.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: stable@vger.kernel.org
2018-08-07	Merge branch 'ibmvnic-next'	David S. Miller
	Thomas Falcon says: ==================== ibmvnic: Update firmware error reporting This patch set cleans out a lot of dead code from the ibmvnic driver and adds some more. The error ID field of the descriptor is not filled in by firmware, so do not print it and do not use it to query for more detailed information. Remove the unused code written for this. Finally, update the message to print a string explainng the error cause instead of just the error code. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	ibmvnic: Update firmware error reporting with cause string	Thomas Falcon
	Print a string instead of the error code. Since there is a possibility that the driver can recover, classify it as a warning instead of an error. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	ibmvnic: Remove code to request error information	Thomas Falcon
	When backing device firmware reports an error, it provides an error ID, which is meant to be queried for more detailed error information. Currently, however, an error ID is not provided by the Virtual I/O server and there are not any plans to do so. For now, it is always unfilled or zero, so request_error_information will never be called. Remove it. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	liquidio: avoided acquiring post_lock for data only queues	Intiyaz Basha
	All control commands (soft commands) goes through only Queue 0 (control and data queue). So only queue-0 needs post_lock, other queues are only data queues and does not need post_lock Added a flag to indicate the queue can be used for soft commands. If this flag is set, post_lock must be acquired before posting a command to the queue. If this flag is clear, post_lock is invalid for the queue. Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	vsock: split dwork to avoid reinitializations	Cong Wang
	syzbot reported that we reinitialize an active delayed work in vsock_stream_connect(): ODEBUG: init active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x90 kernel/workqueue.c:1414 WARNING: CPU: 1 PID: 11518 at lib/debugobjects.c:329 debug_print_object+0x16a/0x210 lib/debugobjects.c:326 The pattern is apparently wrong, we should only initialize the dealyed work once and could repeatly schedule it. So we have to move out the initializations to allocation side. And to avoid confusion, we can split the shared dwork into two, instead of re-using the same one. Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") Reported-by: <syzbot+8a9b1bd330476a4f3db6@syzkaller.appspotmail.com> Cc: Andy king <acking@vmware.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	net: thunderx: check for failed allocation lmac->dmacs	Colin Ian King
	The allocation of lmac->dmacs is not being checked for allocation failure. Add the check. Fixes: 3a34ecfd9d3f ("net: thunderx: add MAC address filter tracking for LMAC") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	ip6_tunnel: collect_md xmit: Use ip_tunnel_key's provided src address	Shmulik Ladkani
	When using an ip6tnl device in collect_md mode, the xmit methods ignore the ipv6.src field present in skb_tunnel_info's key, both for route calculation purposes (flowi6 construction) and for assigning the packet's final ipv6h->saddr. This makes it impossible specifying a desired ipv6 local address in the encapsulating header (for example, when using tc action tunnel_key). This is also not aligned with behavior of ipip (ipv4) in collect_md mode, where the key->u.ipv4.src gets used. Fix, by assigning fl6.saddr with given key->u.ipv6.src. In case ipv6.src is not specified, ip6_tnl_xmit uses existing saddr selection code. Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels") Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Reviewed-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	MAINTAINERS: add an entry for MediaTek Bluetooth driver	Sean Wang
	Add an entry for the MediaTek Bluetooth driver. Signed-off-by: Sean Wang <sean.wang@mediatek.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-08-07	net: sched: cls_flower: set correct offload data in fl_reoffload	Vlad Buslov
	fl_reoffload implementation sets following members of struct tc_cls_flower_offload incorrectly: - masked key instead of mask - key instead of masked key Fix fl_reoffload to provide correct data to offload callback. Fixes: 31533cba4327 ("net: sched: cls_flower: implement offload tcf_proto_op") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	cxgb4: mk_act_open_req() buggers ->{local, peer}_ip on big-endian hosts	Al Viro
	Unlike fs.val.lport and fs.val.fport, cxgb4_process_flow_match() sets fs.val.{l,f}ip to net-endian values without conversion - they come straight from flow_dissector_key_ipv4_addrs ->dst and ->src resp. So the assignment in mk_act_open_req() ought to be a straight copy. As far as I know, T4 PCIe cards do exist, so it's not as if that thing could only be found on little-endian systems... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	Bluetooth: mediatek: Add protocol support for MediaTek serial devices	Sean Wang
	This adds a driver based on serdev driver for the MediaTek serial protocol based on running H:4, which can enable the built-in Bluetooth device inside MT7622 SoC. Signed-off-by: Sean Wang <sean.wang@mediatek.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-08-07	dt-bindings: net: bluetooth: Add mediatek-bluetooth	Sean Wang
	Add binding document for a SoC built-in device using MediaTek protocol. Which could be found on MT7622 SoC or other similar MediaTek SoCs. Signed-off-by: Sean Wang <sean.wang@mediatek.com> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-08-07	smb3: display stats counters for number of slow commands	Steve French
	When CONFIG_CIFS_STATS2 is enabled keep counters for slow commands (ie server took longer than 1 second to respond) by SMB2/SMB3 command code. This can help in diagnosing whether performance problems are on server (instead of client) and which commands are causing the problem. Sample output (the new lines contain words "slow responses ...") $ cat /proc/fs/cifs/Stats Resources in use CIFS Session: 1 Share (unique mount targets): 2 SMB Request/Response Buffer: 1 Pool size: 5 SMB Small Req/Resp Buffer: 1 Pool size: 30 Total Large 10 Small 490 Allocations Operations (MIDs): 0 0 session 0 share reconnects Total vfs operations: 67 maximum at one time: 2 4 slow responses from localhost for command 5 1 slow responses from localhost for command 6 1 slow responses from localhost for command 14 1 slow responses from localhost for command 16 1) \\localhost\test SMBs: 243 Bytes read: 1024000 Bytes written: 104857600 TreeConnects: 1 total 0 failed TreeDisconnects: 0 total 0 failed Creates: 40 total 0 failed Closes: 39 total 0 failed ... Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-08-07	CIFS: fix uninitialized ptr deref in smb2 signing	Aurelien Aptel
	server->secmech.sdeschmacsha256 is not properly initialized before smb2_shash_allocate(), set shash after that call. also fix typo in error message Fixes: 8de8c4608fe9 ("cifs: Fix validation of signed data in smb2") Signed-off-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Paulo Alcantara <palcantara@suse.com> Reported-by: Xiaoli Feng <xifeng@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org>
2018-08-07	smb3: Do not send SMB3 SET_INFO if nothing changed	Steve French
	An earlier commit had a typo which prevented the optimization from working: commit 18dd8e1a65dd ("Do not send SMB3 SET_INFO request if nothing is changing") Thank you to Metze for noticing this. Also clear a reserved field in the FILE_BASIC_INFO struct we send that should be zero (all the other fields in that struct were set or cleared explicitly already in cifs_set_file_info). Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org> # 4.9.x+ Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-07	smb3: fix minor debug output for CONFIG_CIFS_STATS	Steve French
	CONFIG_CIFS_STATS is now always enabled (to simplify the code and since the STATS are important for some common customer use cases and also debugging), but needed one minor change so that STATS shows as enabled in the debug output in /proc/fs/cifs/DebugData, otherwise it could get confusing with STATS no longer showing up in the "Features" list in /proc/fs/cifs/DebugData when basic stats were in fact available. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-08-07	smb3: add tracepoint for slow responses	Steve French
	If responses take longer than one second from the server, we can optionally log them to dmesg in current cifs.ko code (CONFIG_CIFS_STATS2 must be configured and a /proc/fs/cifs/cifsFYI flag must be set), but can be more useful to log these via ftrace (tracepoint is smb3_slow_rsp) which is easier and more granular (still requires CONFIG_CIFS_STATS2 to be configured in the build though). Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-07	cifs: add compound_send_recv()	Ronnie Sahlberg
	Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-07	cifs: make smb_send_rqst take an array of requests	Ronnie Sahlberg
	Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-07	Merge branch 'nfp-ttl-tos-geneve'	David S. Miller
	Simon Horman says: ==================== nfp: flower: tunnel TTL & TOS, and Geneve options set & match support this series contains updates for the TC Flower classifier and the offload facility for it in the NFP driver. * Patches 1 & 2: update the NFP driver to allow offload of matching and setting tunnel ToS/TTL of flows using the TC Flower classifier and tun_key action * Patches 3 & 4: enhance the flow dissector and TC Flower classifier to allow match on Geneve options * Patch 5 & 6: update the NFP driver to allow offload of matching and setting Geneve options of flows using the TC Flower classifier and tun_key action ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	nfp: flower: add geneve option match offload	Pieter Jansen van Vuuren
	Introduce a new layer for matching on geneve options. This allows offloading filters configured to match geneve with options. Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	nfp: flower: add geneve option push action offload	Pieter Jansen van Vuuren
	Introduce new push geneve option action. This allows offloading filters configured to entunnel geneve with options. Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	net/sched: allow flower to match tunnel options	Pieter Jansen van Vuuren
	Allow matching on options in Geneve tunnel headers. This makes use of existing tunnel metadata support. The options can be described in the form CLASS:TYPE:DATA/CLASS_MASK:TYPE_MASK:DATA_MASK, where CLASS is represented as a 16bit hexadecimal value, TYPE as an 8bit hexadecimal value and DATA as a variable length hexadecimal value. e.g. # ip link add name geneve0 type geneve dstport 0 external # tc qdisc add dev geneve0 ingress # tc filter add dev geneve0 protocol ip parent ffff: \ flower \ enc_src_ip 10.0.99.192 \ enc_dst_ip 10.0.99.193 \ enc_key_id 11 \ geneve_opts 0102:80:1122334421314151/ffff:ff:ffffffffffffffff \ ip_proto udp \ action mirred egress redirect dev eth1 This patch adds support for matching Geneve options in the order supplied by the user. This leads to an efficient implementation in the software datapath (and in our opinion hardware datapaths that offload this feature). It is also compatible with Geneve options matching provided by the Open vSwitch kernel datapath which is relevant here as the Flower classifier may be used as a mechanism to program flows into hardware as a form of Open vSwitch datapath offload (sometimes referred to as OVS-TC). The netlink Kernel/Userspace API may be extended, for example by adding a flag, if other matching options are desired, for example matching given options in any order. This would require an implementation in the TC software datapath. And be done in a way that drivers that facilitate offload of the Flower classifier can reject or accept such flows based on hardware datapath capabilities. This approach was discussed and agreed on at Netconf 2017 in Seoul. Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	flow_dissector: allow dissection of tunnel options from metadata	Simon Horman
	Allow the existing 'dissection' of tunnel metadata to 'dissect' options already present in tunnel metadata. This dissection is controlled by a new dissector key, FLOW_DISSECTOR_KEY_ENC_OPTS. This dissection only occurs when skb_flow_dissect_tunnel_info() is called, currently only the Flower classifier makes that call. So there should be no impact on other users of the flow dissector. This is in preparation for allowing the flower classifier to match on Geneve options. Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	nfp: flower: allow matching on ipv4 UDP tunnel tos and ttl	John Hurley
	The addition of FLOW_DISSECTOR_KEY_ENC_IP to TC flower means that the ToS and TTL of the tunnel header can now be matched on. Extend the NFP tunnel match function to include these new fields. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	nfp: flower: set ip tunnel ttl from encap action	John Hurley
	The TTL for encapsulating headers in IPv4 UDP tunnels is taken from a route lookup. Modify this to first check if a user has specified a TTL to be used in the TC action. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07	cifs: update init_sg, crypt_message to take an array of rqst	Ronnie Sahlberg
	These are used for SMB3 encryption and compounded requests. Update these functions and the other functions related to SMB3 encryption to take an array of requests. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-07	i40e: fix i40e_add_queue_stats data pointer update	Jacob Keller
	This function accidentally failed to update the data pointer, which caused the reported stats to be incorrect. Additionally, statistics which follow queue stats in the output would potentially read non-zeroed garbage data from the ethtool buffer. This occurred because the data double pointer was not dereferenced before incrementing the size. Additionally, make sure this issue is more visible by adding a WARN_ONCE to the i40e_get_ethtool_stats function. This warning will trigger whenever the data pointer is not at the expected address, similar to the check that we make in the i40e_get_stat_strings() function. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-08-07	i40e: Add AQ command for rearrange NVM structure	Piotr Azarewicz
	During switching between old NVM structure approach (called structured NVM) to new one (called flat NVM) or backward flash needs to be rearranged to required NVM structure. This is a part of transition from one NVM structure to another. The function is introduced to command firmware to start rearrangement process. Signed-off-by: Piotr Azarewicz <piotr.azarewicz@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-08-07	i40e: Add additional return code to i40e_asq_send_command	Piotr Azarewicz
	Firmware can return a busy state, so the function return I40E_ERR_NOT_READY. Signed-off-by: Piotr Azarewicz <piotr.azarewicz@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-08-07	smb3: update readme to correct information about /proc/fs/cifs/Stats	Steve French
	Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-08-07	smb3: fix reset of bytes read and written stats	Steve French
	echo 0 > /proc/fs/cifs/Stats is supposed to reset the stats but there were four (see example below) that were not reset (bytes read and witten, total vfs ops and max ops at one time). ... 0 session 0 share reconnects Total vfs operations: 100 maximum at one time: 2 1) \\localhost\test SMBs: 0 Bytes read: 502092 Bytes written: 31457286 TreeConnects: 0 total 0 failed TreeDisconnects: 0 total 0 failed ... This patch fixes cifs_stats_proc_write to properly reset those four. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-08-07	smb3: display bytes_read and bytes_written in smb3 stats	Steve French
	We were only displaying bytes_read and bytes_written in cifs stats, fix smb3 stats to also display them. Sample output with this patch: cat /proc/fs/cifs/Stats: CIFS Session: 1 Share (unique mount targets): 2 SMB Request/Response Buffer: 1 Pool size: 5 SMB Small Req/Resp Buffer: 1 Pool size: 30 Operations (MIDs): 0 0 session 0 share reconnects Total vfs operations: 94 maximum at one time: 2 1) \\localhost\test SMBs: 214 Bytes read: 502092 Bytes written: 31457286 TreeConnects: 1 total 0 failed TreeDisconnects: 0 total 0 failed Creates: 52 total 3 failed Closes: 48 total 0 failed Flushes: 0 total 0 failed Reads: 17 total 0 failed Writes: 31 total 0 failed ... Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-08-07	cifs: simple stats should always be enabled	Steve French
	CONFIG_CIFS_STATS should always be enabled as Pavel recently noted. Simple statistics are not a significant performance hit, and removing the ifdef simplifies the code slightly. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-08-07	cifs: use a refcount to protect open/closing the cached file handle	Ronnie Sahlberg
	Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Cc: <stable@vger.kernel.org>
2018-08-07	smb3: add reconnect tracepoints	Steve French
	Add tracepoints for reconnecting an smb3 session Example output (from trace-cmd) with the patch (showing the session marked for reconnect, the stat failing, and then the subsequent SMB3 commands after the server comes back up). The "smb3_reconnect" event is the new one. cifsd-25993 [000] .... 29635.368265: smb3_reconnect: server=localhost current_mid=0x1e stat-26200 [001] .... 29638.516403: smb3_enter: cifs_revalidate_dentry_attr: xid=22 stat-26200 [001] .... 29648.723296: smb3_exit_err: cifs_revalidate_dentry_attr: xid=22 rc=-112 kworker/0:1-22830 [000] .... 29653.850947: smb3_cmd_done: sid=0x0 tid=0x0 cmd=0 mid=0 kworker/0:1-22830 [000] .... 29653.851191: smb3_cmd_err: sid=0x8ae4683c tid=0x0 cmd=1 mid=1 status=0xc0000016 rc=-5 kworker/0:1-22830 [000] .... 29653.855254: smb3_cmd_done: sid=0x8ae4683c tid=0x0 cmd=1 mid=2 kworker/0:1-22830 [000] .... 29653.855482: smb3_cmd_done: sid=0x8ae4683c tid=0x8084f30d cmd=3 mid=3 Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-08-07	i40e: fix warning about shadowed ring parameter	Jacob Keller
	In commit 147e81ec7568 ("i40e: Test memory before ethtool alloc succeeds") code was added to handle ring allocation on systems with low memory. It shadowed the ring parameter pointer by introducing a local ring pointer inside the for loop. Most of the code in the loop already just accessed the ring via &rx_rings[i]. Since most of the code already does this, just remove the local variable. If someone considers it worth keeping a local around, they should use it for the whole section instead of just a couple of accesses. This fixes a warning when -Wshadow is enabled Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-08-07	i40e: remove unnecessary i variable causing -Wshadow warning	Jacob Keller
	Commit c61c8fe1d592 ("i40e: Implement an ethtool private flag to stop LLDP in FW") added an extra for-loop which added a shadowing 'i' variable as the index. However, the local variable i already exists, and we already use it as a loop index. Additionally, at this point, there is no further use of the variable, so it's safe to simply overwrite the variable contents. This fixes a -Wshadow warning which has started being enabled on some distributions Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Patryk Malek <patryk.malek@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>