summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-11-01Merge tag 'for-5.16/scsi-ma-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull SCSI multi-actuator support from Jens Axboe: "This adds SCSI support for the recently merged block multi-actuator support. Since this was sitting on top of the block tree, the SCSI side asked me to queue it up." * tag 'for-5.16/scsi-ma-2021-10-29' of git://git.kernel.dk/linux-block: doc: Fix typo in request queue sysfs documentation doc: document sysfs queue/independent_access_ranges attributes libata: support concurrent positioning ranges log scsi: sd: add concurrent positioning ranges support
2021-11-01Merge tag 'for-5.16/bdev-size-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull bdev size cleanups from Jens Axboe: "Clean up the bdev size handling with new bdev_nr_bytes() helper" * tag 'for-5.16/bdev-size-2021-10-29' of git://git.kernel.dk/linux-block: (34 commits) partitions/ibm: use bdev_nr_sectors instead of open coding it partitions/efi: use bdev_nr_bytes instead of open coding it block/ioctl: use bdev_nr_sectors and bdev_nr_bytes block: cache inode size in bdev udf: use sb_bdev_nr_blocks reiserfs: use sb_bdev_nr_blocks ntfs: use sb_bdev_nr_blocks jfs: use sb_bdev_nr_blocks ext4: use sb_bdev_nr_blocks block: add a sb_bdev_nr_blocks helper block: use bdev_nr_bytes instead of open coding it in blkdev_fallocate squashfs: use bdev_nr_bytes instead of open coding it reiserfs: use bdev_nr_bytes instead of open coding it pstore/blk: use bdev_nr_bytes instead of open coding it ntfs3: use bdev_nr_bytes instead of open coding it nilfs2: use bdev_nr_bytes instead of open coding it nfs/blocklayout: use bdev_nr_bytes instead of open coding it jfs: use bdev_nr_bytes instead of open coding it hfsplus: use bdev_nr_sectors instead of open coding it hfs: use bdev_nr_sectors instead of open coding it ...
2021-11-01Merge tag 'for-5.16/io_uring-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "Light on new features - basically just the hybrid mode support. Outside of that it's just fixes, cleanups, and performance improvements. In detail: - Add ring related information to the fdinfo output (Hao) - Hybrid async mode (Hao) - Support for batched issue on block (me) - sqe error trace improvement (me) - IOPOLL efficiency improvements (Pavel) - submit state cleanups and improvements (Pavel) - Completion side improvements (Pavel) - Drain improvements (Pavel) - Buffer selection cleanups (Pavel) - Fixed file node improvements (Pavel) - io-wq setup cancelation fix (Pavel) - Various other performance improvements and cleanups (Pavel) - Misc fixes (Arnd, Bixuan, Changcheng, Hao, me, Noah)" * tag 'for-5.16/io_uring-2021-10-29' of git://git.kernel.dk/linux-block: (97 commits) io-wq: remove worker to owner tw dependency io_uring: harder fdinfo sq/cq ring iterating io_uring: don't assign write hint in the read path io_uring: clusterise ki_flags access in rw_prep io_uring: kill unused param from io_file_supports_nowait io_uring: clean up timeout async_data allocation io_uring: don't try io-wq polling if not supported io_uring: check if opcode needs poll first on arming io_uring: clean iowq submit work cancellation io_uring: clean io_wq_submit_work()'s main loop io-wq: use helper for worker refcounting io_uring: implement async hybrid mode for pollable requests io_uring: Use ERR_CAST() instead of ERR_PTR(PTR_ERR()) io_uring: split logic of force_nonblock io_uring: warning about unused-but-set parameter io_uring: inform block layer of how many requests we are submitting io_uring: simplify io_file_supports_nowait() io_uring: combine REQ_F_NOWAIT_{READ,WRITE} flags io_uring: arm poll for non-nowait files fs/io_uring: Prioritise checking faster conditions first in io_write ...
2021-11-01Merge tag 'for-5.16/drivers-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: - paride driver cleanups (Christoph) - Remove cryptoloop support (Christoph) - null_blk poll support (me) - Now that add_disk() supports proper error handling, add it to various drivers (Luis) - Make ataflop actually work again (Michael) - s390 dasd fixes (Stefan, Heiko) - nbd fixes (Yu, Ye) - Remove redundant wq flush in mtip32xx (Christophe) - NVMe updates - fix a multipath partition scanning deadlock (Hannes Reinecke) - generate uevent once a multipath namespace is operational again (Hannes Reinecke) - support unique discovery controller NQNs (Hannes Reinecke) - fix use-after-free when a port is removed (Israel Rukshin) - clear shadow doorbell memory on resets (Keith Busch) - use struct_size (Len Baker) - add error handling support for add_disk (Luis Chamberlain) - limit the maximal queue size for RDMA controllers (Max Gurtovoy) - use a few more symbolic names (Max Gurtovoy) - fix error code in nvme_rdma_setup_ctrl (Max Gurtovoy) - add support for ->map_queues on FC (Saurav Kashyap) - support the current discovery subsystem entry (Hannes Reinecke) - use flex_array_size and struct_size (Len Baker) - bcache fixes (Christoph, Coly, Chao, Lin, Qing) - MD updates (Christoph, Guoqing, Xiao) - Misc fixes (Dan, Ding, Jiapeng, Shin'ichiro, Ye) * tag 'for-5.16/drivers-2021-10-29' of git://git.kernel.dk/linux-block: (117 commits) null_blk: Fix handling of submit_queues and poll_queues attributes block: ataflop: Fix warning comparing pointer to 0 bcache: replace snprintf in show functions with sysfs_emit bcache: move uapi header bcache.h to bcache code directory nvmet: use flex_array_size and struct_size nvmet: register discovery subsystem as 'current' nvmet: switch check for subsystem type nvme: add new discovery log page entry definitions block: ataflop: more blk-mq refactoring fixes block: remove support for cryptoloop and the xor transfer mtd: add add_disk() error handling rnbd: add error handling support for add_disk() um/drivers/ubd_kern: add error handling support for add_disk() m68k/emu/nfblock: add error handling support for add_disk() xen-blkfront: add error handling support for add_disk() bcache: add error handling support for add_disk() dm: add add_disk() error handling block: aoe: fixup coccinelle warnings nvmet: use struct_size over open coded arithmetic nvme: drop scan_lock and always kick requeue list when removing namespaces ...
2021-11-01Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - mq-deadline accounting improvements (Bart) - blk-wbt timer fix (Andrea) - Untangle the block layer includes (Christoph) - Rework the poll support to be bio based, which will enable adding support for polling for bio based drivers (Christoph) - Block layer core support for multi-actuator drives (Damien) - blk-crypto improvements (Eric) - Batched tag allocation support (me) - Request completion batching support (me) - Plugging improvements (me) - Shared tag set improvements (John) - Concurrent queue quiesce support (Ming) - Cache bdev in ->private_data for block devices (Pavel) - bdev dio improvements (Pavel) - Block device invalidation and block size improvements (Xie) - Various cleanups, fixes, and improvements (Christoph, Jackie, Masahira, Tejun, Yu, Pavel, Zheng, me) * tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits) blk-mq-debugfs: Show active requests per queue for shared tags block: improve readability of blk_mq_end_request_batch() virtio-blk: Use blk_validate_block_size() to validate block size loop: Use blk_validate_block_size() to validate block size nbd: Use blk_validate_block_size() to validate block size block: Add a helper to validate the block size block: re-flow blk_mq_rq_ctx_init() block: prefetch request to be initialized block: pass in blk_mq_tags to blk_mq_rq_ctx_init() block: add rq_flags to struct blk_mq_alloc_data block: add async version of bio_set_polled block: kill DIO_MULTI_BIO block: kill unused polling bits in __blkdev_direct_IO() block: avoid extra iter advance with async iocb block: Add independent access ranges support blk-mq: don't issue request directly in case that current is to be blocked sbitmap: silence data race warning blk-cgroup: synchronize blkg creation against policy deactivation block: refactor bio_iov_bvec_set() block: add single bio async direct IO helper ...
2021-11-01selftests, bpf: Fix broken riscv buildBjörn Töpel
This patch is closely related to commit 6016df8fe874 ("selftests/bpf: Fix broken riscv build"). When clang includes the system include directories, but targeting BPF program, __BITS_PER_LONG defaults to 32, unless explicitly set. Work around this problem, by explicitly setting __BITS_PER_LONG to __riscv_xlen. Signed-off-by: Björn Töpel <bjorn@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211028161057.520552-5-bjorn@kernel.org
2021-11-01riscv, libbpf: Add RISC-V (RV64) support to bpf_tracing.hBjörn Töpel
Add macros for 64-bit RISC-V PT_REGS to bpf_tracing.h. Signed-off-by: Björn Töpel <bjorn@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211028161057.520552-4-bjorn@kernel.org
2021-11-01tools, build: Add RISC-V to HOSTARCH parsingBjörn Töpel
Add RISC-V to the HOSTARCH parsing, so that ARCH is "riscv", and not "riscv32" or "riscv64". This affects the perf and libbpf builds, so that arch specific includes are correctly picked up for RISC-V. Signed-off-by: Björn Töpel <bjorn@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211028161057.520552-3-bjorn@kernel.org
2021-11-01riscv, bpf: Increase the maximum number of iterationsBjörn Töpel
Now that BPF programs can be up to 1M instructions, it is not uncommon that a program requires more than the current 16 iterations to converge. Bump it to 32, which is enough for selftests/bpf, and test_bpf.ko. Signed-off-by: Björn Töpel <bjorn@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211028161057.520552-2-bjorn@kernel.org
2021-11-01selftests, bpf: Add one test for sockmap with strparserLiu Jian
Add the test to check sockmap with strparser is working well. Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20211029141216.211899-3-liujian56@huawei.com
2021-11-01selftests, bpf: Fix test_txmsg_ingress_parser errorLiu Jian
After "skmsg: lose offset info in sk_psock_skb_ingress", the test case with ktls failed. This because ktls parser(tls_read_size) return value is 285 not 256. The case like this: tls_sk1 --> redir_sk --> tls_sk2 tls_sk1 sent out 512 bytes data, after tls related processing redir_sk recved 570 btyes data, and redirect 512 (skb_use_parser) bytes data to tls_sk2; but tls_sk2 needs 285 * 2 bytes data, receive timeout occurred. Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20211029141216.211899-2-liujian56@huawei.com
2021-11-01skmsg: Lose offset info in sk_psock_skb_ingressLiu Jian
If sockmap enable strparser, there are lose offset info in sk_psock_skb_ingress(). If the length determined by parse_msg function is not skb->len, the skb will be converted to sk_msg multiple times, and userspace app will get the data multiple times. Fix this by get the offset and length from strp_msg. And as Cong suggested, add one bit in skb->_sk_redir to distinguish enable or disable strparser. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Cong Wang <cong.wang@bytedance.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20211029141216.211899-1-liujian56@huawei.com
2021-11-01selftests/bpf: Fix strobemeta selftest regressionAndrii Nakryiko
After most recent nightly Clang update strobemeta selftests started failing with the following error (relevant portion of assembly included): 1624: (85) call bpf_probe_read_user_str#114 1625: (bf) r1 = r0 1626: (18) r2 = 0xfffffffe 1628: (5f) r1 &= r2 1629: (55) if r1 != 0x0 goto pc+7 1630: (07) r9 += 104 1631: (6b) *(u16 *)(r9 +0) = r0 1632: (67) r0 <<= 32 1633: (77) r0 >>= 32 1634: (79) r1 = *(u64 *)(r10 -456) 1635: (0f) r1 += r0 1636: (7b) *(u64 *)(r10 -456) = r1 1637: (79) r1 = *(u64 *)(r10 -368) 1638: (c5) if r1 s< 0x1 goto pc+778 1639: (bf) r6 = r8 1640: (0f) r6 += r7 1641: (b4) w1 = 0 1642: (6b) *(u16 *)(r6 +108) = r1 1643: (79) r3 = *(u64 *)(r10 -352) 1644: (79) r9 = *(u64 *)(r10 -456) 1645: (bf) r1 = r9 1646: (b4) w2 = 1 1647: (85) call bpf_probe_read_user_str#114 R1 unbounded memory access, make sure to bounds check any such access In the above code r0 and r1 are implicitly related. Clang knows that, but verifier isn't able to infer this relationship. Yonghong Song narrowed down this "regression" in code generation to a recent Clang optimization change ([0]), which for BPF target generates code pattern that BPF verifier can't handle and loses track of register boundaries. This patch works around the issue by adding an BPF assembly-based helper that helps to prove to the verifier that upper bound of the register is a given constant by controlling the exact share of generated BPF instruction sequence. This fixes the immediate issue for strobemeta selftest. [0] https://github.com/llvm/llvm-project/commit/acabad9ff6bf13e00305d9d8621ee8eafc1f8b08 Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20211029182907.166910-1-andrii@kernel.org
2021-11-01Merge tag 'locks-v5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "Most of this is just follow-on cleanup work of documentation and comments from the mandatory locking removal in v5.15. The only real functional change is that LOCK_MAND flock() support is also being removed, as it has basically been non-functional since the v2.5 days" * tag 'locks-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fs: remove leftover comments from mandatory locking removal locks: remove changelog comments docs: fs: locks.rst: update comment about mandatory file locking Documentation: remove reference to now removed mandatory-locking doc locks: remove LOCK_MAND flock lock support
2021-11-01bpf: Disallow unprivileged bpf by defaultPawan Gupta
Disabling unprivileged BPF would help prevent unprivileged users from creating certain conditions required for potential speculative execution side-channel attacks on unmitigated affected hardware. A deep dive on such attacks and current mitigations is available here [0]. Sync with what many distros are currently applying already, and disable unprivileged BPF by default. An admin can enable this at runtime, if necessary, as described in 08389d888287 ("bpf: Add kconfig knob for disabling unpriv bpf by default"). [0] "BPF and Spectre: Mitigating transient execution attacks", Daniel Borkmann, eBPF Summit '21 https://ebpf.io/summit-2021-slides/eBPF_Summit_2021-Keynote-Daniel_Borkmann-BPF_and_Spectre.pdf Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/bpf/0ace9ce3f97656d5f62d11093ad7ee81190c3c25.1635535215.git.pawan.kumar.gupta@linux.intel.com
2021-11-01Merge tag 'tpmdd-next-v5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd Pull tpm updates from Jarkko Sakkinen: "Only bug fixes" * tag 'tpmdd-next-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd: tpm_tis_spi: Add missing SPI ID tpm: fix Atmel TPM crash caused by too frequent queries tpm: Check for integer overflow in tpm2_map_response_body() tpm: tis: Kconfig: Add helper dependency on COMPILE_TEST
2021-11-01Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds
Pull memory folios from Matthew Wilcox: "Add memory folios, a new type to represent either order-0 pages or the head page of a compound page. This should be enough infrastructure to support filesystems converting from pages to folios. The point of all this churn is to allow filesystems and the page cache to manage memory in larger chunks than PAGE_SIZE. The original plan was to use compound pages like THP does, but I ran into problems with some functions expecting only a head page while others expect the precise page containing a particular byte. The folio type allows a function to declare that it's expecting only a head page. Almost incidentally, this allows us to remove various calls to VM_BUG_ON(PageTail(page)) and compound_head(). This converts just parts of the core MM and the page cache. For 5.17, we intend to convert various filesystems (XFS and AFS are ready; other filesystems may make it) and also convert more of the MM and page cache to folios. For 5.18, multi-page folios should be ready. The multi-page folios offer some improvement to some workloads. The 80% win is real, but appears to be an artificial benchmark (postgres startup, which isn't a serious workload). Real workloads (eg building the kernel, running postgres in a steady state, etc) seem to benefit between 0-10%. I haven't heard of any performance losses as a result of this series. Nobody has done any serious performance tuning; I imagine that tweaking the readahead algorithm could provide some more interesting wins. There are also other places where we could choose to create large folios and currently do not, such as writes that are larger than PAGE_SIZE. I'd like to thank all my reviewers who've offered review/ack tags: Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil Babka, William Kucharski, Yu Zhao and Zi Yan. I'd also like to thank those who gave feedback I incorporated but haven't offered up review tags for this part of the series: Nick Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard, Hugh Dickins, and probably a few others who I forget" * tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits) mm/writeback: Add folio_write_one mm/filemap: Add FGP_STABLE mm/filemap: Add filemap_get_folio mm/filemap: Convert mapping_get_entry to return a folio mm/filemap: Add filemap_add_folio() mm/filemap: Add filemap_alloc_folio mm/page_alloc: Add folio allocation functions mm/lru: Add folio_add_lru() mm/lru: Convert __pagevec_lru_add_fn to take a folio mm: Add folio_evictable() mm/workingset: Convert workingset_refault() to take a folio mm/filemap: Add readahead_folio() mm/filemap: Add folio_mkwrite_check_truncate() mm/filemap: Add i_blocks_per_folio() mm/writeback: Add folio_redirty_for_writepage() mm/writeback: Add folio_account_redirty() mm/writeback: Add folio_clear_dirty_for_io() mm/writeback: Add folio_cancel_dirty() mm/writeback: Add folio_account_cleaned() mm/writeback: Add filemap_dirty_folio() ...
2021-11-01vsprintf: Update %pGp documentation about that it prints hex valuePetr Mladek
The commit 23efd0804c0a869dfb1e7 ("vsprintf: Make %pGp print the hex value") changed the behavior of %pGp printk format. Update the documentation accordingly. Fixes: 23efd0804c0a869dfb1e7 ("vsprintf: Make %pGp print the hex value") Reviewed-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/YXlKqCPY9suM4mfT@alley
2021-11-01Merge branch 'SMC-tracepoints'David S. Miller
Tony Lu says: ==================== Tracepoints for SMC This patch set introduces tracepoints for SMC, including the tracepoints basic code. The tracepoitns would help us to track SMC's behaviors by automatic tools, or other BPF tools, and zero overhead if not enabled. Compared with kprobe and other dymatic tools, the tracepoints are considered as stable API, and less overhead for tracing with easy-to-use API. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01net/smc: Introduce tracepoint for smcr link downTony Lu
SMC-R link down event is important to help us find links' issues, we should track this event, especially in the single nic mode, which means upper layer connection would be shut down. Then find out the direct link-down reason in time, not only increased the counter, also the location of the code who triggered this event. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01net/smc: Introduce tracepoints for tx and rx msgTony Lu
This introduce two tracepoints for smc tx and rx msg to help us diagnosis issues of data path. These two tracepoitns don't cover the path of CORK or MSG_MORE in tx, just the top half of data path. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01net/smc: Introduce tracepoint for fallbackTony Lu
This introduces tracepoint for smc fallback to TCP, so that we can track which connection and why it fallbacks, and map the clcsocks' pointer with /proc/net/tcp to find more details about TCP connections. Compared with kprobe or other dynamic tracing, tracepoints are stable and easy to use. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01Merge branch 'amt-driver'David S. Miller
Taehee Yoo says: ==================== amt: add initial driver for Automatic Multicast Tunneling (AMT) This is an implementation of AMT(Automatic Multicast Tunneling), RFC 7450. https://datatracker.ietf.org/doc/html/rfc7450 This implementation supports IGMPv2, IGMPv3, MLDv1, MLDv2, and IPv4 underlay. Summary of RFC 7450 The purpose of this protocol is to provide multicast tunneling. The main use-case of this protocol is to provide delivery multicast traffic from a multicast-enabled network to sites that lack multicast connectivity to the source network. There are two roles in AMT protocol, Gateway, and Relay. The main purpose of Gateway mode is to forward multicast listening information(IGMP, MLD) to the source. The main purpose of Relay mode is to forward multicast data to listeners. These multicast traffics(IGMP, MLD, multicast data packets) are tunneled. Listeners are located behind Gateway endpoint. But gateway itself can be a listener too. Senders are located behind Relay endpoint. ___________ _________ _______ ________ | | | | | | | | | Listeners <-----> Gateway <-----> Relay <-----> Source | |___________| |_________| |_______| |________| IGMP/MLD---------(encap)-----------> <-------------(decap)--------(encap)------Multicast Data Usage of AMT interface 1. Create gateway interface ip link add amtg type amt mode gateway local 10.0.0.1 discovery 10.0.0.2 \ dev gw1_rt gateway_port 2268 relay_port 2268 2. Create Relay interface ip link add amtr type amt mode relay local 10.0.0.2 dev relay_rt \ relay_port 2268 max_tunnels 4 v1 -> v2: - Eliminate sparse warnings. - Use bool type instead of __be16 for identifying v4/v6 protocol. v2 -> v3: - Fix compile warning due to unsed variable. - Add missing spinlock comment. - Update help message of amt in Kconfig. v3 -> v4: - Split patch. - Use CHECKSUM_NONE instead of CHECKSUM_UNNECESSARY. - Fix compile error. v4 -> v5: - Remove unnecessary rcu_read_lock(). - Remove unnecessary amt_change_mtu(). - Change netlink error message. - Add validation for IFLA_AMT_LOCAL_IP and IFLA_AMT_DISCOVERY_IP. - Add comments in amt.h. - Add missing dev_put() in error path of amt_newlink(). - Fix typo. - Add BUILD_BUG_ON() in amt_smb_cb(). - Use macro instead of magic values. - Use kzalloc() instead of kmalloc(). - Add selftest script. v5 -> v6: - Reset remote_ip in amt_dev_stop(). v6 -> v7: - Fix compile error. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01selftests: add amt interface selftest scriptTaehee Yoo
This is selftest script for amt interface. This script includes basic forwarding scenarion and torture scenario. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01amt: add mld report message handlerTaehee Yoo
In the previous patch, igmp report handler was added. That handler can be used for mld too. So, it uses that common code to parse mld report message. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01amt: add multicast(IGMP) report message handlerTaehee Yoo
amt 'Relay' interface manages multicast groups(igmp/mld) and sources. In order to manage, it should have the function to parse igmp/mld report messages. So, this adds the logic for parsing igmp report messages and saves them on their own data structure. struct amt_group_node means one group(igmp/mld). struct amt_source_node means one source. The same source can't exist in the same group. The same group can exist in the same tunnel because it manages the host address too. The group information is used when forwarding multicast data. If there are no groups in the specific tunnel, Relay doesn't forward it. Although Relay manages sources, it doesn't support the source filtering feature. Because the reason to manage sources is just that in order to manage group more correctly. In the next patch, MLD part will be added. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01amt: add data plane of amt interfaceTaehee Yoo
Before forwarding multicast traffic, the amt interface establishes between gateway and relay. In order to establish, amt defined some message type and those message flow looks like the below. Gateway Relay ------- ----- : Request : [1] | N | |---------------------->| | Membership Query | [2] | N,MAC,gADDR,gPORT | |<======================| [3] | Membership Update | | ({G:INCLUDE({S})}) | |======================>| | | ---------------------:-----------------------:--------------------- | | | | | | *Multicast Data | *IP Packet(S,G) | | | gADDR,gPORT |<-----------------() | | *IP Packet(S,G) |<======================| | | ()<-----------------| | | | | | | ---------------------:-----------------------:--------------------- ~ ~ ~ Request ~ [4] | N' | |---------------------->| | Membership Query | [5] | N',MAC',gADDR',gPORT' | |<======================| [6] | | | Teardown | | N,MAC,gADDR,gPORT | |---------------------->| | | [7] | Membership Update | | ({G:INCLUDE({S})}) | |======================>| | | ---------------------:-----------------------:--------------------- | | | | | | *Multicast Data | *IP Packet(S,G) | | | gADDR',gPORT' |<-----------------() | | *IP Packet (S,G) |<======================| | | ()<-----------------| | | | | | | ---------------------:-----------------------:--------------------- | | : : 1. Discovery - Sent by Gateway to Relay - To find Relay unique ip address 2. Advertisement - Sent by Relay to Gateway - Contains the unique IP address 3. Request - Sent by Gateway to Relay - Solicit to receive 'Query' message. 4. Query - Sent by Relay to Gateway - Contains General Query message. 5. Update - Sent by Gateway to Relay - Contains report message. 6. Multicast Data - Sent by Relay to Gateway - encapsulated multicast traffic. 7. Teardown - Not supported at this time. Except for the Teardown message, it supports all messages. In the next patch, IGMP/MLD logic will be added. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01amt: add control plane of amt interfaceTaehee Yoo
It adds definitions and control plane code for AMT. this is very similar to udp tunneling interfaces such as gtp, vxlan, etc. In the next patch, data plane code will be added. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01Merge branch 'netdevsim-device-and-bus'David S. Miller
Jakub Kicinski says: ==================== netdevsim: improve separation between device and bus VF config falls strangely in between device and bus responsibilities today. Because of this bus.c sticks fingers directly into struct nsim_dev and we look at nsim_bus_dev in many more places than necessary. Make bus.c contain pure interface code, and move the particulars of the logic (which touch on eswitch, devlink reloads etc) to dev.c. Rename the functions at the boundary of the interface to make the separation clearer. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01netdevsim: rename 'driver' entry pointsJakub Kicinski
Rename functions serving as driver entry points from nsim_dev_... to nsim_drv_... this makes the API boundary between bus and dev clearer. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01netdevsim: move max vf config to devJakub Kicinski
max_vfs is a strange little beast because the file hangs off of nsim's debugfs, but it configures a field in the bus device. Move it to dev.c, let's look at it as if the device driver was imposing VF limit based on FW info (like pci_sriov_set_totalvfs()). Again, when moving refactor the function not to hold the vfs lock pointlessly while parsing the input. Wrap the access from the read side in READ_ONCE() to appease concurrency checkers. Do not check if return value from snprintf() is negative... Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01netdevsim: move details of vf config to devJakub Kicinski
Since "eswitch" configuration was added bus.c contains a lot of device details which really belong to dev.c. Restructure the code while moving it. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01netdevsim: move vfconfig to nsim_devJakub Kicinski
When netdevsim got split into the faux bus vfconfig ended up in the bus device (think pci_dev) which is strange because it contains very networky not to say netdevy information. Move it to nsim_dev, which is the driver "priv" structure for the device. To make sure we don't race with probe/remove take the device lock (much like PCI). While at it remove the NULL-checking of vfconfigs. It appears to be pointless. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01netdevsim: take rtnl_lock when assigning num_vfsJakub Kicinski
Legacy VF NDOs look at num_vfs and then based on that index into vfconfig. If we don't rtnl_lock() num_vfs may get set to 0 and vfconfig freed/replaced while the NDO is running. We don't need to protect replacing vfconfig since it's only done when num_vfs is 0. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01Merge branch 'devlink-locking'David S. Miller
Jakub Kicinski says: ==================== improve ethtool/rtnl vs devlink locking During ethtool netlink development we decided to move some of the commmands to devlink. Since we don't want drivers to implement both devlink and ethtool version of the commands ethtool ioctl falls back to calling devlink. Unfortunately devlink locks must be taken before rtnl_lock. This results in a questionable dev_hold() / rtnl_unlock() / devlink / rtnl_lock() / dev_put() pattern. This method "works" but it working depends on drivers in question not doing much in ethtool_ops->begin / complete, and on the netdev not having needs_free_netdev set. Since commit 437ebfd90a25 ("devlink: Count struct devlink consumers") we can hold a reference on a devlink instance and prevent it from going away (sort of like netdev with dev_hold()). We can use this to create a more natural reference nesting where we get a ref on the devlink instance and make the devlink call entirely outside of the rtnl_lock section. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01ethtool: don't drop the rtnl_lock half way thru the ioctlJakub Kicinski
devlink compat code needs to drop rtnl_lock to take devlink->lock to ensure correct lock ordering. This is problematic because we're not strictly guaranteed that the netdev will not disappear after we re-lock. It may open a possibility of nested ->begin / ->complete calls. Instead of calling into devlink under rtnl_lock take a ref on the devlink instance and make the call after we've dropped rtnl_lock. We (continue to) assume that netdevs have an implicit reference on the devlink returned from ndo_get_devlink_port Note that ndo_get_devlink_port will now get called under rtnl_lock. That should be fine since none of the drivers seem to be taking serious locks inside ndo_get_devlink_port. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01devlink: expose get/put functionsJakub Kicinski
Allow those who hold implicit reference on a devlink instance to try to take a full ref on it. This will be used from netdev code which has an implicit ref because of driver call ordering. Note that after recent changes devlink_unregister() may happen before netdev unregister, but devlink_free() should still happen after, so we are safe to try, but we can't just refcount_inc() and assume it's not zero. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01ethtool: handle info/flash data copying outside rtnl_lockJakub Kicinski
We need to increase the lifetime of the data for .get_info and .flash_update beyond their handlers inside rtnl_lock. Allocate a union on the heap and use it instead. Note that we now copy the ethcmd before we lookup dev, hopefully there is no crazy user space depending on error codes. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01ethtool: push the rtnl_lock into dev_ethtool()Jakub Kicinski
Don't take the lock in net/core/dev_ioctl.c, we'll have things to do outside rtnl_lock soon. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01Merge branch 'mana-misc'David S. Miller
Dexuan Cui says: ==================== net: mana: some misc patches Patch 1 is a small fix. Patch 2 reports OS info to the PF driver. Before the patch, the req fields were all zeros. Patch 3 fixes and cleans up the error handling of HWC creation failure. Patch 4 adds the callbacks for hibernation/kexec. It's based on patch 3. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01net: mana: Support hibernation and kexecDexuan Cui
Implement the suspend/resume/shutdown callbacks for hibernation/kexec. Add mana_gd_setup() and mana_gd_cleanup() for some common code, and use them in the mand_gd_* callbacks. Reuse mana_probe/remove() for the hibernation path. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01net: mana: Improve the HWC error handlingDexuan Cui
Currently when the HWC creation fails, the error handling is flawed, e.g. if mana_hwc_create_channel() -> mana_hwc_establish_channel() fails, the resources acquired in mana_hwc_init_queues() is not released. Enhance mana_hwc_destroy_channel() to do the proper cleanup work and call it accordingly. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01net: mana: Report OS info to the PF driverDexuan Cui
The PF driver might use the OS info for statistical purposes. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01net: mana: Fix the netdev_err()'s vPort argument in mana_init_port()Dexuan Cui
Use the correct port index rather than 0. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01Merge branch 'mptcp-selftests'David S. Miller
Mat Martineau says: ==================== mptcp: Some selftest improvements Here are a couple of selftest changes for MPTCP. Patch 1 fixes a mistake where the wrong protocol (TCP vs MPTCP) could be requested on the listening socket in some link failure tests. Patch 2 refactors the simulataneous flow tests to improve timing accuracy and give more consistent results. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01selftests: mptcp: more stable simult_flows testsPaolo Abeni
Currently the simult_flows.sh self-tests are not very stable, especially when running on slow VMs. The tests measure runtime for transfers on multiple subflows and check that the time is near the theoretical maximum. The current test infra introduces a bit of jitter in test runtime, due to multiple explicit delays. Additionally the runtime is measured by the shell script wrapper. On a slow VM, the script overhead is measurable and subject to relevant jitter. One solution to make the test more stable would be adding more slack to the expected time; that could possibly hide real regressions. Instead move the measurement inside the command doing the transfer, and drop most unneeded sleeps. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01selftests: mptcp: fix proto type in link_failure testsGeliang Tang
In listener_ns, we should pass srv_proto argument to mptcp_connect command, not cl_proto. Fixes: 7d1e6f1639044 ("selftests: mptcp: add testcase for active-back") Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01ibmvnic: delay complete()Sukadev Bhattiprolu
If we get CRQ_INIT, we set errno to -EIO and first call complete() to notify the waiter. Then we try to schedule a FAILOVER reset. If this occurs while adapter is in PROBING state, ibmvnic_reset() changes the error code to EAGAIN and returns without scheduling the FAILOVER. The purpose of setting error code to EAGAIN is to ask the waiter to retry. But due to the earlier complete() call, the waiter may already have seen the -EIO response and decided not to retry. This can cause intermittent failures when bringing up ibmvnic adapters during boot, specially in in kexec/kdump kernels. Defer the complete() call until after scheduling the reset. Also streamline the error code to EAGAIN. Don't see why we need EIO sometimes. All 3 callers of ibmvnic_reset_init() can handle EAGAIN. Fixes: 17c8705838a5 ("ibmvnic: Return error code if init interrupted by transport event") Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Reviewed-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01ibmvnic: Process crqs after enabling interruptsSukadev Bhattiprolu
Soon after registering a CRQ it is possible that we get a fail over or maybe a CRQ_INIT from the VIOS while interrupts were disabled. Look for any such CRQs after enabling interrupts. Otherwise we can intermittently fail to bring up ibmvnic adapters during boot, specially in kexec/kdump kernels. Fixes: 032c5e82847a ("Driver for IBM System i/p VNIC protocol") Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Reviewed-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01ibmvnic: don't stop queue in xmitSukadev Bhattiprolu
If adapter's resetting bit is on, discard the packet but don't stop the transmit queue - instead leave that to the reset code. With this change, it is possible that we may get several calls to ibmvnic_xmit() that simply discard packets and return. But if we stop the queue here, we might end up doing so just after __ibmvnic_open() started the queues (during a hard/soft reset) and before the ->resetting bit was cleared. If that happens, there will be no one to restart queue and transmissions will be blocked indefinitely. This can cause a TIMEOUT reset and with auto priority failover enabled, an unnecessary FAILOVER reset to less favored backing device and then a FAILOVER back to the most favored backing device. If we hit the window repeatedly, we can get stuck in a loop of TIMEOUT, FAILOVER, FAILOVER resets leaving the adapter unusable for extended periods of time. Fixes: 7f5b030830fe ("ibmvnic: Free skb's in cases of failure in transmit") Reported-by: Abdul Haleem <abdhalee@in.ibm.com> Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Reviewed-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>