summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-07-07sock: unlock on error in sock_setsockopt()Dan Carpenter
If copy_from_sockptr() then we need to unlock before returning. Fixes: d463126e23f1 ("net: sock: extend SO_TIMESTAMPING for PHC binding") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-07samples/bpf: xdp_redirect_cpu_user: Cpumap qsize set larger defaultJesper Dangaard Brouer
Experience from production shows queue size of 192 is too small, as this caused packet drops during cpumap-enqueue on RX-CPU. This can be diagnosed with xdp_monitor sample program. This bpftrace program was used to diagnose the problem in more detail: bpftrace -e ' tracepoint:xdp:xdp_cpumap_kthread { @deq_bulk = lhist(args->processed,0,10,1); @drop_net = lhist(args->drops,0,10,1) } tracepoint:xdp:xdp_cpumap_enqueue { @enq_bulk = lhist(args->processed,0,10,1); @enq_drops = lhist(args->drops,0,10,1); }' Watch out for the @enq_drops counter. The @drop_net counter can happen when netstack gets invalid packets, so don't despair it can be natural, and that counter will likely disappear in newer kernels as it was a source of confusion (look at netstat info for reason of the netstack @drop_net counters). The production system was configured with CPU power-saving C6 state. Learn more in this blogpost[1]. And wakeup latency in usec for the states are: # grep -H . /sys/devices/system/cpu/cpu0/cpuidle/*/latency /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0 /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2 /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10 /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:133 Deepest state take 133 usec to wakeup from (133/10^6). The link speed is 25Gbit/s ((25*10^9/8) in bytes/sec). How many bytes can arrive with in 133 usec at this speed: (25*10^9/8)*(133/10^6) = 415625 bytes. With MTU size packets this is 275 packets, and with minimum Ethernet (incl intergap overhead) 84 bytes it is 4948 packets. Clearly default queue size is too small. Setting default cpumap queue to 2048 as worst-case (small packet) at 10Gbit/s is 1979 packets with 133 usec wakeup time, +64 packet before kthread wakeup call (due to xdp_do_flush) worst-case 2043 packets. Thus, if a packet burst on RX-CPU will enqueue packets to a remote cpumap CPU that is in deep-sleep state it can overrun the cpumap queue. The production system was also configured to avoid deep-sleep via: tuned-adm profile network-latency [1] https://jeremyeder.com/2013/08/30/oh-did-you-expect-the-cpu/ Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/162523477604.786243.13372630844944530891.stgit@firesoul
2021-07-07Merge branch 'Generic XDP improvements'Alexei Starovoitov
Kumar Kartikeya says: ==================== This small series makes some improvements to generic XDP mode and brings it closer to native XDP. Patch 1 splits out generic XDP processing into reusable parts, patch 2 adds pointer friendly wrappers for bitops (not have to cast back and forth the address of local pointer to unsigned long *), patch 3 implements generic cpumap support (details in commit) and patch 4 allows devmap bpf prog execution before generic_xdp_tx is called. Patch 5 just updates a couple of selftests to adapt to changes in behavior (in that specifying devmap/cpumap prog fd in generic mode is now allowed). Changelog: ---------- v5 -> v6 v5: https://lore.kernel.org/bpf/20210701002759.381983-1-memxor@gmail.com * Put rcpu->prog check before RCU-bh section to avoid do_softirq (Jesper) v4 -> v5 v4: https://lore.kernel.org/bpf/20210628114746.129669-1-memxor@gmail.com * Add comments and examples for new bitops macros (Alexei) v3 -> v4 v3: https://lore.kernel.org/bpf/20210622202835.1151230-1-memxor@gmail.com * Add detach now that attach of XDP program succeeds (Toke) * Clean up the test to use new ASSERT macros v2 -> v3 v2: https://lore.kernel.org/bpf/20210622195527.1110497-1-memxor@gmail.com * list_for_each_entry -> list_for_each_entry_safe (due to deletion of skb) v1 -> v2 v1: https://lore.kernel.org/bpf/20210620233200.855534-1-memxor@gmail.com * Move __ptr_{set,clear,test}_bit to bitops.h (Toke) Also changed argument order to match the bit op they wrap. * Remove map value size checking functions for cpumap/devmap (Toke) * Rework prog run for skb in cpu_map_kthread_run (Toke) * Set skb->dev to dst->dev after devmap prog has run * Don't set xdp rxq that will be overwritten in cpumap prog run ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-07-07bpf: Tidy xdp attach selftestsKumar Kartikeya Dwivedi
Support for cpumap and devmap entry progs in previous commits means the test needs to be updated for the new semantics. Also take this opportunity to convert it from CHECK macros to the new ASSERT macros. Since xdp_cpumap_attach has no subtest, put the sole test inside the test_xdp_cpumap_attach function. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-6-memxor@gmail.com
2021-07-07bpf: devmap: Implement devmap prog execution for generic XDPKumar Kartikeya Dwivedi
This lifts the restriction on running devmap BPF progs in generic redirect mode. To match native XDP behavior, it is invoked right before generic_xdp_tx is called, and only supports XDP_PASS/XDP_ABORTED/ XDP_DROP actions. We also return 0 even if devmap program drops the packet, as semantically redirect has already succeeded and the devmap prog is the last point before TX of the packet to device where it can deliver a verdict on the packet. This also means it must take care of freeing the skb, as xdp_do_generic_redirect callers only do that in case an error is returned. Since devmap entry prog is supported, remove the check in generic_xdp_install entirely. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-5-memxor@gmail.com
2021-07-07bpf: cpumap: Implement generic cpumapKumar Kartikeya Dwivedi
This change implements CPUMAP redirect support for generic XDP programs. The idea is to reuse the cpu map entry's queue that is used to push native xdp frames for redirecting skb to a different CPU. This will match native XDP behavior (in that RPS is invoked again for packet reinjected into networking stack). To be able to determine whether the incoming skb is from the driver or cpumap, we reuse skb->redirected bit that skips generic XDP processing when it is set. To always make use of this, CONFIG_NET_REDIRECT guard on it has been lifted and it is always available. >From the redirect side, we add the skb to ptr_ring with its lowest bit set to 1. This should be safe as skb is not 1-byte aligned. This allows kthread to discern between xdp_frames and sk_buff. On consumption of the ptr_ring item, the lowest bit is unset. In the end, the skb is simply added to the list that kthread is anyway going to maintain for xdp_frames converted to skb, and then received again by using netif_receive_skb_list. Bulking optimization for generic cpumap is left as an exercise for a future patch for now. Since cpumap entry progs are now supported, also remove check in generic_xdp_install for the cpumap. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-4-memxor@gmail.com
2021-07-07bitops: Add non-atomic bitops for pointersKumar Kartikeya Dwivedi
cpumap needs to set, clear, and test the lowest bit in skb pointer in various places. To make these checks less noisy, add pointer friendly bitop macros that also do some typechecking to sanitize the argument. These wrap the non-atomic bitops __set_bit, __clear_bit, and test_bit but for pointer arguments. Pointer's address has to be passed in and it is treated as an unsigned long *, since width and representation of pointer and unsigned long match on targets Linux supports. They are prefixed with double underscore to indicate lack of atomicity. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-3-memxor@gmail.com
2021-07-07net: core: Split out code to run generic XDP progKumar Kartikeya Dwivedi
This helper can later be utilized in code that runs cpumap and devmap programs in generic redirect mode and adjust skb based on changes made to xdp_buff. When returning XDP_REDIRECT/XDP_TX, it invokes __skb_push, so whenever a generic redirect path invokes devmap/cpumap prog if set, it must __skb_pull again as we expect mac header to be pulled. It also drops the skb_reset_mac_len call after do_xdp_generic, as the mac_header and network_header are advanced by the same offset, so the difference (mac_len) remains constant. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-2-memxor@gmail.com
2021-07-07Merge branch 'bpf: support input xdp_md context in BPF_PROG_TEST_RUN'Alexei Starovoitov
Zvi Effron says: ==================== This patchset adds support for passing an xdp_md via ctx_in/ctx_out in bpf_attr for BPF_PROG_TEST_RUN of XDP programs. Patch 1 adds a function to validate XDP meta data lengths. Patch 2 adds initial support for passing XDP meta data in addition to packet data. Patch 3 adds support for also specifying the ingress interface and rx queue. Patch 4 adds selftests to ensure functionality is correct. Changelog: ---------- v7->v8 v7: https://lore.kernel.org/bpf/20210624211304.90807-1-zeffron@riotgames.com/ * Fix too long comment line in patch 3 v6->v7 v6: https://lore.kernel.org/bpf/20210617232904.1899-1-zeffron@riotgames.com/ * Add Yonghong Song's Acked-by to commit message in patch 1 * Add Yonghong Song's Acked-by to commit message in patch 2 * Extracted the post-update of the xdp_md context into a function (again) * Validate that the rx queue was registered with XDP info * Decrement the reference count on a found netdevice on failure to find a valid rx queue * Decrement the reference count on a found netdevice after the XDP program is run * Drop Yonghong Song's Acked-By for patch 3 because of patch changes * Improve a comment in the selftests * Drop Yonghong Song's Acked-By for patch 4 because of patch changes v5->v6 v5: https://lore.kernel.org/bpf/20210616224712.3243-1-zeffron@riotgames.com/ * Correct commit messages in patches 1 and 3 * Add Acked-by to commit message in patch 4 * Use gotos instead of returns to correctly free resources in bpf_prog_test_run_xdp * Rename xdp_metalen_valid to xdp_metalen_invalid * Improve the function signature for xdp_metalen_invalid * Merged declaration of ingress_ifindex and rx_queue_index into one line v4->v5 v4: https://lore.kernel.org/bpf/20210604220235.6758-1-zeffron@riotgames.com/ * Add new patch to introduce xdp_metalen_valid inline function to avoid duplicated code from net/core/filter.c * Correct size of bad_ctx in selftests * Make all declarations reverse Christmas tree * Move data check from xdp_convert_md_to_buff to bpf_prog_test_run_xdp * Merge xdp_convert_buff_to_md into bpf_prog_test_run_xdp * Fix line too long * Extracted common checks in selftests to a helper function * Removed redundant assignment in selftests * Reordered test cases in selftests * Check data against 0 instead of data_meta in selftests * Made selftests use EINVAL instead of hardcoded 22 * Dropped "_" from XDP function name * Changed casts in XDP program from unsigned long to long * Added a comment explaining the use of the loopback interface in selftests * Change parameter order in xdp_convert_md_to_buff to be input first * Assigned xdp->ingress_ifindex and xdp->rx_queue_index to local variables in xdp_convert_md_to_buff * Made use of "meta data" versus "metadata" consistent in comments and commit messages v3->v4 v3: https://lore.kernel.org/bpf/20210602190815.8096-1-zeffron@riotgames.com/ * Clean up nits * Validate xdp_md->data_end in bpf_prog_test_run_xdp * Remove intermediate metalen variables v2 -> v3 v2: https://lore.kernel.org/bpf/20210527201341.7128-1-zeffron@riotgames.com/ * Check errno first in selftests * Use DECLARE_LIBBPF_OPTS * Rename tattr to opts in selftests * Remove extra new line * Rename convert_xdpmd_to_xdpb to xdp_convert_md_to_buff * Rename convert_xdpb_to_xdpmd to xdp_convert_buff_to_md * Move declaration of device and rxqueue in xdp_convert_md_to_buff to patch 2 * Reorder the kfree calls in bpf_prog_test_run_xdp v1 -> v2 v1: https://lore.kernel.org/bpf/20210524220555.251473-1-zeffron@riotgames.com * Fix null pointer dereference with no context * Use the BPF skeleton and replace CHECK with ASSERT macros ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-07-07selftests/bpf: Add test for xdp_md context in BPF_PROG_TEST_RUNZvi Effron
Add a test for using xdp_md as a context to BPF_PROG_TEST_RUN for XDP programs. The test uses a BPF program that takes in a return value from XDP meta data, then reduces the size of the XDP meta data by 4 bytes. Test cases validate the possible failure cases for passing in invalid xdp_md contexts, that the return value is successfully passed in, and that the adjusted meta data is successfully copied out. Co-developed-by: Cody Haas <chaas@riotgames.com> Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Cody Haas <chaas@riotgames.com> Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Zvi Effron <zeffron@riotgames.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210707221657.3985075-5-zeffron@riotgames.com
2021-07-07bpf: Support specifying ingress via xdp_md context in BPF_PROG_TEST_RUNZvi Effron
Support specifying the ingress_ifindex and rx_queue_index of xdp_md contexts for BPF_PROG_TEST_RUN. The intended use case is to allow testing XDP programs that make decisions based on the ingress interface or RX queue. If ingress_ifindex is specified, look up the device by the provided index in the current namespace and use its xdp_rxq for the xdp_buff. If the rx_queue_index is out of range, or is non-zero when the ingress_ifindex is 0, return -EINVAL. Co-developed-by: Cody Haas <chaas@riotgames.com> Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Cody Haas <chaas@riotgames.com> Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Zvi Effron <zeffron@riotgames.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210707221657.3985075-4-zeffron@riotgames.com
2021-07-07bpf: Support input xdp_md context in BPF_PROG_TEST_RUNZvi Effron
Support passing a xdp_md via ctx_in/ctx_out in bpf_attr for BPF_PROG_TEST_RUN. The intended use case is to pass some XDP meta data to the test runs of XDP programs that are used as tail calls. For programs that use bpf_prog_test_run_xdp, support xdp_md input and output. Unlike with an actual xdp_md during a non-test run, data_meta must be 0 because it must point to the start of the provided user data. From the initial xdp_md, use data and data_end to adjust the pointers in the generated xdp_buff. All other non-zero fields are prohibited (with EINVAL). If the user has set ctx_out/ctx_size_out, copy the (potentially different) xdp_md back to the userspace. We require all fields of input xdp_md except the ones we explicitly support to be set to zero. The expectation is that in the future we might add support for more fields and we want to fail explicitly if the user runs the program on the kernel where we don't yet support them. Co-developed-by: Cody Haas <chaas@riotgames.com> Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Cody Haas <chaas@riotgames.com> Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Zvi Effron <zeffron@riotgames.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210707221657.3985075-3-zeffron@riotgames.com
2021-07-07bpf: Add function for XDP meta data length checkZvi Effron
This commit prepares to use the XDP meta data length check in multiple places by making it into a static inline function instead of a literal. Co-developed-by: Cody Haas <chaas@riotgames.com> Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Cody Haas <chaas@riotgames.com> Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Zvi Effron <zeffron@riotgames.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210707221657.3985075-2-zeffron@riotgames.com
2021-07-08Merge tag 'drm-misc-next-fixes-2021-07-01' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-misc into drm-next Short summary of fixes pull: * amdgpu: TTM fixes * dma-buf: Doc fixes * gma500: Fix potential BO leaks in error handling * radeon: Fix NULL-ptr deref Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/YN2GK2SH64yqXqh9@linux-uq9g
2021-07-08Merge tag 'drm-intel-next-fixes-2021-07-07' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-intel into drm-next One fix targeting stable for display DP VSC, plus DG1 display fix and a bug fix of IRQs usages and cleanup references to the DRM IRQ midlayer. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/YOXDp/+CFDgJ2/7f@intel.com
2021-07-08Merge tag 'amd-drm-next-5.14-2021-07-01' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-5.14-2021-07-01: amdgpu: - Misc Navi fixes - Powergating fix - Yellow Carp updates - Beige Goby updates - S0ix fix - Revert overlay validation fix - GPU reset fix for DC - PPC64 fix - Add new dimgrey cavefish DID - RAS fix amdkfd: - SVM fixes radeon: - Fix missing drm_gem_object_put in error path Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210701042241.25449-1-alexander.deucher@amd.com
2021-07-07CIFS: Clarify SMB1 code for POSIX LockSteve French
Coverity also complains about the way we calculate the offset (starting from the address of a 4 byte array within the header structure rather than from the beginning of the struct plus 4 bytes) for SMB1 PosixLock. This changeset doesn't change the address but makes it slightly clearer. Addresses-Coverity: 711520 ("Out of bounds write") Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-07-07CIFS: Clarify SMB1 code for rename open fileSteve French
Coverity also complains about the way we calculate the offset (starting from the address of a 4 byte array within the header structure rather than from the beginning of the struct plus 4 bytes) for SMB1 RenameOpenFile. This changeset doesn't change the address but makes it slightly clearer. Addresses-Coverity: 711521 ("Out of bounds write") Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-07-07tools: bpf: Fix error in 'make -C tools/ bpf_install'Wei Li
make[2]: *** No rule to make target 'install'. Stop. make[1]: *** [Makefile:122: runqslower_install] Error 2 make: *** [Makefile:116: bpf_install] Error 2 There is no rule for target 'install' in tools/bpf/runqslower/Makefile, and there is no need to install it, so just remove 'runqslower_install'. Fixes: 9c01546d26d2 ("tools/bpf: Add runqslower tool to tools/bpf") Signed-off-by: Wei Li <liwei391@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210628030409.3459095-1-liwei391@huawei.com
2021-07-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Do not refresh timeout in SYN_SENT for syn retransmissions. Add selftest for unreplied TCP connection, from Florian Westphal. 2) Fix null dereference from error path with hardware offload in nftables. 3) Remove useless nf_ct_gre_keymap_flush() from netns exit path, from Vasily Averin. 4) Missing rcu read-lock side in ctnetlink helper info dump, also from Vasily. 5) Do not mark RST in the reply direction coming after SYN packet for an out-of-sync entry, from Ali Abdallah and Florian Westphal. 6) Add tcp_ignore_invalid_rst sysctl to allow to disable out of segment RSTs, from Ali. 7) KCSAN fix for nf_conntrack_all_lock(), from Manfred Spraul. 8) Honor NFTA_LAST_SET in nft_last. 9) Fix incorrect arithmetics when restore last_jiffies in nft_last. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-07selftests: icmp_redirect: IPv6 PMTU info should be cleared after redirectHangbin Liu
After redirecting, it's already a new path. So the old PMTU info should be cleared. The IPv6 test "mtu exception plus redirect" should only has redirect info without old PMTU. The IPv4 test can not be changed because of legacy. Fixes: ec8105352869 ("selftests: Add redirect tests") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-07selftests: icmp_redirect: remove from checking for IPv6 route getHangbin Liu
If the kernel doesn't enable option CONFIG_IPV6_SUBTREES, the RTA_SRC info will not be exported to userspace in rt6_fill_node(). And ip cmd will not print "from ::" to the route output. So remove this check. Fixes: ec8105352869 ("selftests: Add redirect tests") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-07stmmac: platform: Fix signedness bug in stmmac_probe_config_dt()YueHaibing
The "plat->phy_interface" variable is an enum and in this context GCC will treat it as an unsigned int so the error handling is never triggered. Fixes: b9f0b2f634c0 ("net: stmmac: platform: fix probe for ACPI devices") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-07stmmac: dwmac-loongson: Fix unsigned comparison to zeroYueHaibing
plat->phy_interface is unsigned integer, so the condition can't be less than zero and the warning will never printed. Fixes: 30bba69d7db4 ("stmmac: pci: Add dwmac support for Loongson") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-07Merge tag 'acpi-5.14-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more ACPI updates from Rafael Wysocki: "These include fixes of the recently introduced support for the Platform Runtime Mechanism (PRM) feature, a new backlight quirk, a suspend-to-idle wakeup fix for non-Intel platforms and a fix for the AMBA bus resource list in /proc/iomem. Specifics: - Fix up the recently added Platform Runtime Mechanism (PRM) support by correnting a couple of implementation mistakes in it and adding a Kconfig help text to describe it (Aubrey Li, Rafael Wysocki). - Add backlight quirk for Dell Vostro 3350 (Hans de Goede). - Avoid spurious wakeups from suspend-to-idle on non-Intel platforms by restricting special EC GPE handling to the Intel ones (Mario Limonciello). - Modify the AMBA bus support in ACPI to avoid adding using resource names in /proc/iomem (Liguang Zhang)" * tag 'acpi-5.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: Do not singal PRM support if not enabled ACPI: Correct \_SB._OSC bit definition for PRM ACPI: Kconfig: Provide help text for the ACPI_PRMT option ACPI: PM: Only mark EC GPE for wakeup on Intel systems ACPI: video: Add quirk for the Dell Vostro 3350 ACPI: AMBA: Fix resource name in /proc/iomem
2021-07-07Merge tag 'pm-5.14-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These include cpufreq core simplifications and fixes, cpufreq driver updates, cpuidle driver update, a generic power domains (genpd) locking fix and a debug-related simplification of the PM core. Specifics: - Drop the ->stop_cpu() (not really useful) and ->resolve_freq() (unused) cpufreq driver callbacks and modify the users of the former accordingly (Viresh Kumar, Rafael Wysocki). - Add frequency invariance support to the ACPI CPPC cpufreq driver again along with the related fixes and cleanups (Viresh Kumar). - Update the Meditak, qcom and SCMI ARM cpufreq drivers (Fabien Parent, Seiya Wang, Sibi Sankar, Christophe JAILLET). - Rename black/white-lists in the DT cpufreq driver (Viresh Kumar). - Add generic performance domains support to the dvfs DT bindings (Sudeep Holla). - Refine locking in the generic power domains (genpd) support code to avoid lock dependency issues (Stephen Boyd). - Update the MSM and qcom ARM cpuidle drivers (Bartosz Dudziak). - Simplify the PM core debug code by using ktime_us_delta() to compute time interval lengths (Mark-PK Tsai)" * tag 'pm-5.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (21 commits) PM: domains: Shrink locking area of the gpd_list_lock PM: sleep: Use ktime_us_delta() in initcall_debug_report() cpufreq: CPPC: Add support for frequency invariance arch_topology: Avoid use-after-free for scale_freq_data cpufreq: CPPC: Pass structure instance by reference cpufreq: CPPC: Fix potential memleak in cppc_cpufreq_cpu_init cpufreq: Remove ->resolve_freq() cpufreq: Reuse cpufreq_driver_resolve_freq() in __cpufreq_driver_target() cpufreq: Remove the ->stop_cpu() driver callback cpufreq: powernv: Migrate to ->exit() callback instead of ->stop_cpu() cpufreq: CPPC: Migrate to ->exit() callback instead of ->stop_cpu() cpufreq: intel_pstate: Combine ->stop_cpu() and ->offline() cpuidle: qcom: Add SPM register data for MSM8226 dt-bindings: arm: msm: Add SAW2 for MSM8226 dt-bindings: cpufreq: update cpu type and clock name for MT8173 SoC clk: mediatek: remove deprecated CLK_INFRA_CA57SEL for MT8173 SoC cpufreq: dt: Rename black/white-lists cpufreq: scmi: Fix an error message cpufreq: mediatek: add support for mt8365 dt-bindings: dvfs: Add support for generic performance domains ...
2021-07-07Merge tag 'for-v5.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply Pull power supply and reset updates from Sebastian Reichel: "Battery/charger driver changes: - convert charger-manager binding to YAML - drop bd70528-charger driver - drop pm2301-charger driver - introduce rt5033-battery driver - misc improvements and fixes" * tag 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: (42 commits) power: supply: ab8500: Fix an old bug power: supply: axp288_fuel_gauge: remove redundant continue statement power: supply: axp288_fuel_gauge: Make "T3 MRD" no_battery_list DMI entry more generic power: supply: axp288_fuel_gauge: Rename fuel_gauge_blacklist to no_battery_list power: supply: bq24190_charger: drop of_match_ptr() from device ID table drivers: power: add missing MODULE_DEVICE_TABLE in keystone-reset.c power: supply: ab8500: add missing MODULE_DEVICE_TABLE power: supply: charger-manager: add missing MODULE_DEVICE_TABLE power: reset: regulator-poweroff: add missing MODULE_DEVICE_TABLE power: supply: cpcap-charger: get the battery inserted infomation from cpcap-battery power: supply: cpcap-battery: invalidate config when incompatible measurements are read power: supply: axp20x_battery: allow disabling battery charging power: supply: max17040: drop unused platform data support power: supply: max17040: simplify POWER_SUPPLY_PROP_ONLINE power: supply: max17040: remove non-working POWER_SUPPLY_PROP_STATUS power: reset: at91-sama5d2_shdwc: Remove redundant error printing in at91_shdwc_probe() power: reset: gpio-poweroff: add missing MODULE_DEVICE_TABLE power: supply: rt5033_battery: Fix device tree enumeration dt-bindings: power: supply: Add DT schema for richtek,rt5033-battery power: supply: Drop BD70528 support ...
2021-07-07Merge tag 'linux-watchdog-5.14-rc1' of ↵Linus Torvalds
git://www.linux-watchdog.org/linux-watchdog Pull watchdog updates from Wim Van Sebroeck: - Add Mstar MSC313e WDT driver - Add support for sama7g5-wdt - Add compatible for SC7280 SoC - Add compatible for Mediatek MT8195 - sbsa: Support architecture version 1 - Removal of the MV64x60 watchdog driver - Extra PCI IDs for hpwdt - Add hrtimer-based pretimeout feature - Add {min,max}_timeout sysfs nodes - keembay timeout and pre-timeout handling - Several fixes, cleanups and improvements * tag 'linux-watchdog-5.14-rc1' of git://www.linux-watchdog.org/linux-watchdog: (56 commits) watchdog: iTCO_wdt: use dev_err() instead of pr_err() watchdog: Add Mstar MSC313e WDT driver dt-bindings: watchdog: Add Mstar MSC313e WDT devicetree bindings documentation watchdog: iTCO_wdt: Account for rebooting on second timeout dt-bindings: watchdog: Convert arm,sbsa-gwdt to DT schema dt-bindings: watchdog: sama5d4-wdt: add compatible for sama7g5-wdt watchdog: sama5d4_wdt: add support for sama7g5-wdt dt-bindings: watchdog: sama5d4-wdt: convert to yaml watchdog: ziirave_wdt: Remove VERSION_FMT defines and add sysfs newlines dt-bindings: watchdog: Add compatible for Mediatek MT8195 dt-bindings: watchdog: dw-wdt: add description for rk3568 watchdog: imx_sc_wdt: fix pretimeout watchdog: diag288_wdt: Remove redundant assignment watchdog: Add hrtimer-based pretimeout feature dt-bindings: watchdog: Add compatible for SC7280 SoC watchdog: qcom: Move suspend/resume to suspend_late/resume_early watchdog: Fix a typo in the file orion_wdt.c watchdog: jz4740: Fix return value check in jz4740_wdt_probe() watchdog: Remove MV64x60 watchdog driver doc: mtk-wdt: support pre-timeout when the bark irq is available ...
2021-07-07Merge tag 'nfsd-5.14' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: - add tracepoints for callbacks and for client creation and destruction - cache the mounts used for server-to-server copies - expose callback information in /proc/fs/nfsd/clients/*/info - don't hold locks unnecessarily while waiting for commits - update NLM to use xdr_stream, as we have for NFSv2/v3/v4 * tag 'nfsd-5.14' of git://linux-nfs.org/~bfields/linux: (69 commits) nfsd: fix NULL dereference in nfs3svc_encode_getaclres NFSD: Prevent a possible oops in the nfs_dirent() tracepoint nfsd: remove redundant assignment to pointer 'this' nfsd: Reduce contention for the nfsd_file nf_rwsem lockd: Update the NLMv4 SHARE results encoder to use struct xdr_stream lockd: Update the NLMv4 nlm_res results encoder to use struct xdr_stream lockd: Update the NLMv4 TEST results encoder to use struct xdr_stream lockd: Update the NLMv4 void results encoder to use struct xdr_stream lockd: Update the NLMv4 FREE_ALL arguments decoder to use struct xdr_stream lockd: Update the NLMv4 SHARE arguments decoder to use struct xdr_stream lockd: Update the NLMv4 SM_NOTIFY arguments decoder to use struct xdr_stream lockd: Update the NLMv4 nlm_res arguments decoder to use struct xdr_stream lockd: Update the NLMv4 UNLOCK arguments decoder to use struct xdr_stream lockd: Update the NLMv4 CANCEL arguments decoder to use struct xdr_stream lockd: Update the NLMv4 LOCK arguments decoder to use struct xdr_stream lockd: Update the NLMv4 TEST arguments decoder to use struct xdr_stream lockd: Update the NLMv4 void arguments decoder to use struct xdr_stream lockd: Update the NLMv1 SHARE results encoder to use struct xdr_stream lockd: Update the NLMv1 nlm_res results encoder to use struct xdr_stream lockd: Update the NLMv1 TEST results encoder to use struct xdr_stream ...
2021-07-07pwm: Remove redundant assignment to pointer pwmColin Ian King
The pointer pwm is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Thierry Reding <thierry.reding@gmail.com>
2021-07-07io_uring: fix drain alloc fail return codePavel Begunkov
After a recent change io_drain_req() started to fail requests with result=0 in case of allocation failure, where it should be and have been -ENOMEM. Fixes: 76cc33d79175a ("io_uring: refactor io_req_defer()") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e068110ac4293e0c56cfc4d280d0f22b9303ec08.1625682153.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-07Merge tag 'modules-for-v5.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull module updates from Jessica Yu: - Fix incorrect logic in module_kallsyms_on_each_symbol() - Fix for a Coccinelle warning * tag 'modules-for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: correctly exit module_kallsyms_on_each_symbol when fn() != 0 kernel/module: Use BUG_ON instead of if condition followed by BUG
2021-07-07Merge branches 'acpi-misc', 'acpi-video' and 'acpi-prm'Rafael J. Wysocki
* acpi-misc: ACPI: AMBA: Fix resource name in /proc/iomem * acpi-video: ACPI: video: Add quirk for the Dell Vostro 3350 * acpi-prm: ACPI: Do not singal PRM support if not enabled ACPI: Correct \_SB._OSC bit definition for PRM ACPI: Kconfig: Provide help text for the ACPI_PRMT option
2021-07-07Merge branches 'pm-cpuidle', 'pm-sleep' and 'pm-domains'Rafael J. Wysocki
* pm-cpuidle: cpuidle: qcom: Add SPM register data for MSM8226 dt-bindings: arm: msm: Add SAW2 for MSM8226 * pm-sleep: PM: sleep: Use ktime_us_delta() in initcall_debug_report() * pm-domains: PM: domains: Shrink locking area of the gpd_list_lock
2021-07-07Merge tag 'x86-fpu-2021-07-07' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu updates from Thomas Gleixner: "Fixes and improvements for FPU handling on x86: - Prevent sigaltstack out of bounds writes. The kernel unconditionally writes the FPU state to the alternate stack without checking whether the stack is large enough to accomodate it. Check the alternate stack size before doing so and in case it's too small force a SIGSEGV instead of silently corrupting user space data. - MINSIGSTKZ and SIGSTKSZ are constants in signal.h and have never been updated despite the fact that the FPU state which is stored on the signal stack has grown over time which causes trouble in the field when AVX512 is available on a CPU. The kernel does not expose the minimum requirements for the alternate stack size depending on the available and enabled CPU features. ARM already added an aux vector AT_MINSIGSTKSZ for the same reason. Add it to x86 as well. - A major cleanup of the x86 FPU code. The recent discoveries of XSTATE related issues unearthed quite some inconsistencies, duplicated code and other issues. The fine granular overhaul addresses this, makes the code more robust and maintainable, which allows to integrate upcoming XSTATE related features in sane ways" * tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits) x86/fpu/xstate: Clear xstate header in copy_xstate_to_uabi_buf() again x86/fpu/signal: Let xrstor handle the features to init x86/fpu/signal: Handle #PF in the direct restore path x86/fpu: Return proper error codes from user access functions x86/fpu/signal: Split out the direct restore code x86/fpu/signal: Sanitize copy_user_to_fpregs_zeroing() x86/fpu/signal: Sanitize the xstate check on sigframe x86/fpu/signal: Remove the legacy alignment check x86/fpu/signal: Move initial checks into fpu__restore_sig() x86/fpu: Mark init_fpstate __ro_after_init x86/pkru: Remove xstate fiddling from write_pkru() x86/fpu: Don't store PKRU in xstate in fpu_reset_fpstate() x86/fpu: Remove PKRU handling from switch_fpu_finish() x86/fpu: Mask PKRU from kernel XRSTOR[S] operations x86/fpu: Hook up PKRU into ptrace() x86/fpu: Add PKRU storage outside of task XSAVE buffer x86/fpu: Dont restore PKRU in fpregs_restore_userspace() x86/fpu: Rename xfeatures_mask_user() to xfeatures_mask_uabi() x86/fpu: Move FXSAVE_LEAK quirk info __copy_kernel_to_fpregs() x86/fpu: Rename __fpregs_load_activate() to fpregs_restore_userregs() ...
2021-07-07Merge tag 'for-linus-5.14-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: "Only two minor patches this time: one cleanup patch and one patch refreshing a Xen header" * tag 'for-linus-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: sync include/xen/interface/io/ring.h with Xen's newest version xen: Use DEVICE_ATTR_*() macro
2021-07-07Merge tag 'Wimplicit-fallthrough-clang-5.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull more fallthrough fixes from Gustavo Silva: "Fix maore fall-through warnings when building the kernel with clang and '-Wimplicit-fallthrough'" * tag 'Wimplicit-fallthrough-clang-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: Input: Fix fall-through warning for Clang scsi: aic94xx: Fix fall-through warning for Clang i3c: master: cdns: Fix fall-through warning for Clang net/mlx4: Fix fall-through warning for Clang
2021-07-07Merge tag 'hwlock-v5.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc Pull hwspinlock updates from Bjorn Andersson: "This adds a driver for the hardware spinlock in Allwinner sun6i" * tag 'hwlock-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc: dt-bindings: hwlock: sun6i: Fix various warnings in binding hwspinlock: add sun6i hardware spinlock support dt-bindings: hwlock: add sun6i_hwspinlock
2021-07-07Merge tag 'rproc-v5.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc Pull remoteproc updates from Bjorn Andersson: "This adds support for controlling the PRU and R5F clusters on the TI AM64x, the remote processor in i.MX7ULP, i.MX8MN/P and i.MX8ULP NXP and the audio, compute and modem remoteprocs in the Qualcomm SC8180x platform. It fixes improper ordering of cdev and device creation of the remoteproc control interface and it fixes resource leaks in the error handling path of rproc_add() and the Qualcomm modem and wifi remoteproc drivers. Lastly it fixes a few build warnings and replace the dummy parameter passed in the mailbox api of the stm32 driver to something not living on the stack" * tag 'rproc-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc: (32 commits) remoteproc: qcom: pas: Add SC8180X adsp, cdsp and mpss dt-bindings: remoteproc: qcom: pas: Add SC8180X adsp, cdsp and mpss remoteproc: imx_rproc: support i.MX8ULP dt-bindings: remoteproc: imx_rproc: support i.MX8ULP remoteproc: stm32: fix mbox_send_message call remoteproc: core: Cleanup device in case of failure remoteproc: core: Fix cdev remove and rproc del remoteproc: core: Move validate before device add remoteproc: core: Move cdev add before device add remoteproc: pru: Add support for various PRU cores on K3 AM64x SoCs dt-bindings: remoteproc: pru: Update bindings for K3 AM64x SoCs remoteproc: qcom_wcnss: Use devm_qcom_smem_state_get() remoteproc: qcom_q6v5: Use devm_qcom_smem_state_get() to fix missing put() soc: qcom: smem_state: Add devm_qcom_smem_state_get() dt-bindings: remoteproc: qcom: pas: Fix indentation warnings remoteproc: imx-rproc: Fix IMX_REMOTEPROC configuration remoteproc: imx_rproc: support i.MX8MN/P remoteproc: imx_rproc: support i.MX7ULP remoteproc: imx_rproc: make clk optional remoteproc: imx_rproc: initial support for mutilple start/stop method ...
2021-07-07tracing/histograms: Fix parsing of "sym-offset" modifierSteven Rostedt (VMware)
With the addition of simple mathematical operations (plus and minus), the parsing of the "sym-offset" modifier broke, as it took the '-' part of the "sym-offset" as a minus, and tried to break it up into a mathematical operation of "field.sym - offset", in which case it failed to parse (unless the event had a field called "offset"). Both .sym and .sym-offset modifiers should not be entered into mathematical calculations anyway. If ".sym-offset" is found in the modifier, then simply make it not an operation that can be calculated on. Link: https://lkml.kernel.org/r/20210707110821.188ae255@oasis.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: stable@vger.kernel.org Fixes: 100719dcef447 ("tracing: Add simple expression support to hist triggers") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-07-07CIFS: Clarify SMB1 code for deleteSteve French
Coverity also complains about the way we calculate the offset (starting from the address of a 4 byte array within the header structure rather than from the beginning of the struct plus 4 bytes) for SMB1 SetFileDisposition (which is used to unlink a file by setting the delete on close flag). This changeset doesn't change the address but makes it slightly clearer. Addresses-Coverity: 711524 ("Out of bounds write") Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-07-07CIFS: Clarify SMB1 code for SetFileSizeSteve French
Coverity also complains about the way we calculate the offset (starting from the address of a 4 byte array within the header structure rather than from the beginning of the struct plus 4 bytes) for setting the file size using SMB1. This changeset doesn't change the address but makes it slightly clearer. Addresses-Coverity: 711525 ("Out of bounds write") Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-07-07tools/runqslower: Use __state instead of stateSanjayKumar Jeyakumar
Commit 2f064a59a11f ("sched: Change task_struct::state") renamed task->state to task->__state in task_struct. Fix runqslower to use the new name of the field. Fixes: 2f064a59a11f ("sched: Change task_struct::state") Signed-off-by: SanjayKumar Jeyakumar <vjsanjay@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210707052914.21473-1-vjsanjay@gmail.com
2021-07-07spi: stm32: fixes pm_runtime calls in probe/removeAlain Volmat
Add pm_runtime calls in probe/probe error path and remove in order to be consistent in all places in ordering and ensure that pm_runtime is disabled prior to resources used by the SPI controller. This patch also fixes the 2 following warnings on driver remove: WARNING: CPU: 0 PID: 743 at drivers/clk/clk.c:594 clk_core_disable_lock+0x18/0x24 WARNING: CPU: 0 PID: 743 at drivers/clk/clk.c:476 clk_unprepare+0x24/0x2c Fixes: 038ac869c9d2 ("spi: stm32: add runtime PM support") Signed-off-by: Amelie Delaunay <amelie.delaunay@foss.st.com> Signed-off-by: Alain Volmat <alain.volmat@foss.st.com> Link: https://lore.kernel.org/r/1625646426-5826-2-git-send-email-alain.volmat@foss.st.com Signed-off-by: Mark Brown <broonie@kernel.org>
2021-07-07btrfs: zoned: fix wrong mutex unlock on failure to allocate log root treeFilipe Manana
When syncing the log, if we fail to allocate the root node for the log root tree: 1) We are unlocking fs_info->tree_log_mutex, but at this point we have not yet locked this mutex; 2) We have locked fs_info->tree_root->log_mutex, but we end up not unlocking it; So fix this by unlocking fs_info->tree_root->log_mutex instead of fs_info->tree_log_mutex. Fixes: e75f9fd194090e ("btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex") CC: stable@vger.kernel.org # 5.13+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07btrfs: don't block if we can't acquire the reclaim lockJohannes Thumshirn
If we can't acquire the reclaim_bgs_lock on block group reclaim, we block until it is free. This can potentially stall for a long time. While reclaim of block groups is necessary for a good user experience on a zoned file system, there still is no need to block as it is best effort only, just like when we're deleting unused block groups. CC: stable@vger.kernel.org # 5.13 Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07btrfs: properly split extent_map for REQ_OP_ZONE_APPENDNaohiro Aota
Damien reported a test failure with btrfs/209. The test itself ran fine, but the fsck ran afterwards reported a corrupted filesystem. The filesystem corruption happens because we're splitting an extent and then writing the extent twice. We have to split the extent though, because we're creating too large extents for a REQ_OP_ZONE_APPEND operation. When dumping the extent tree, we can see two EXTENT_ITEMs at the same start address but different lengths. $ btrfs inspect dump-tree /dev/nullb1 -t extent ... item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53 refs 1 gen 7 flags DATA extent data backref root FS_TREE objectid 257 offset 786432 count 1 item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53 refs 1 gen 7 flags DATA extent data backref root FS_TREE objectid 257 offset 786432 count 1 The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in extract_ordered_extent(). Since extract_ordered_extent() uses create_io_em() to split an existing extent_map, we will have split->orig_start != split->start. Then, it will be logged with non-zero "extent data offset". Finally, the logged entries are replayed into a duplicated EXTENT_ITEM. Introduce and use proper splitting function for extent_map. The function is intended to be simple and specific usage for extract_ordered_extent() e.g. not supporting compression case (we do not allow splitting compressed extent_map anyway). There was a question raised by Qu, in summary why we want to split the extent map (and not the bio): The problem is not the limit on the zone end, which as you mention is the same as the block group end. The problem is that data write use zone append (ZA) operations. ZA BIOs cannot be split so a large extent may need to be processed with multiple ZA BIOs, While that is also true for regular writes, the major difference is that ZA are "nameless" write operation giving back the written sectors on completion. And ZA operations may be reordered by the block layer (not intentionally though). Combine both of these characteristics and you can see that the data for a large extent may end up being shuffled when written resulting in data corruption and the impossibility to map the extent to some start sector. To avoid this problem, zoned btrfs uses the principle "one data extent == one ZA BIO". So large extents need to be split. This is unfortunate, but we can revisit this later and optimize, e.g. merge back together the fragments of an extent once written if they actually were written sequentially in the zone. Reported-by: Damien Le Moal <damien.lemoal@wdc.com> Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent") CC: stable@vger.kernel.org # 5.12+ CC: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07btrfs: rework chunk allocation to avoid exhaustion of the system chunk arrayFilipe Manana
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations") fixed a problem that resulted in exhausting the system chunk array in the superblock when there are many tasks allocating chunks in parallel. Basically too many tasks enter the first phase of chunk allocation without previous tasks having finished their second phase of allocation, resulting in too many system chunks being allocated. That was originally observed when running the fallocate tests of stress-ng on a PowerPC machine, using a node size of 64K. However that commit also introduced a deadlock where a task in phase 1 of the chunk allocation waited for another task that had allocated a system chunk to finish its phase 2, but that other task was waiting on an extent buffer lock held by the first task, therefore resulting in both tasks not making any progress. That change was later reverted by a patch with the subject "btrfs: fix deadlock with concurrent chunk allocations involving system chunks", since there is no simple and short solution to address it and the deadlock is relatively easy to trigger on zoned filesystems, while the system chunk array exhaustion is not so common. This change reworks the chunk allocation to avoid the system chunk array exhaustion. It accomplishes that by making the first phase of chunk allocation do the updates of the device items in the chunk btree and the insertion of the new chunk item in the chunk btree. This is done while under the protection of the chunk mutex (fs_info->chunk_mutex), in the same critical section that checks for available system space, allocates a new system chunk if needed and reserves system chunk space. This way we do not have chunk space reserved until the second phase completes. The same logic is applied to chunk removal as well, since it keeps reserved system space long after it is done updating the chunk btree. For direct allocation of system chunks, the previous behaviour remains, because otherwise we would deadlock on extent buffers of the chunk btree. Changes to the chunk btree are by large done by chunk allocation and chunk removal, which first reserve chunk system space and then later do changes to the chunk btree. The other remaining cases are uncommon and correspond to adding a device, removing a device and resizing a device. All these other cases do not pre-reserve system space, they modify the chunk btree right away, so they don't hold reserved space for a long period like chunk allocation and chunk removal do. The diff of this change is huge, but more than half of it is just addition of comments describing both how things work regarding chunk allocation and removal, including both the new behavior and the parts of the old behavior that did not change. CC: stable@vger.kernel.org # 5.12+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07btrfs: fix deadlock with concurrent chunk allocations involving system chunksFilipe Manana
When a task attempting to allocate a new chunk verifies that there is not currently enough free space in the system space_info and there is another task that allocated a new system chunk but it did not finish yet the creation of the respective block group, it waits for that other task to finish creating the block group. This is to avoid exhaustion of the system chunk array in the superblock, which is limited, when we have a thundering herd of tasks allocating new chunks. This problem was described and fixed by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations"). However there are two very similar scenarios where this can lead to a deadlock: 1) Task B allocated a new system chunk and task A is waiting on task B to finish creation of the respective system block group. However before task B ends its transaction handle and finishes the creation of the system block group, it attempts to allocate another chunk (like a data chunk for an fallocate operation for a very large range). Task B will be unable to progress and allocate the new chunk, because task A set space_info->chunk_alloc to 1 and therefore it loops at btrfs_chunk_alloc() waiting for task A to finish its chunk allocation and set space_info->chunk_alloc to 0, but task A is waiting on task B to finish creation of the new system block group, therefore resulting in a deadlock; 2) Task B allocated a new system chunk and task A is waiting on task B to finish creation of the respective system block group. By the time that task B enter the final phase of block group allocation, which happens at btrfs_create_pending_block_groups(), when it modifies the extent tree, the device tree or the chunk tree to insert the items for some new block group, it needs to allocate a new chunk, so it ends up at btrfs_chunk_alloc() and keeps looping there because task A has set space_info->chunk_alloc to 1, but task A is waiting for task B to finish creation of the new system block group and release the reserved system space, therefore resulting in a deadlock. In short, the problem is if a task B needs to allocate a new chunk after it previously allocated a new system chunk and if another task A is currently waiting for task B to complete the allocation of the new system chunk. Unfortunately this deadlock scenario introduced by the previous fix for the system chunk array exhaustion problem does not have a simple and short fix, and requires a big change to rework the chunk allocation code so that chunk btree updates are all made in the first phase of chunk allocation. And since this deadlock regression is being frequently hit on zoned filesystems and the system chunk array exhaustion problem is triggered in more extreme cases (originally observed on PowerPC with a node size of 64K when running the fallocate tests from stress-ng), revert the changes from that commit. The next patch in the series, with a subject of "btrfs: rework chunk allocation to avoid exhaustion of the system chunk array" does the necessary changes to fix the system chunk array exhaustion problem. Reported-by: Naohiro Aota <naohiro.aota@wdc.com> Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/ Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations") CC: stable@vger.kernel.org # 5.12+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07btrfs: zoned: print unusable percentage when reclaiming block groupsJohannes Thumshirn
When we're automatically reclaiming a zone, because its zone_unusable value is above the reclaim threshold, we're only logging how much percent of the zone's capacity are used, but not how much of the capacity is unusable. Also print the percentage of the unusable space in the block group before we're reclaiming it. Example: BTRFS info (device sdg): reclaiming chunk 230686720 with 13% used 86% unusable CC: stable@vger.kernel.org # 5.13 Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>