summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-03-15Linux 5.6-rc6v5.6-rc6Linus Torvalds
2020-03-15Merge tag 'irq-urgent-2020-03-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Thomas Gleixner: "A single commit to handle an erratum in Cavium ThunderX to prevent access to GIC registers which are broken in the implementation" * tag 'irq-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3: Workaround Cavium erratum 38539 when reading GICD_TYPER2
2020-03-15Merge tag 'locking-urgent-2020-03-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex fix from Thomas Gleixner: "Fix for yet another subtle futex issue. The futex code used ihold() to prevent inodes from vanishing, but ihold() does not guarantee inode persistence. Replace the inode pointer with a per boot, machine wide, unique inode identifier. The second commit fixes the breakage of the hash mechanism which causes a 100% performance regression" * tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Unbreak futex hashing futex: Fix inode life-time issue
2020-03-15Merge tag 'x86-urgent-2020-03-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "Two fixes for x86: - Map EFI runtime service data as encrypted when SEV is enabled. Otherwise e.g. SMBIOS data cannot be properly decoded by dmidecode. - Remove the warning in the vector management code which triggered when a managed interrupt affinity changed outside of a CPU hotplug operation. The warning was correct until the recent core code change that introduced a CPU isolation feature which needs to migrate managed interrupts away from online CPUs under certain conditions to achieve the isolation" * tag 'x86-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vector: Remove warning on managed interrupt migration x86/ioremap: Map EFI runtime services data as encrypted for SEV
2020-03-15Merge tag 'perf-urgent-2020-03-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A pile of perf fixes: Kernel side: - AMD uncore driver: Replace the open coded sanity check with the core variant, which provides the correct error code and also leaves a hint in dmesg Tooling: - Fix the stdio input handling with glibc versions >= 2.28 - Unbreak the futex-wake benchmark which was reduced to 0 test threads due to the conversion to cpumaps - Initialize sigaction structs before invoking sys_sigactio() - Plug the mapfile memory leak in perf jevents - Fix off by one relative directory includes - Fix an undefined string comparison in perf diff" * tag 'perf-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/amd/uncore: Replace manual sampling check with CAP_NO_INTERRUPT flag tools: Fix off-by 1 relative directory includes perf jevents: Fix leak of mapfile memory perf bench: Clear struct sigaction before sigaction() syscall perf bench futex-wake: Restore thread count default to online CPU count perf top: Fix stdio interface input handling with glibc 2.28+ perf diff: Fix undefined string comparision spotted by clang's -Wstring-compare perf symbols: Don't try to find a vmlinux file when looking for kernel modules perf bench: Share some global variables to fix build with gcc 10 perf parse-events: Use asprintf() instead of strncpy() to read tracepoint files perf env: Do not return pointers to local variables perf tests bp_account: Make global variable static
2020-03-15Merge tag 'timers-urgent-2020-03-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix adding the missing time namespace adjustment in sys/sysinfo which caused sys/sysinfo to be inconsistent with /proc/uptime when read from a task inside a time namespace" * tag 'timers-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sys/sysinfo: Respect boottime inside time namespace
2020-03-15Merge tag 'ras-urgent-2020-03-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS fixes from Thomas Gleixner: "Two RAS related fixes: - Shut down the per CPU thermal throttling poll work properly when a CPU goes offline. The missing shutdown caused the poll work to be migrated to a unbound worker which triggered warnings about the usage of smp_processor_id() in preemptible context - Fix the PPIN feature initialization which missed to enable the functionality when PPIN_CTL was enabled but the MSR locked against updates" * tag 'ras-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Fix logic and comments around MSR_PPIN_CTL x86/mce/therm_throt: Undo thermal polling properly on CPU offline
2020-03-15Merge tag 'efi-urgent-2020-03-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI fixes from Thomas Gleixner: "Two EFI fixes: - Prevent a race and buffer overflow in the sysfs efivars interface which causes kernel memory corruption. - Add the missing NULL pointer checks in efivar_store_raw()" * tag 'efi-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi: Add a sanity check to efivar_store_raw() efi: Fix a race and a buffer overflow while reading efivars via sysfs
2020-03-15Merge tag 'iommu-fixes-v5.6-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull IOMMU fixes from Joerg Roedel: - Intel VT-d fixes: - RCU list handling fixes - Replace WARN_TAINT with pr_warn + add_taint for reporting firmware issues - DebugFS fixes - Fix for hugepage handling in iova_to_phys implementation - Fix for handling VMD devices, which have a domain number which doesn't fit into 16 bits - Warning message fix - MSI allocation fix for iommu-dma code - Sign-extension fix for io page-table code - Fix for AMD-Vi to properly update the is-running bit when AVIC is used * tag 'iommu-fixes-v5.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/vt-d: Populate debugfs if IOMMUs are detected iommu/amd: Fix IOMMU AVIC not properly update the is_run bit in IRTE iommu/vt-d: Ignore devices with out-of-spec domain number iommu/vt-d: Fix the wrong printing in RHSA parsing iommu/vt-d: Fix debugfs register reads iommu/vt-d: quirk_ioat_snb_local_iommu: replace WARN_TAINT with pr_warn + add_taint iommu/vt-d: dmar_parse_one_rmrr: replace WARN_TAINT with pr_warn + add_taint iommu/vt-d: dmar: replace WARN_TAINT with pr_warn + add_taint iommu/vt-d: Silence RCU-list debugging warnings iommu/vt-d: Fix RCU-list bugs in intel_iommu_init() iommu/dma: Fix MSI reservation allocation iommu/io-pgtable-arm: Fix IOVA validation for 32-bit iommu/vt-d: Fix a bug in intel_iommu_iova_to_phys() for huge page iommu/vt-d: Fix RCU list debugging warnings
2020-03-15netfilter: conntrack: re-visit sysctls in unprivileged namespacesFlorian Westphal
since commit b884fa46177659 ("netfilter: conntrack: unify sysctl handling") conntrack no longer exposes most of its sysctls (e.g. tcp timeouts settings) to network namespaces that are not owned by the initial user namespace. This patch exposes all sysctls even if the namespace is unpriviliged. compared to a 4.19 kernel, the newly visible and writeable sysctls are: net.netfilter.nf_conntrack_acct net.netfilter.nf_conntrack_timestamp .. to allow to enable accouting and timestamp extensions. net.netfilter.nf_conntrack_events .. to turn off conntrack event notifications. net.netfilter.nf_conntrack_checksum .. to disable checksum validation. net.netfilter.nf_conntrack_log_invalid .. to enable logging of packets deemed invalid by conntrack. newly visible sysctls that are only exported as read-only: net.netfilter.nf_conntrack_count .. current number of conntrack entries living in this netns. net.netfilter.nf_conntrack_max .. global upperlimit (maximum size of the table). net.netfilter.nf_conntrack_buckets .. size of the conntrack table (hash buckets). net.netfilter.nf_conntrack_expect_max .. maximum number of permitted expectations in this netns. net.netfilter.nf_conntrack_helper .. conntrack helper auto assignment. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nft_lookup: update element stateful expressionPablo Neira Ayuso
If the set element comes with an stateful expression, update it. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nf_tables: add nft_set_elem_update_expr() helper functionPablo Neira Ayuso
This helper function runs the eval path of the stateful expression of an existing set element. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nf_tables: add elements with stateful expressionsPablo Neira Ayuso
Update nft_add_set_elem() to handle the NFTA_SET_ELEM_EXPR netlink attribute. This patch allows users to to add elements with stateful expressions. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nf_tables: statify nft_expr_init()Pablo Neira Ayuso
Not exposed anymore to modules, statify this function. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nf_tables: add nft_set_elem_expr_alloc()Pablo Neira Ayuso
Add helper function to create stateful expression. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15nft_set_pipapo: Prepare for single ranged field usageStefano Brivio
A few adjustments in nft_pipapo_init() are needed to allow usage of this set back-end for a single, ranged field. Provide a convenient NFT_PIPAPO_MIN_FIELDS definition that currently makes sure that the rbtree back-end is selected instead, for sets with a single field. This finally allows a fair comparison with rbtree sets, by defining NFT_PIPAPO_MIN_FIELDS as 0 and skipping rbtree back-end initialisation: ---------------.--------------------------.-------------------------. AMD Epyc 7402 | baselines, Mpps | Mpps, % over rbtree | 1 thread |__________________________|_________________________| 3.35GHz | | | | | | 768KiB L1D$ | netdev | hash | rbtree | | pipapo | ---------------| hook | no | single | pipapo |single field| type entries | drop | ranges | field |single field| AVX2 | ---------------|--------|--------|--------|------------|------------| net,port | | | | | | 1000 | 19.0 | 10.4 | 3.8 | 6.0 +58% | 9.6 +153% | ---------------|--------|--------|--------|------------|------------| port,net | | | | | | 100 | 18.8 | 10.3 | 5.8 | 9.1 +57% |11.6 +100% | ---------------|--------|--------|--------|------------|------------| net6,port | | | | | | 1000 | 16.4 | 7.6 | 1.8 | 2.8 +55% | 6.5 +261% | ---------------|--------|--------|--------|------------|------------| port,proto | | | | [1] | [1] | 30000 | 19.6 | 11.6 | 3.9 | 0.9 -77% | 2.7 -31% | ---------------|--------|--------|--------|------------|------------| port,proto | | | | | | 10000 | 19.6 | 11.6 | 4.4 | 2.1 -52% | 5.6 +27% | ---------------|--------|--------|--------|------------|------------| port,proto | | | | | | 4 threads 10000| 77.9 | 45.1 | 17.4 | 8.3 -52% |22.4 +29% | ---------------|--------|--------|--------|------------|------------| net6,port,mac | | | | | | 10 | 16.5 | 5.4 | 4.3 | 4.5 +5% | 8.2 +91% | ---------------|--------|--------|--------|------------|------------| net6,port,mac, | | | | | | proto 1000 | 16.5 | 5.7 | 1.9 | 2.8 +47% | 6.6 +247% | ---------------|--------|--------|--------|------------|------------| net,mac | | | | | | 1000 | 19.0 | 8.4 | 3.9 | 6.0 +54% | 9.9 +154% | ---------------'--------'--------'--------'------------'------------' [1] Causes switch of lookup table buckets for 'port' to 4-bit groups Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15nft_set_pipapo: Introduce AVX2-based lookup implementationStefano Brivio
If the AVX2 set is available, we can exploit the repetitive characteristic of this algorithm to provide a fast, vectorised version by using 256-bit wide AVX2 operations for bucket loads and bitwise intersections. In most cases, this implementation consistently outperforms rbtree set instances despite the fact they are configured to use a given, single, ranged data type out of the ones used for performance measurements by the nft_concat_range.sh kselftest. That script, injecting packets directly on the ingoing device path with pktgen, reports, averaged over five runs on a single AMD Epyc 7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$), the figures below. CONFIG_RETPOLINE was not set here. Note that this is not a fair comparison over hash and rbtree set types: non-ranged entries (used to have a reference for hash types) would be matched faster than this, and matching on a single field only (which is the case for rbtree) is also significantly faster. However, it's not possible at the moment to choose this set type for non-ranged entries, and the current implementation also needs a few minor adjustments in order to match on less than two fields. ---------------.-----------------------------------.------------. AMD Epyc 7402 | baselines, Mpps | this patch | 1 thread |___________________________________|____________| 3.35GHz | | | | | | 768KiB L1D$ | netdev | hash | rbtree | | | ---------------| hook | no | single | | pipapo | type entries | drop | ranges | field | pipapo | AVX2 | ---------------|--------|--------|--------|--------|------------| net,port | | | | | | 1000 | 19.0 | 10.4 | 3.8 | 4.0 | 7.5 +87% | ---------------|--------|--------|--------|--------|------------| port,net | | | | | | 100 | 18.8 | 10.3 | 5.8 | 6.3 | 8.1 +29% | ---------------|--------|--------|--------|--------|------------| net6,port | | | | | | 1000 | 16.4 | 7.6 | 1.8 | 2.1 | 4.8 +128% | ---------------|--------|--------|--------|--------|------------| port,proto | | | | | | 30000 | 19.6 | 11.6 | 3.9 | 0.5 | 2.6 +420% | ---------------|--------|--------|--------|--------|------------| net6,port,mac | | | | | | 10 | 16.5 | 5.4 | 4.3 | 3.4 | 4.7 +38% | ---------------|--------|--------|--------|--------|------------| net6,port,mac, | | | | | | proto 1000 | 16.5 | 5.7 | 1.9 | 1.4 | 3.6 +26% | ---------------|--------|--------|--------|--------|------------| net,mac | | | | | | 1000 | 19.0 | 8.4 | 3.9 | 2.5 | 6.4 +156% | ---------------'--------'--------'--------'--------'------------' A similar strategy could be easily reused to implement specialised versions for other SIMD sets, and I plan to post at least a NEON version at a later time. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15nft_set_pipapo: Prepare for vectorised implementation: helpersStefano Brivio
Move most macros and helpers to a header file, so that they can be conveniently used by related implementations. No functional changes are intended here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15nft_set_pipapo: Prepare for vectorised implementation: alignmentStefano Brivio
SIMD vector extension sets require stricter alignment than native instruction sets to operate efficiently (AVX, NEON) or for some instructions to work at all (AltiVec). Provide facilities to define arbitrary alignment for lookup tables and scratch maps. By defining byte alignment with NFT_PIPAPO_ALIGN, lt_aligned and scratch_aligned pointers become available. Additional headroom is allocated, and pointers to the possibly unaligned, originally allocated areas are kept so that they can be freed. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15nft_set_pipapo: Add support for 8-bit lookup groups and dynamic switchStefano Brivio
While grouping matching bits in groups of four saves memory compared to the more natural choice of 8-bit words (lookup table size is one eighth), it comes at a performance cost, as the number of lookup comparisons is doubled, and those also needs bitshifts and masking. Introduce support for 8-bit lookup groups, together with a mapping mechanism to dynamically switch, based on defined per-table size thresholds and hysteresis, between 8-bit and 4-bit groups, as tables grow and shrink. Empty sets start with 8-bit groups, and per-field tables are converted to 4-bit groups if they get too big. An alternative approach would have been to swap per-set lookup operation functions as needed, but this doesn't allow for different group sizes in the same set, which looks desirable if some fields need significantly more matching data compared to others due to heavier impact of ranges (e.g. a big number of subnets with relatively simple port specifications). Allowing different group sizes for the same lookup functions implies the need for further conditional clauses, whose cost, however, appears to be negligible in tests. The matching rate figures below were obtained for x86_64 running the nft_concat_range.sh "performance" cases, averaged over five runs, on a single thread of an AMD Epyc 7402 CPU, and for aarch64 on a single thread of a BCM2711 (Raspberry Pi 4 Model B 4GB), clocked at a stable 2147MHz frequency: ---------------.-----------------------------------.------------. AMD Epyc 7402 | baselines, Mpps | this patch | 1 thread |___________________________________|____________| 3.35GHz | | | | | | 768KiB L1D$ | netdev | hash | rbtree | | | ---------------| hook | no | single | pipapo | pipapo | type entries | drop | ranges | field | 4 bits | bit switch | ---------------|--------|--------|--------|--------|------------| net,port | | | | | | 1000 | 19.0 | 10.4 | 3.8 | 2.8 | 4.0 +43% | ---------------|--------|--------|--------|--------|------------| port,net | | | | | | 100 | 18.8 | 10.3 | 5.8 | 5.5 | 6.3 +14% | ---------------|--------|--------|--------|--------|------------| net6,port | | | | | | 1000 | 16.4 | 7.6 | 1.8 | 1.3 | 2.1 +61% | ---------------|--------|--------|--------|--------|------------| port,proto | | | | | [1] | 30000 | 19.6 | 11.6 | 3.9 | 0.3 | 0.5 +66% | ---------------|--------|--------|--------|--------|------------| net6,port,mac | | | | | | 10 | 16.5 | 5.4 | 4.3 | 2.6 | 3.4 +31% | ---------------|--------|--------|--------|--------|------------| net6,port,mac, | | | | | | proto 1000 | 16.5 | 5.7 | 1.9 | 1.0 | 1.4 +40% | ---------------|--------|--------|--------|--------|------------| net,mac | | | | | | 1000 | 19.0 | 8.4 | 3.9 | 1.7 | 2.5 +47% | ---------------'--------'--------'--------'--------'------------' [1] Causes switch of lookup table buckets for 'port', not 'proto', to 4-bit groups ---------------.-----------------------------------.------------. BCM2711 | baselines, Mpps | this patch | 1 thread |___________________________________|____________| 2147MHz | | | | | | 32KiB L1D$ | netdev | hash | rbtree | | | ---------------| hook | no | single | pipapo | pipapo | type entries | drop | ranges | field | 4 bits | bit switch | ---------------|--------|--------|--------|--------|------------| net,port | | | | | | 1000 | 1.63 | 1.37 | 0.87 | 0.61 | 0.70 +17% | ---------------|--------|--------|--------|--------|------------| port,net | | | | | | 100 | 1.64 | 1.36 | 1.02 | 0.78 | 0.81 +4% | ---------------|--------|--------|--------|--------|------------| net6,port | | | | | | 1000 | 1.56 | 1.27 | 0.65 | 0.34 | 0.50 +47% | ---------------|--------|--------|--------|--------|------------| port,proto [2] | | | | | | 10000 | 1.68 | 1.43 | 0.84 | 0.30 | 0.40 +13% | ---------------|--------|--------|--------|--------|------------| net6,port,mac | | | | | | 10 | 1.56 | 1.14 | 1.02 | 0.62 | 0.66 +6% | ---------------|--------|--------|--------|--------|------------| net6,port,mac, | | | | | | proto 1000 | 1.56 | 1.12 | 0.64 | 0.27 | 0.40 +48% | ---------------|--------|--------|--------|--------|------------| net,mac | | | | | | 1000 | 1.63 | 1.26 | 0.87 | 0.41 | 0.53 +29% | ---------------'--------'--------'--------'--------'------------' [2] Using 10000 entries instead of 30000 as it would take way too long for the test script to generate all of them Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15nft_set_pipapo: Generalise group size for bucketsStefano Brivio
Get rid of all hardcoded assumptions that buckets in lookup tables correspond to four-bit groups, and replace them with appropriate calculations based on a variable group size, now stored in struct field. The group size could now be in principle any divisor of eight. Note, though, that lookup and get functions need an implementation intimately depending on the group size, and the only supported size there, currently, is four bits, which is also the initial and only used size at the moment. While at it, drop 'groups' from struct nft_pipapo: it was never used. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: flowtable: add tunnel encap/decap action offload supportwenxu
This patch add tunnel encap decap action offload in the flowtable offload. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: flowtable: add tunnel match offload supportwenxu
This patch support both ipv4 and ipv6 tunnel_id, tunnel_src and tunnel_dst match for flowtable offload Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: flowtable: add indr block setup supportwenxu
Add etfilter flowtable support indr-block setup. It makes flowtable offload vlan and tunnel device. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: flowtable: add nf_flow_table_block_offload_init()wenxu
Add nf_flow_table_block_offload_init prepare for the indr block offload patch Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: xt_IDLETIMER: clean up some indentingDan Carpenter
These lines were indented wrong so Smatch complained. net/netfilter/xt_IDLETIMER.c:81 idletimer_tg_show() warn: inconsistent indenting Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: bitwise: use more descriptive variable-names.Jeremy Sowden
Name the mask and xor data variables, "mask" and "xor," instead of "d1" and "d2." Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: Replace zero-length array with flexible-array memberGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] Lastly, fix checkpatch.pl warning WARNING: __aligned(size) is preferred over __attribute__((aligned(size))) in net/bridge/netfilter/ebtables.c This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nft_set_pipapo: make the symbol 'nft_pipapo_get' staticChen Wandun
Fix the following sparse warning: net/netfilter/nft_set_pipapo.c:739:6: warning: symbol 'nft_pipapo_get' was not declared. Should it be static? Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Acked-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: cleanup unused macroLi RongQing
TEMPLATE_NULLS_VAL is not used after commit 0838aa7fcfcd ("netfilter: fix netns dependencies with conntrack templates") PFX is not used after commit 8bee4bad03c5b ("netfilter: xt extensions: use pr_<level>") Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nf_tables: make all set structs constFlorian Westphal
They do not need to be writeable anymore. v2: remove left-over __read_mostly annotation in set_pipapo.c (Stefano) Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nf_tables: make sets built-inFlorian Westphal
Placing nftables set support in an extra module is pointless: 1. nf_tables needs dynamic registeration interface for sake of one module 2. nft heavily relies on sets, e.g. even simple rule like "nft ... tcp dport { 80, 443 }" will not work with _SETS=n. IOW, either nftables isn't used or both nf_tables and nf_tables_set modules are needed anyway. With extra module: 307K net/netfilter/nf_tables.ko 79K net/netfilter/nf_tables_set.ko text data bss dec filename 146416 3072 545 150033 nf_tables.ko 35496 1817 0 37313 nf_tables_set.ko This patch: 373K net/netfilter/nf_tables.ko 178563 4049 545 183157 nf_tables.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nft_tunnel: add support for geneve optsXin Long
Like vxlan and erspan opts, geneve opts should also be supported in nft_tunnel. The difference is geneve RFC (draft-ietf-nvo3-geneve-14) allows a geneve packet to carry multiple geneve opts. So with this patch, nftables/libnftnl would do: # nft add table ip filter # nft add chain ip filter input { type filter hook input priority 0 \; } # nft add tunnel filter geneve_02 { type geneve\; id 2\; \ ip saddr 192.168.1.1\; ip daddr 192.168.1.2\; \ sport 9000\; dport 9001\; dscp 1234\; ttl 64\; flags 1\; \ opts \"1:1:34567890,2:2:12121212,3:3:1212121234567890\"\; } # nft list tunnels table filter table ip filter { tunnel geneve_02 { id 2 ip saddr 192.168.1.1 ip daddr 192.168.1.2 sport 9000 dport 9001 tos 18 ttl 64 flags 1 geneve opts 1:1:34567890,2:2:12121212,3:3:1212121234567890 } } v1->v2: - no changes, just post it separately. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: xtables: Add snapshot of hardidletimer targetManoj Basapathi
This is a snapshot of hardidletimer netfilter target. This patch implements a hardidletimer Xtables target that can be used to identify when interfaces have been idle for a certain period of time. Timers are identified by labels and are created when a rule is set with a new label. The rules also take a timeout value (in seconds) as an option. If more than one rule uses the same timer label, the timer will be restarted whenever any of the rules get a hit. One entry for each timer is created in sysfs. This attribute contains the timer remaining for the timer to expire. The attributes are located under the xt_idletimer class: /sys/class/xt_idletimer/timers/<label> When the timer expires, the target module sends a sysfs notification to the userspace, which can then decide what to do (eg. disconnect to save power) Compared to IDLETIMER, HARDIDLETIMER can send notifications when CPU is in suspend too, to notify the timer expiry. v1->v2: Moved all functionality into IDLETIMER module to avoid code duplication per comment from Florian. Signed-off-by: Manoj Basapathi <manojbm@codeaurora.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: flowtable: Use nf_flow_offload_tuple for stats as wellPaul Blakey
This patch doesn't change any functionality. Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15Merge tag 'irqchip-fixes-5.6-2' of ↵Thomas Gleixner
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent Pull irqchip fixes from Marc Zyngier: - Add workaround for Cavium/Marvell ThunderX unimplemented GIC registers
2020-03-15geneve: move debug check after netdev unregisterFlorian Westphal
The debug check must be done after unregister_netdevice_many() call -- the list_del() for this is done inside .ndo_stop. Fixes: 2843a25348f8 ("geneve: speedup geneve tunnels dismantle") Reported-and-tested-by: <syzbot+68a8ed58e3d17c700de5@syzkaller.appspotmail.com> Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15cdc_ncm: Fix the build warningAlexander Bersenev
The ndp32->wLength is two bytes long, so replace cpu_to_le32 with cpu_to_le16. Fixes: 0fa81b304a79 ("cdc_ncm: Implement the 32-bit version of NCM Transfer Block") Signed-off-by: Alexander Bersenev <bay@hackerdom.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15net/packet: tpacket_rcv: avoid a producer race conditionWillem de Bruijn
PACKET_RX_RING can cause multiple writers to access the same slot if a fast writer wraps the ring while a slow writer is still copying. This is particularly likely with few, large, slots (e.g., GSO packets). Synchronize kernel thread ownership of rx ring slots with a bitmap. Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL while holding the sk receive queue lock. They release this lock before copying and set tp_status to TP_STATUS_USER to release to userspace when done. During copying, another writer may take the lock, also see TP_STATUS_KERNEL, and start writing to the same slot. Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a slot, test and set with the lock held. To release race-free, update tp_status and owner bit as a transaction, so take the lock again. This is the one of a variety of discussed options (see Link below): * instead of a shadow ring, embed the data in the slot itself, such as in tp_padding. But any test for this field may match a value left by userspace, causing deadlock. * avoid the lock on release. This leaves a small race if releasing the shadow slot before setting TP_STATUS_USER. The below reproducer showed that this race is not academic. If releasing the slot after tp_status, the race is more subtle. See the first link for details. * add a new tp_status TP_KERNEL_OWNED to avoid the transactional store of two fields. But, legacy applications may interpret all non-zero tp_status as owned by the user. As libpcap does. So this is possible only opt-in by newer processes. It can be added as an optional mode. * embed the struct at the tail of pg_vec to avoid extra allocation. The implementation proved no less complex than a separate field. The additional locking cost on release adds contention, no different than scaling on multicore or multiqueue h/w. In practice, below reproducer nor small packet tcpdump showed a noticeable change in perf report in cycles spent in spinlock. Where contention is problematic, packet sockets support mitigation through PACKET_FANOUT. And we can consider adding opt-in state TP_KERNEL_OWNED. Easy to reproduce by running multiple netperf or similar TCP_STREAM flows concurrently with `tcpdump -B 129 -n greater 60000`. Based on an earlier patchset by Jon Rosen. See links below. I believe this issue goes back to the introduction of tpacket_rcv, which predates git history. Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html Suggested-by: Jon Rosen <jrosen@cisco.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jon Rosen <jrosen@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15Merge branch 'mptcp-simplify-mptcp_accept'David S. Miller
Paolo Abeni says: ==================== mptcp: simplify mptcp_accept() Currently we allocate the MPTCP master socket at accept time. The above makes mptcp_accept() quite complex, and requires checks is several places for NULL MPTCP master socket. These series simplify the MPTCP accept implementation, moving the master socket allocation at syn-ack time, so that we drop unneeded checks with the follow-up patch. v1 -> v2: - rebased on top of 2398e3991bda7caa6b112a6f650fbab92f732b91 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15mptcp: drop unneeded checksPaolo Abeni
After the previous patch subflow->conn is always != NULL and is never changed. We can drop a bunch of now unneeded checks. v1 -> v2: - rebased on top of commit 2398e3991bda ("mptcp: always include dack if possible.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15mptcp: create msk earlyPaolo Abeni
This change moves the mptcp socket allocation from mptcp_accept() to subflow_syn_recv_sock(), so that subflow->conn is now always set for the non fallback scenario. It allows cleaning up a bit mptcp_accept() reducing the additional locking and will allow fourther cleanup in the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15net: stmmac: platform: convert to devm_platform_ioremap_resourceDejin Zheng
Use devm_platform_ioremap_resource() to simplify code, which contains platform_get_resource and devm_ioremap_resource. Signed-off-by: Dejin Zheng <zhengdejin5@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15net: mscc: ocelot: adjust maxlen on NPI port, not CPUVladimir Oltean
Being a non-physical port, the CPU port does not have an ocelot_port structure, so the ocelot_port_writel call inside the ocelot_port_set_maxlen() function would access data behind a NULL pointer. This is a patch for net-next only, the net tree boots fine, the bug was introduced during the net -> net-next merge. Fixes: 1d3435793123 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") Fixes: a8015ded89ad ("net: mscc: ocelot: properly account for VLAN header length when setting MRU") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15net: ip_gre: Separate ERSPAN newlink / changelink callbacksPetr Machata
ERSPAN shares most of the code path with GRE and gretap code. While that helps keep the code compact, it is also error prone. Currently a broken userspace can turn a gretap tunnel into a de facto ERSPAN one by passing IFLA_GRE_ERSPAN_VER. There has been a similar issue in ip6gretap in the past. To prevent these problems in future, split the newlink and changelink code paths. Split the ERSPAN code out of ipgre_netlink_parms() into a new function erspan_netlink_parms(). Extract a piece of common logic from ipgre_newlink() and ipgre_changelink() into ipgre_newlink_encap_setup(). Add erspan_newlink() and erspan_changelink(). Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN") Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15cxgb4: fix delete filter entry fail in unload pathShahjada Abul Husain
Currently, the hardware TID index is assumed to start from index 0. However, with the following changeset, commit c21939998802 ("cxgb4: add support for high priority filters") hardware TID index can start after the high priority region, which has introduced a regression resulting in remove filters entry failure for cxgb4 unload path. This patch fix that. Fixes: c21939998802 ("cxgb4: add support for high priority filters") Signed-off-by: Shahjada Abul Husain <shahjada@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15tipc: add NULL pointer check to prevent kernel oopsHoang Le
Calling: tipc_node_link_down()-> - tipc_node_write_unlock()->tipc_mon_peer_down() - tipc_mon_peer_down() just after disabling bearer could be caused kernel oops. Fix this by adding a sanity check to make sure valid memory access. Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15tipc: simplify trivial boolean returnHoang Le
Checking and returning 'true' boolean is useless as it will be returning at end of function Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14Merge branch 'ethtool-consolidate-irq-coalescing-part-5'David S. Miller
Jakub Kicinski says: ==================== ethtool: consolidate irq coalescing - part 5 Convert more drivers following the groundwork laid in a recent patch set [1] and continued in [2], [3], [4]. The aim of the effort is to consolidate irq coalescing parameter validation in the core. This set converts further 15 drivers in drivers/net/ethernet. One more conversion sets to come. [1] https://lore.kernel.org/netdev/20200305051542.991898-1-kuba@kernel.org/ [2] https://lore.kernel.org/netdev/20200306010602.1620354-1-kuba@kernel.org/ [3] https://lore.kernel.org/netdev/20200310021512.1861626-1-kuba@kernel.org/ [4] https://lore.kernel.org/netdev/20200311223302.2171564-1-kuba@kernel.org/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14net: via: reject unsupported coalescing paramsJakub Kicinski
Set ethtool_ops->supported_coalesce_params to let the core reject unsupported coalescing parameters. This driver did not previously reject unsupported parameters. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>