summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-02-13s390/time: introduce new store_tod_clock_ext()Heiko Carstens
Introduce new store_tod_clock_ext() function, which is the same like store_tod_clock_ext_cc() except that it doesn't return a condition code. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-02-13s390/time: rename store_tod_clock_ext() and use union tod_clockHeiko Carstens
Rename store_tod_clock_ext() to store_tod_clock_ext_cc() to reflect that it returns a condition code and also use union tod_clock as parameter. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-02-13s390/time: introduce union tod_clockHeiko Carstens
Introduce union tod_clock which is supposed to be used to decode and access various fields of the result of STORE CLOCK EXTENDED. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-02-13s390,alpha: switch to 64-bit ino_tHeiko Carstens
s390 and alpha are the only 64 bit architectures with a 32-bit ino_t. Since this is quite unusual this causes bugs from time to time. See e.g. commit ebce3eb2f7ef ("ceph: fix inode number handling on arches with 32-bit ino_t") for an example. This (obviously) also prevents s390 and alpha to use 64-bit ino_t for tmpfs. See commit b85a7a8bb573 ("tmpfs: disallow CONFIG_TMPFS_INODE64 on s390"). Therefore switch both s390 and alpha to 64-bit ino_t. This should only have an effect on the ustat system call. To prevent ABI breakage define struct ustat compatible to the old layout and change sys_ustat() accordingly. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-02-13s390: split cleanup_sieSven Schnelle
The current code uses the address in %r11 to figure out whether it was called from the machine check handler or from a normal interrupt handler. Instead of doing this implicit logic (which is mostly a leftover from the old critical cleanup approach) just add a second label and use that. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-02-13s390: use r13 in cleanup_sie as temp registerSven Schnelle
Instead of thrashing r11 which is normally our pointer to struct pt_regs on the stack, use r13 as temporary register in the BR_EX macro. r13 is already used in cleanup_sie, so no need to thrash another register. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-02-13s390: fix kernel asce loading when sie is interruptedSven Schnelle
If a machine check is coming in during sie, the PU saves the control registers to the machine check save area. Afterwards mcck_int_handler is called, which loads __LC_KERNEL_ASCE into %cr1. Later the code restores %cr1 from the machine check area, but that is wrong when SIE was interrupted because the machine check area still contains the gmap asce. Instead it should return with either __KERNEL_ASCE in %cr1 when interrupted in SIE or the previous %cr1 content saved in the machine check save area. Fixes: 87d598634521 ("s390/mm: remove set_fs / rework address space handling") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Cc: <stable@kernel.org> # v5.8+ Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-02-13s390: add stack for machine check handlerSven Schnelle
The previous code used the normal kernel stack for machine checks. This is problematic when a machine check interrupts a system call or interrupt handler right at the beginning where registers are set up. Assume system_call is interrupted at the first instruction and a machine check is triggered. The machine check handler is called, checks the PSW to see whether it is coming from user space, notices that it is already in kernel mode but %r15 still contains the user space stack. This would lead to a kernel crash. There are basically two ways of fixing that: Either using the 'critical cleanup' approach which compares the address in the PSW to see whether it is already at a point where the stack has been set up, or use an extra stack for the machine check handler. For simplicity, we will go with the second approach and allocate an extra stack. This adds some memory overhead for large systems, but usually large system have plenty of memory so this isn't really a concern. But it keeps the mchk stack setup simple and less error prone. Fixes: 0b0ed657fe00 ("s390: remove critical section cleanup from entry.S") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Cc: <stable@kernel.org> # v5.8+ Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-02-13s390: use WRITE_ONCE when re-allocating async stackSven Schnelle
The code does: S390_lowcore.async_stack = new + STACK_INIT_OFFSET; But the compiler is free to first assign one value and add the other value later. If a IRQ would be coming in between these two operations, it would run with an invalid stack. Prevent this by using WRITE_ONCE. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-02-13s390: open code SWITCH_KERNEL macroSven Schnelle
This is a preparation patch for two later bugfixes. In the past both int_handler and machine check handler used SWITCH_KERNEL to switch to the kernel stack. However, SWITCH_KERNEL doesn't work properly in machine check context. So instead of adding more complexity to this macro, just remove it. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Cc: <stable@kernel.org> # v5.8+ Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-02-13io_uring: kill cached requests from exiting task closing the ringJens Axboe
Be nice and prune these upfront, in case the ring is being shared and one of the tasks is going away. This is a bit more important now that we account the allocations. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-13io_uring: add helper to free all request cachesJens Axboe
We have three different ones, put it in a helper for easy calling. This is in preparation for doing it outside of ring freeing as well. With that in mind, also ensure that we do the proper locking for safe calling from a context where the ring it still live. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-13io_uring: allow task match to be passed to io_req_cache_free()Jens Axboe
No changes in this patch, just allows a caller to pass in a targeted task that we must match for freeing requests in the cache. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-13MAINTAINERS: Add git tree for KVM/mipsTiezhu Yang
There is no git tree for KVM/mips in MAINTAINERS, it is not convinent to rebase, add it. Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-02-13MIPS: Use common way to parse elfcorehdrJinyang He
"elfcorehdr" can be parsed at kernel/crash_dump.c Signed-off-by: Jinyang He <hejinyang@loongson.cn> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-02-13MIPS: Simplify EVA cache handlingThomas Bogendoerfer
protected_cache_op is only used for flushing user addresses, so we only need to define protected_cache_op different in EVA mode and be done with it. Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-02-13Revert "MIPS: kernel: {ftrace,kgdb}: Set correct address limit for cache ↵Thomas Bogendoerfer
flushes" This reverts commit 6ebda44f366478d1eea180d93154e7d97b591f50. All icache flushes in this code paths are done via flush_icache_range(), which only uses normal cache instruction. And this is the correct thing for EVA mode, too. So no need to do set_fs(KERNEL_DS) here. Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-13MIPS: remove CONFIG_DMA_PERDEV_COHERENTChristoph Hellwig
Just select DMA_NONCOHERENT and ARCH_HAS_SETUP_DMA_OPS from the MIPS_GENERIC platform instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Huacai Chen <chenhuacai@kernel.org> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-02-13MIPS: remove CONFIG_DMA_MAYBE_COHERENTChristoph Hellwig
CONFIG_DMA_MAYBE_COHERENT just guards two early init options now. Just enable them unconditionally for CONFIG_DMA_NONCOHERENT. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-02-13driver core: lift dma_default_coherent into common codeChristoph Hellwig
Lift the dma_default_coherent variable from the mips architecture code to the driver core. This allows an architecture to sdefault all device to be DMA coherent at run time, even if the kernel is build with support for DMA noncoherent device. By allowing device_initialize to set the ->dma_coherent field to this default the amount of arch hooks required for this behavior can be greatly reduced. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-02-13MIPS: refactor the runtime coherent vs noncoherent DMA indicatorsChristoph Hellwig
Replace the global coherentio enum, and the hw_coherentio (fake) boolean variables with a single boolean dma_default_coherent flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-02-13MIPS/alchemy: factor out the DMA coherent setupChristoph Hellwig
Factor out a alchemy_dma_coherent helper that determines if the platform is DMA coherent. Also stop initializing the hw_coherentio variable, given that is only ever set to a non-zero value by the malta setup code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-02-13MIPS/malta: simplify plat_setup_iocoherencyChristoph Hellwig
Given that plat_mem_setup runs before earlyparams are handled and malta selects CONFIG_DMA_MAYBE_COHERENT, coherentio can only be set to IO_COHERENCE_DEFAULT at this point. So remove the checking for other options and merge plat_enable_iocoherency into plat_setup_iocoherency to simplify the code a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-02-13MIPS: Add basic support for ptrace single stepTiezhu Yang
In the current code, arch_has_single_step() is not defined on MIPS, that means MIPS does not support instruction single-step for user mode. Delve is a debugger for the Go programming language, the ptrace syscall PtraceSingleStep() failed [1] on MIPS and then the single step function can not work well, we can see that PtraceSingleStep() definition returns ptrace(PTRACE_SINGLESTEP) [2]. So it is necessary to support ptrace single step on MIPS. At the beginning, we try to use the Debug Single Step exception on the Loongson 3A4000 platform, but it has no effect when set CP0_DEBUG SSt bit, this is because CP0_DEBUG NoSSt bit is 1 which indicates no single-step feature available [3], so this way which is dependent on the hardware is almost impossible. With further research, we find out there exists a common way used with break instruction in arch/alpha/kernel/ptrace.c, it is workable. For the above analysis, define arch_has_single_step(), add the common function user_enable_single_step() and user_disable_single_step(), set flag TIF_SINGLESTEP for child process, use break instruction to set breakpoint. We can use the following testcase to test it: tools/testing/selftests/breakpoints/step_after_suspend_test.c $ make -C tools/testing/selftests TARGETS=breakpoints $ cd tools/testing/selftests/breakpoints Without this patch: $ ./step_after_suspend_test -n TAP version 13 1..4 # ptrace(PTRACE_SINGLESTEP) not supported on this architecture: Input/output error ok 1 # SKIP CPU 0 # ptrace(PTRACE_SINGLESTEP) not supported on this architecture: Input/output error ok 2 # SKIP CPU 1 # ptrace(PTRACE_SINGLESTEP) not supported on this architecture: Input/output error ok 3 # SKIP CPU 2 # ptrace(PTRACE_SINGLESTEP) not supported on this architecture: Input/output error ok 4 # SKIP CPU 3 # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:4 error:0 With this patch: $ ./step_after_suspend_test -n TAP version 13 1..4 ok 1 CPU 0 ok 2 CPU 1 ok 3 CPU 2 ok 4 CPU 3 # Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0 [1] https://github.com/go-delve/delve/blob/master/pkg/proc/native/threads_linux.go#L50 [2] https://github.com/go-delve/delve/blob/master/vendor/golang.org/x/sys/unix/syscall_linux.go#L1573 [3] http://www.t-es-t.hu/download/mips/md00047f.pdf Reported-by: Guoqi Chen <chenguoqi@loongson.cn> Signed-off-by: Xingxing Su <suxingxing@loongson.cn> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-02-12Merge branch 'Xilinx-axienet-updates'David S. Miller
Robert Hancock says: ==================== Xilinx axienet updates Updates to the Xilinx AXI Ethernet driver to add support for an additional ethtool operation, and to support dynamic switching between 1000BaseX and SGMII interface modes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12net: axienet: Support dynamic switching between 1000BaseX and SGMIIRobert Hancock
Newer versions of the Xilinx AXI Ethernet core (specifically version 7.2 or later) allow the core to be configured with a PHY interface mode of "Both", allowing either 1000BaseX or SGMII modes to be selected at runtime. Add support for this in the driver to allow better support for applications which can use both fiber and copper SFP modules. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12dt-bindings: net: xilinx_axienet: add xlnx,switch-x-sgmii attributeRobert Hancock
Document the new xlnx,switch-x-sgmii attribute which is used to indicate that the Ethernet core supports dynamic switching between 1000BaseX and SGMII. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12net: axienet: hook up nway_reset ethtool operationRobert Hancock
Hook up the nway_reset ethtool operation to the corresponding phylink function so that "ethtool -r" can be supported. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12Merge branch 'Add support of pointer to struct in global'Alexei Starovoitov
Dmitrii Banshchikov says: ==================== This patchset adds support of pointers to type with known size among global function arguments. The motivation is to overcome the limit on the maximum number of allowed arguments and avoid tricky and unoptimal ways of passing arguments. A referenced type may contain pointers but access via such pointers cannot be veirified currently. v2 -> v3 - Fix reg ID generation - Fix commit description - Fix typo - Fix tests v1 -> v2: - Allow pointer to any type with known size rather than struct only - Allow pointer in global functions only - Add more tests - Fix wrapping and v1 comments ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-02-12selftests/bpf: Add unit tests for pointers in global functionsDmitrii Banshchikov
test_global_func9 - check valid pointer's scenarios test_global_func10 - check that a smaller type cannot be passed as a larger one test_global_func11 - check that CTX pointer cannot be passed test_global_func12 - check access to a null pointer test_global_func13 - check access to an arbitrary pointer value test_global_func14 - check that an opaque pointer cannot be passed test_global_func15 - check that a variable has an unknown value after it was passed to a global function by pointer test_global_func16 - check access to uninitialized stack memory test_global_func_args - check read and write operations through a pointer Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210212205642.620788-5-me@ubique.spb.ru
2021-02-12bpf: Support pointers in global func argsDmitrii Banshchikov
Add an ability to pass a pointer to a type with known size in arguments of a global function. Such pointers may be used to overcome the limit on the maximum number of arguments, avoid expensive and tricky workarounds and to have multiple output arguments. A referenced type may contain pointers but indirect access through them isn't supported. The implementation consists of two parts. If a global function has an argument that is a pointer to a type with known size then: 1) In btf_check_func_arg_match(): check that the corresponding register points to NULL or to a valid memory region that is large enough to contain the expected argument's type. 2) In btf_prepare_func_args(): set the corresponding register type to PTR_TO_MEM_OR_NULL and its size to the size of the expected type. Only global functions are supported because allowance of pointers for static functions might break validation. Consider the following scenario. A static function has a pointer argument. A caller passes pointer to its stack memory. Because the callee can change referenced memory verifier cannot longer assume any particular slot type of the caller's stack memory hence the slot type is changed to SLOT_MISC. If there is an operation that relies on slot type other than SLOT_MISC then verifier won't be able to infer safety of the operation. When verifier sees a static function that has a pointer argument different from PTR_TO_CTX then it skips arguments check and continues with "inline" validation with more information available. The operation that relies on the particular slot type now succeeds. Because global functions were not allowed to have pointer arguments different from PTR_TO_CTX it's not possible to break existing and valid code. Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210212205642.620788-4-me@ubique.spb.ru
2021-02-12bpf: Extract nullable reg type conversion into a helper functionDmitrii Banshchikov
Extract conversion from a register's nullable type to a type with a value. The helper will be used in mark_ptr_not_null_reg(). Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210212205642.620788-3-me@ubique.spb.ru
2021-02-12bpf: Rename bpf_reg_state variablesDmitrii Banshchikov
Using "reg" for an array of bpf_reg_state and "reg[i + 1]" for an individual bpf_reg_state is error-prone and verbose. Use "regs" for the former and "reg" for the latter as other code nearby does. Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210212205642.620788-2-me@ubique.spb.ru
2021-02-12net: axienet: Handle deferred probe on clock properlyRobert Hancock
This driver is set up to use a clock mapping in the device tree if it is present, but still work without one for backward compatibility. However, if getting the clock returns -EPROBE_DEFER, then we need to abort and return that error from our driver initialization so that the probe can be retried later after the clock is set up. Move clock initialization to earlier in the process so we do not waste as much effort if the clock is not yet available. Switch to use devm_clk_get_optional and abort initialization on any error reported. Also enable the clock regardless of whether the controller is using an MDIO bus, as the clock is required in any case. Fixes: 09a0354cadec267be7f ("net: axienet: Use clock framework to get device clock rate") Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12Merge branch 'tcp-mem-pressure-vs-SO_RCVLOWAT'David S. Miller
Eric Dumazet says: ==================== tcp: mem pressure vs SO_RCVLOWAT First patch fixes an issue for applications using SO_RCVLOWAT to reduce context switches. Second patch is a cleanup. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12tcp: factorize logic into tcp_epollin_ready()Eric Dumazet
Both tcp_data_ready() and tcp_stream_is_readable() share the same logic. Add tcp_epollin_ready() helper to avoid duplication. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12tcp: fix SO_RCVLOWAT related hangs under mem pressureEric Dumazet
While commit 24adbc1676af ("tcp: fix SO_RCVLOWAT hangs with fat skbs") fixed an issue vs too small sk_rcvbuf for given sk_rcvlowat constraint, it missed to address issue caused by memory pressure. 1) If we are under memory pressure and socket receive queue is empty. First incoming packet is allowed to be queued, after commit 76dfa6082032 ("tcp: allow one skb to be received per socket under memory pressure") But we do not send EPOLLIN yet, in case tcp_data_ready() sees sk_rcvlowat is bigger than skb length. 2) Then, when next packet comes, it is dropped, and we directly call sk->sk_data_ready(). 3) If application is using poll(), tcp_poll() will then use tcp_stream_is_readable() and decide the socket receive queue is not yet filled, so nothing will happen. Even when sender retransmits packets, phases 2) & 3) repeat and flow is effectively frozen, until memory pressure is off. Fix is to consider tcp_under_memory_pressure() to take care of global memory pressure or memcg pressure. Fixes: 24adbc1676af ("tcp: fix SO_RCVLOWAT hangs with fat skbs") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Arjun Roy <arjunroy@google.com> Suggested-by: Wei Wang <weiwan@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12Merge branch '40GbE' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== 40GbE Intel Wired LAN Driver Updates 2021-02-12 This series contains updates to i40e, ice, and ixgbe drivers. Maciej does cleanups on the following drivers. For i40e, removes redundant check for XDP prog, cleans up no longer relevant information, and removes an unused function argument. For ice, removes local variable use, instead returning values directly. Moves skb pointer from buffer to ring and removes an unneeded check for xdp_prog in zero copy path. Also removes a redundant MTU check when changing it. For i40e, ice, and ixgbe, stores the rx_offset in the Rx ring as the value is constant so there's no need for continual calls. Bjorn folds a decrement into a while statement. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12ibmvnic: change IBMVNIC_MAX_IND_DESCS to 16Dany Madden
The supported indirect subcrq entries on Power8 is 16. Power9 supports 128. Redefined this value to 16 to minimize the driver from having to reset when migrating between Power9 and Power8. In our rx/tx performance testing, we found no performance difference between 16 and 128 at this time. Fixes: f019fb6392e5 ("ibmvnic: Introduce indirect subordinate Command Response Queue buffer") Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12Merge branch 'tc-mpls-selftests'David S. Miller
Guillaume Nault says: ==================== selftests: tc: Test tc-flower's MPLS features A couple of patches for exercising the MPLS filters of tc-flower. Patch 1 tests basic MPLS matching features: those that only work on the first label stack entry (that is, the mpls_label, mpls_tc, mpls_bos and mpls_ttl options). Patch 2 tests the more generic "mpls" and "lse" options, which allow matching MPLS fields beyond the first stack entry. In both patches, special care is taken to skip these new tests for incompatible versions of tc. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12selftests: tc: Add generic mpls matching support for tc-flowerGuillaume Nault
Add tests in tc_flower.sh for generic matching on MPLS Label Stack Entries. The label, tc, bos and ttl fields are tested for the first and second labels. For each field, the minimal and maximal values are tested (the former at depth 1 and the later at depth 2). There are also tests for matching the presence of a label stack entry at a given depth. In order to reduce the amount of code, all "lse" subcommands are tested in match_mpls_lse_test(). Action "continue" is used, so that test packets are evaluated by all filters. Then, we can verify if each filter matched the expected number of packets. Some versions of tc-flower produced invalid json output when dumping MPLS filters with depth > 1. Skip the test if tc isn't recent enough. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12selftests: tc: Add basic mpls_* matching support for tc-flowerGuillaume Nault
Add tests in tc_flower.sh for mpls_label, mpls_tc, mpls_bos and mpls_ttl. For each keyword, test the minimal and maximal values. Selectively skip these new mpls tests for tc versions that don't support them. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12Merge branch 'brport-flags'David S. Miller
Vladimir Oltean says: ==================== Cleanup in brport flags switchdev offload for DSA The initial goal of this series was to have better support for standalone ports mode on the DSA drivers like ocelot/felix and sja1105. This turned out to require some API adjustments in both directions: to the information presented to and by the switchdev notifier, and to the API presented to the switch drivers by the DSA layer. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12net: dsa: sja1105: offload bridge port flags to deviceVladimir Oltean
The chip can configure unicast flooding, broadcast flooding and learning. Learning is per port, while flooding is per {ingress, egress} port pair and we need to configure the same value for all possible ingress ports towards the requested one. While multicast flooding is not officially supported, we can hack it by using a feature of the second generation (P/Q/R/S) devices, which is that FDB entries are maskable, and multicast addresses always have an odd first octet. So by putting a match-all for 00:01:00:00:00:00 addr and 00:01:00:00:00:00 mask at the end of the FDB, we make sure that it is always checked last, and does not take precedence in front of any other MDB. So it behaves effectively as an unknown multicast entry. For the first generation switches, this feature is not available, so unknown multicast will always be treated the same as unknown unicast. So the only thing we can do is request the user to offload the settings for these 2 flags in tandem, i.e. ip link set swp2 type bridge_slave flood off Error: sja1105: This chip cannot configure multicast flooding independently of unicast. ip link set swp2 type bridge_slave flood off mcast_flood off ip link set swp2 type bridge_slave mcast_flood on Error: sja1105: This chip cannot configure multicast flooding independently of unicast. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12net: mscc: ocelot: offload bridge port flags to deviceVladimir Oltean
We should not be unconditionally enabling address learning, since doing that is actively detrimential when a port is standalone and not offloading a bridge. Namely, if a port in the switch is standalone and others are offloading the bridge, then we could enter a situation where we learn an address towards the standalone port, but the bridged ports could not forward the packet there, because the CPU is the only path between the standalone and the bridged ports. The solution of course is to not enable address learning unless the bridge asks for it. We need to set up the initial port flags for no learning and flooding everything, and also when the port joins and leaves the bridge. The flood configuration was already configured ok for standalone mode in ocelot_init, we just need to disable learning in ocelot_init_port. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12net: mscc: ocelot: use separate flooding PGID for broadcastVladimir Oltean
In preparation of offloading the bridge port flags which have independent settings for unknown multicast and for broadcast, we should also start reserving one destination Port Group ID for the flooding of broadcast packets, to allow configuring it individually. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12net: dsa: felix: restore multicast flood to CPU when NPI tagger reinitializesVladimir Oltean
ocelot_init sets up PGID_MC to include the CPU port module, and that is fine, but the ocelot-8021q tagger removes the CPU port module from the unknown multicast replicator. So after a transition from the default ocelot tagger towards ocelot-8021q and then again towards ocelot, multicast flooding towards the CPU port module will be disabled. Fixes: e21268efbe26 ("net: dsa: felix: perform switch setup for tag_8021q") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12net: dsa: act as passthrough for bridge port flagsVladimir Oltean
There are multiple ways in which a PORT_BRIDGE_FLAGS attribute can be expressed by the bridge through switchdev, and not all of them can be emulated by DSA mid-layer API at the same time. One possible configuration is when the bridge offloads the port flags using a mask that has a single bit set - therefore only one feature should change. However, DSA currently groups together unicast and multicast flooding in the .port_egress_floods method, which limits our options when we try to add support for turning off broadcast flooding: do we extend .port_egress_floods with a third parameter which b53 and mv88e6xxx will ignore? But that means that the DSA layer, which currently implements the PRE_BRIDGE_FLAGS attribute all by itself, will see that .port_egress_floods is implemented, and will report that all 3 types of flooding are supported - not necessarily true. Another configuration is when the user specifies more than one flag at the same time, in the same netlink message. If we were to create one individual function per offloadable bridge port flag, we would limit the expressiveness of the switch driver of refusing certain combinations of flag values. For example, a switch may not have an explicit knob for flooding of unknown multicast, just for flooding in general. In that case, the only correct thing to do is to allow changes to BR_FLOOD and BR_MCAST_FLOOD in tandem, and never allow mismatched values. But having a separate .port_set_unicast_flood and .port_set_multicast_flood would not allow the driver to possibly reject that. Also, DSA doesn't consider it necessary to inform the driver that a SWITCHDEV_ATTR_ID_BRIDGE_MROUTER attribute was offloaded, because it just calls .port_egress_floods for the CPU port. When we'll add support for the plain SWITCHDEV_ATTR_ID_PORT_MROUTER, that will become a real problem because the flood settings will need to be held statefully in the DSA middle layer, otherwise changing the mrouter port attribute will impact the flooding attribute. And that's _assuming_ that the underlying hardware doesn't have anything else to do when a multicast router attaches to a port than flood unknown traffic to it. If it does, there will need to be a dedicated .port_set_mrouter anyway. So we need to let the DSA drivers see the exact form that the bridge passes this switchdev attribute in, otherwise we are standing in the way. Therefore we also need to use this form of language when communicating to the driver that it needs to configure its initial (before bridge join) and final (after bridge leave) port flags. The b53 and mv88e6xxx drivers are converted to the passthrough API and their implementation of .port_egress_floods is split into two: a function that configures unicast flooding and another for multicast. The mv88e6xxx implementation is quite hairy, and it turns out that the implementations of unknown unicast flooding are actually the same for 6185 and for 6352: behind the confusing names actually lie two individual bits: NO_UNKNOWN_MC -> FLOOD_UC = 0x4 = BIT(2) NO_UNKNOWN_UC -> FLOOD_MC = 0x8 = BIT(3) so there was no reason to entangle them in the first place. Whereas the 6185 writes to MV88E6185_PORT_CTL0_FORWARD_UNKNOWN of PORT_CTL0, which has the exact same bit index. I have left the implementations separate though, for the only reason that the names are different enough to confuse me, since I am not able to double-check with a user manual. The multicast flooding setting for 6185 is in a different register than for 6352 though. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12net: switchdev: pass flags and mask to both {PRE_,}BRIDGE_FLAGS attributesVladimir Oltean
This switchdev attribute offers a counterproductive API for a driver writer, because although br_switchdev_set_port_flag gets passed a "flags" and a "mask", those are passed piecemeal to the driver, so while the PRE_BRIDGE_FLAGS listener knows what changed because it has the "mask", the BRIDGE_FLAGS listener doesn't, because it only has the final value. But certain drivers can offload only certain combinations of settings, like for example they cannot change unicast flooding independently of multicast flooding - they must be both on or both off. The way the information is passed to switchdev makes drivers not expressive enough, and unable to reject this request ahead of time, in the PRE_BRIDGE_FLAGS notifier, so they are forced to reject it during the deferred BRIDGE_FLAGS attribute, where the rejection is currently ignored. This patch also changes drivers to make use of the "mask" field for edge detection when possible. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12net: dsa: configure better brport flags when ports leave the bridgeVladimir Oltean
For a DSA switch port operating in standalone mode, address learning doesn't make much sense since that is a bridge function. In fact, address learning even breaks setups such as this one: +---------------------------------------------+ | | | +-------------------+ | | | br0 | send receive | | +--------+-+--------+ +--------+ +--------+ | | | | | | | | | | | | | swp0 | | swp1 | | swp2 | | swp3 | | | | | | | | | | | | +-+--------+-+--------+-+--------+-+--------+-+ | ^ | ^ | | | | | +-----------+ | | | +--------------------------------+ because if the switch has a single FDB (can offload a single bridge) then source address learning on swp3 can "steal" the source MAC address of swp2 from br0's FDB, because learning frames coming from swp2 will be done twice: first on the swp1 ingress port, second on the swp3 ingress port. So the hardware FDB will become out of sync with the software bridge, and when swp2 tries to send one more packet towards swp1, the ASIC will attempt to short-circuit the forwarding path and send it directly to swp3 (since that's the last port it learned that address on), which it obviously can't, because swp3 operates in standalone mode. So DSA drivers operating in standalone mode should still configure a list of bridge port flags even when they are standalone. Currently DSA attempts to call dsa_port_bridge_flags with 0, which disables egress flooding of unknown unicast and multicast, something which doesn't make much sense. For the switches that implement .port_egress_floods - b53 and mv88e6xxx, it probably doesn't matter too much either, since they can possibly inject traffic from the CPU into a standalone port, regardless of MAC DA, even if egress flooding is turned off for that port, but certainly not all DSA switches can do that - sja1105, for example, can't. So it makes sense to use a better common default there, such as "flood everything". It should also be noted that what DSA calls "dsa_port_bridge_flags()" is a degenerate name for just calling .port_egress_floods(), since nothing else is implemented - not learning, in particular. But disabling address learning, something that this driver is also coding up for, will be supported by individual drivers once .port_egress_floods is replaced with a more generic .port_bridge_flags. Previous attempts to code up this logic have been in the common bridge layer, but as pointed out by Ido Schimmel, there are corner cases that are missed when doing that: https://patchwork.kernel.org/project/netdevbpf/patch/20210209151936.97382-5-olteanv@gmail.com/ So, at least for now, let's leave DSA in charge of setting port flags before and after the bridge join and leave. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>