summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-06-27x86/tls: Fix possible spectre-v1 in do_get_thread_area()Dianzhang Chen
The index to access the threads tls array is controlled by userspace via syscall: sys_ptrace(), hence leading to a potential exploitation of the Spectre variant 1 vulnerability. The index can be controlled from: ptrace -> arch_ptrace -> do_get_thread_area. Fix this by sanitizing the user supplied index before using it to access the p->thread.tls_array. Signed-off-by: Dianzhang Chen <dianzhangchen0@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: hpa@zytor.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1561524630-3642-1-git-send-email-dianzhangchen0@gmail.com
2019-06-27x86/ptrace: Fix possible spectre-v1 in ptrace_get_debugreg()Dianzhang Chen
The index to access the threads ptrace_bps is controlled by userspace via syscall: sys_ptrace(), hence leading to a potential exploitation of the Spectre variant 1 vulnerability. The index can be controlled from: ptrace -> arch_ptrace -> ptrace_get_debugreg. Fix this by sanitizing the user supplied index before using it access thread->ptrace_bps. Signed-off-by: Dianzhang Chen <dianzhangchen0@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: hpa@zytor.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1561476617-3759-1-git-send-email-dianzhangchen0@gmail.com
2019-06-27hrtimer: Use a bullet for the returns bullet listMauro Carvalho Chehab
That gets rid of this warning: ./kernel/time/hrtimer.c:1119: WARNING: Block quote ends without a blank line; unexpected unindent. and displays nicely both at the source code and at the produced documentation. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Doc Mailing List <linux-doc@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Link: https://lkml.kernel.org/r/74ddad7dac331b4e5ce4a90e15c8a49e3a16d2ac.1561372382.git.mchehab+samsung@kernel.org
2019-06-27workqueue: Remove GPF argument from alloc_workqueue_attrs()Thomas Gleixner
All callers use GFP_KERNEL. No point in having that argument. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>
2019-06-27workqueue: Make alloc/apply/free_workqueue_attrs() staticThomas Gleixner
None of those functions have any users outside of workqueue.c. Confine them. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>
2019-06-27Merge branch 'bpf-af-xdp-mlx5e'Daniel Borkmann
Tariq Toukan says: ==================== This series contains improvements to the AF_XDP kernel infrastructure and AF_XDP support in mlx5e. The infrastructure improvements are required for mlx5e, but also some of them benefit to all drivers, and some can be useful for other drivers that want to implement AF_XDP. The performance testing was performed on a machine with the following configuration: - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz - Mellanox ConnectX-5 Ex with 100 Gbit/s link The results with retpoline disabled, single stream: txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU) rxdrop: 12.2 Mpps l2fwd: 9.4 Mpps The results with retpoline enabled, single stream: txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU) rxdrop: 9.9 Mpps l2fwd: 6.8 Mpps v2 changes: Added patches for mlx5e and addressed the comments for v1. Rebased for bpf-next. v3 changes: Rebased for the newer bpf-next, resolved conflicts in libbpf. Addressed Björn's comments for coding style. Fixed a bug in error handling flow in mlx5e_open_xsk. v4 changes: UAPI is not changed, XSK RX queues are exposed to the kernel. The lower half of the available amount of RX queues are regular queues, and the upper half are XSK RX queues. The patch "xsk: Extend channels to support combined XSK/non-XSK traffic" was dropped. The final patch was reworked accordingly. Added "net/mlx5e: Attach/detach XDP program safely", as the changes introduced in the XSK patch base on the stuff from this one. Added "libbpf: Support drivers with non-combined channels", which aligns the condition in libbpf with the condition in the kernel. Rebased over the newer bpf-next. v5 changes: In v4, ethtool reports the number of channels as 'combined' and the number of XSK RX queues as 'rx' for mlx5e. It was changed, so that 'rx' is 0, and 'combined' reports the double amount of channels if there is an active UMEM - to make libbpf happy. The patch for libbpf was dropped. Although it's still useful and fixes things, it raises some disagreement, so I'm dropping it - it's no longer useful for mlx5e anymore after the change above. v6 changes: As Maxim is out of office, I rebased the series on behalf of him, solved some conflicts, and re-spinned. ==================== Acked-by: Björn Töpel <bjorn.topel@intel.com> Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27net/mlx5e: Add XSK zero-copy supportMaxim Mikityanskiy
This commit adds support for AF_XDP zero-copy RX and TX. We create a dedicated XSK RQ inside the channel, it means that two RQs are running simultaneously: one for non-XSK traffic and the other for XSK traffic. The regular and XSK RQs use a single ID namespace split into two halves: the lower half is regular RQs, and the upper half is XSK RQs. When any zero-copy AF_XDP socket is active, changing the number of channels is not allowed, because it would break to mapping between XSK RQ IDs and channels. XSK requires different page allocation and release routines. Such functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are generic enough to be used for both regular and XSK RQs, and they use the mlx5e_page_{alloc,release} wrappers around the real allocation functions. Function pointers are not used to avoid losing the performance with retpolines. Wherever it's certain that the regular (non-XSK) page release function should be used, it's called directly. Only the stats that could be meaningful for XSK are exposed to the userspace. Those that don't take part in the XSK flow are not considered. Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ), because the newer xdpsock sample doesn't provide any Fill Ring entries at the setup stage. We create a dedicated XSK SQ in the channel. This separation has its advantages: 1. When the UMEM is closed, the XSK SQ can also be closed and stop receiving completions. If an existing SQ was used for XSK, it would continue receiving completions for the packets of the closed socket. If a new UMEM was opened at that point, it would start getting completions that don't belong to it. 2. Calculating statistics separately. When the userspace kicks the TX, the driver triggers a hardware interrupt by posting a NOP to a dedicated XSK ICO (internal control operations) SQ, in order to trigger NAPI on the right CPU core. This XSK ICO SQ is protected by a spinlock, as the userspace application may kick the TX from any core. Store the pointers to the UMEMs in the net device private context, independently from the kernel. This way the driver can distinguish between the zero-copy and non-zero-copy UMEMs. The kernel function xdp_get_umem_from_qid does not care about this difference, but the driver is only interested in zero-copy UMEMs, particularly, on the cleanup it determines whether to close the XSK RQ and SQ or not by looking at the presence of the UMEM. Use state_lock to protect the access to this area of UMEM pointers. LRO isn't compatible with XDP, but there may be active UMEMs while XDP is off. If this is the case, don't allow LRO to ensure XDP can be reenabled at any time. The validation of XSK parameters typically happens when XSK queues open. However, when the interface is down or the XDP program isn't set, it's still possible to have active AF_XDP sockets and even to open new, but the XSK queues will be closed. To cover these cases, perform the validation also in these flows: 1. A new UMEM is registered, but the XSK queues aren't going to be created due to missing XDP program or interface being down. 2. MTU changes while there are UMEMs registered. Having this early check prevents mlx5e_open_channels from failing at a later stage, where recovery is impossible and the application has no chance to handle the error, because it got the successful return value for an MTU change or XSK open operation. The performance testing was performed on a machine with the following configuration: - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz - Mellanox ConnectX-5 Ex with 100 Gbit/s link The results with retpoline disabled, single stream: txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU) rxdrop: 12.2 Mpps l2fwd: 9.4 Mpps The results with retpoline enabled, single stream: txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU) rxdrop: 9.9 Mpps l2fwd: 6.8 Mpps Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27net/mlx5e: Move queue param structs to en/params.hMaxim Mikityanskiy
structs mlx5e_{rq,sq,cq,channel}_param are going to be used in the upcoming XSK RX and TX patches. Move them to a header file to make them accessible from other C files. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27net/mlx5e: Encapsulate open/close queues into a functionMaxim Mikityanskiy
Create new functions mlx5e_{open,close}_queues to encapsulate opening and closing RQs and SQs, and call the new functions from mlx5e_{open,close}_channel. It simplifies the existing functions a bit and prepares them for the upcoming AF_XDP changes. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27net/mlx5e: Consider XSK in XDP MTU limit calculationMaxim Mikityanskiy
Use the existing mlx5e_get_linear_rq_headroom function to calculate the headroom for mlx5e_xdp_max_mtu. This function takes the XSK headroom into consideration, which will be used in the following patches. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27net/mlx5e: XDP_TX from UMEM supportMaxim Mikityanskiy
When an XDP program returns XDP_TX, and the RQ is XSK-enabled, it requires careful handling, because convert_to_xdp_frame creates a new page and copies the data there, while our driver expects the xdp_frame to point to the same memory as the xdp_buff. Handle this case separately: map the page, and in the end unmap it and call xdp_return_frame. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27net/mlx5e: Share the XDP SQ for XDP_TX between RQsMaxim Mikityanskiy
Put the XDP SQ that is used for XDP_TX into the channel. It used to be a part of the RQ, but with introduction of AF_XDP there will be one more RQ that could share the same XDP SQ. This patch is a preparation for that change. Separate XDP_TX statistics per RQ were implemented in one of the previous patches. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27net/mlx5e: Refactor struct mlx5e_xdp_infoMaxim Mikityanskiy
Currently, struct mlx5e_xdp_info has some issues that have to be cleaned up before the upcoming AF_XDP support makes things too complicated and messy. This structure is used both when sending the packet and on completion. Moreover, the cleanup procedure on completion depends on the origin of the packet (XDP_REDIRECT, XDP_TX). Adding AF_XDP support will add new flows that use this structure even differently. To avoid overcomplicating the code, this commit refactors the usage of this structure in the following ways: 1. struct mlx5e_xdp_info is split into two different structures. One is struct mlx5e_xdp_xmit_data, a transient structure that doesn't need to be stored and is only used while sending the packet. The other is still struct mlx5e_xdp_info that is stored in a FIFO and contains the fields needed on completion. 2. The fields of struct mlx5e_xdp_info that are used in different flows are put into a union. A special enum indicates the cleanup mode and helps choose the right union member. This approach is clear and explicit. Although it could be possible to "guess" the mode by looking at the values of the fields and at the XDP SQ type, it wouldn't be that clear and extendable and would require looking through the whole chain to understand what's going on. For the reference, there are the fields of struct mlx5e_xdp_info that are used in different flows (including AF_XDP ones): Packet origin | Fields used on completion | Cleanup steps -----------------------+---------------------------+------------------ XDP_REDIRECT, | xdpf, dma_addr | DMA unmap and XDP_TX from XSK RQ | | xdp_return_frame. -----------------------+---------------------------+------------------ XDP_TX from regular RQ | di | Recycle page. -----------------------+---------------------------+------------------ AF_XDP TX | (none) | Increment the | | producer index in | | Completion Ring. On send, the same set of mlx5e_xdp_xmit_data fields is used in all flows: DMA and virtual addresses and length. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27net/mlx5e: Allow ICO SQ to be used by multiple RQsMaxim Mikityanskiy
Prepare to creation of the XSK RQ, which will require posting UMRs, too. The same ICO SQ will be used for both RQs and also to trigger interrupts by posting NOPs. UMR WQEs can't be reused any more. Optimization introduced in commit ab966d7e4ff98 ("net/mlx5e: RX, Recycle buffer of UMR WQEs") is reverted. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27net/mlx5e: Calculate linear RX frag size considering XSKMaxim Mikityanskiy
Additional conditions introduced: - XSK implies XDP. - Headroom includes the XSK headroom if it exists. - No space is reserved for struct shared_skb_info in XSK mode. - Fragment size smaller than the XSK chunk size is not allowed. A new auxiliary function mlx5e_get_linear_rq_headroom with the support for XSK is introduced. Use this function in the implementation of mlx5e_get_rq_headroom. Change headroom to u32 to match the headroom field in struct xdp_umem. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27net/mlx5e: Replace deprecated PCI_DMA_TODEVICEMaxim Mikityanskiy
The PCI API for DMA is deprecated, and PCI_DMA_TODEVICE is just defined to DMA_TO_DEVICE for backward compatibility. Just use DMA_TO_DEVICE. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27xsk: Return the whole xdp_desc from xsk_umem_consume_txMaxim Mikityanskiy
Some drivers want to access the data transmitted in order to implement acceleration features of the NICs. It is also useful in AF_XDP TX flow. Change the xsk_umem_consume_tx API to return the whole xdp_desc, that contains the data pointer, length and DMA address, instead of only the latter two. Adapt the implementation of i40e and ixgbe to this change. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27xsk: Change the default frame size to 4096 and allow controlling itMaxim Mikityanskiy
The typical XDP memory scheme is one packet per page. Change the AF_XDP frame size in libbpf to 4096, which is the page size on x86, to allow libbpf to be used with the drivers with the packet-per-page scheme. Add a command line option -f to xdpsock to allow to specify a custom frame size. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27libbpf: Support getsockopt XDP_OPTIONSMaxim Mikityanskiy
Query XDP_OPTIONS in libbpf to determine if the zero-copy mode is active or not. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27xsk: Add getsockopt XDP_OPTIONSMaxim Mikityanskiy
Make it possible for the application to determine whether the AF_XDP socket is running in zero-copy mode. To achieve this, add a new getsockopt option XDP_OPTIONS that returns flags. The only flag supported for now is the zero-copy mode indicator. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27xsk: Add API to check for available entries in FQMaxim Mikityanskiy
Add a function that checks whether the Fill Ring has the specified amount of descriptors available. It will be useful for mlx5e that wants to check in advance, whether it can allocate a bulk of RX descriptors, to get the best performance. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27net/mlx5e: Attach/detach XDP program safelyMaxim Mikityanskiy
When an XDP program is set, a full reopen of all channels happens in two cases: 1. When there was no program set, and a new one is being set. 2. When there was a program set, but it's being unset. The full reopen is necessary, because the channel parameters may change if XDP is enabled or disabled. However, it's performed in an unsafe way: if the new channels fail to open, the old ones are already closed, and the interface goes down. Use the safe way to switch channels instead. The same way is already used for other configuration changes. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27bpf: fix cgroup bpf release synchronizationRoman Gushchin
Since commit 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"), cgroup_bpf release occurs asynchronously (from a worker context), and before the release of the cgroup itself. This introduced a previously non-existing race between the release and update paths. E.g. if a leaf's cgroup_bpf is released and a new bpf program is attached to the one of ancestor cgroups at the same time. The race may result in double-free and other memory corruptions. To fix the problem, let's protect the body of cgroup_bpf_release() with cgroup_mutex, as it was effectively previously, when all this code was called from the cgroup release path with cgroup mutex held. Also let's skip cgroups, which have no chances to invoke a bpf program, on the update path. If the cgroup bpf refcnt reached 0, it means that the cgroup is offline (no attached processes), and there are no associated sockets left. It means there is no point in updating effective progs array! And it can lead to a leak, if it happens after the release. So, let's skip such cgroups. Big thanks for Tejun Heo for discovering and debugging of this problem! Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27Merge tag 'blk-dim-v2' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mamameed says: ==================== Generic DIM From: Tal Gilboa and Yamin Fridman Implement net DIM over a generic DIM library, add RDMA DIM dim.h lib exposes an implementation of the DIM algorithm for dynamically-tuned interrupt moderation for networking interfaces. We want a similar functionality for other protocols, which might need to optimize interrupts differently. Main motivation here is DIM for NVMf storage protocol. Current DIM implementation prioritizes reducing interrupt overhead over latency. Also, in order to reduce DIM's own overhead, the algorithm might take some time to identify it needs to change profiles. While this is acceptable for networking, it might not work well on other scenarios. Here we propose a new structure to DIM. The idea is to allow a slightly modified functionality without the risk of breaking Net DIM behavior for netdev. We verified there are no degradations in current DIM behavior with the modified solution. Suggested solution: - Common logic is implemented in lib/dim/dim.c - Net DIM (existing) logic is implemented in lib/dim/net_dim.c, which uses the common logic in dim.c - Any new DIM logic will be implemented in "lib/dim/new_dim.c". This new implementation will expose modified versions of profiles, dim_step() and dim_decision(). - DIM API is declared in include/linux/dim.h for all implementations. Pros for this solution are: - Zero impact on existing net_dim implementation and usage - Relatively more code reuse (compared to two separate solutions) - Increased extensibility ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27gfs2: replace more printk with calls to fs_info and friendsBob Peterson
This patch replaces a few leftover printk errors with calls to fs_info and similar, so that the file system having the error is properly logged. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-27gfs2: dump fsid when dumping glock problemsBob Peterson
Before this patch, if a glock error was encountered, the glock with the problem was dumped. But sometimes you may have lots of file systems mounted, and that doesn't tell you which file system it was for. This patch adds a new boolean parameter fsid to the dump_glock family of functions. For non-error cases, such as dumping the glocks debugfs file, the fsid is not dumped in order to keep lock dumps and glocktop as clean as possible. For all error cases, such as GLOCK_BUG_ON, the file system id is now printed. This will make it easier to debug. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-27gfs2: simplify gfs2_freeze by removing caseBob Peterson
Function gfs2_freeze had a case statement that simply checked the error code, but the break statements just made the logic hard to read. This patch simplifies the logic in favor of a simple if. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-27gfs2: Rename SDF_SHUTDOWN to SDF_WITHDRAWNBob Peterson
Before this patch, the superblock flag indicating when a file system is withdrawn was called SDF_SHUTDOWN. This patch simply renames it to the more obvious SDF_WITHDRAWN. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-27gfs2: Warn when a journal replay overwrites a rgrp with buffersBob Peterson
This patch adds some instrumentation in gfs2's journal replay that indicates when we're about to overwrite a rgrp for which we already have a valid buffer_head. When this problem occurs, it's a situation in which this node has been granted a rgrp glock and subsequently read in buffer_heads for it, and possibly even made changes to the rgrp bits and/or allocation values. But now another node has failed and forced us to replay its journal, but its journal contains a copy of the same rgrp, without a revoke, which means we're about to overwrite a rgrp that we now rightfully own, with an obsolete copy. That is always a problem. It means the other node (which failed and left its journal to be replayed) failed to flush out its rgrp buffers, write out the revoke, and invalidate its copy before it released the glock to our possession. No node should ever release a glock until its metadata has been written to the journal and revoked and invalidated.. We also kludge around the problem and refuse to replace our good copy with the journals bad copy by not marking the buffer dirty, but never do it silently. That's wallpapering over a larger problem that still exists. IOW, if this situation can happen to this node, it can also happen to a different node and we wouldn't even know it or be able to circumvent it: Suppose we have a 3-node cluster: Node 1 fails, leaving an obsolete rgrp block in its journal without a revoke. Node 2 grabs the rgrp as soon as the rgrp glock is released and starts making changes, allocating and freeing blocks from the rgrp, etc. Node 3 replays the journal from node 1, oblivious and unaware that it's about to overwrite node 2's changes. So we still need to be vocal and log the error to make it apparent that a corruption path still exists in gfs2. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-27gfs2: log which portion of the journal is replayedBob Peterson
When a journal is replayed, gfs2 logs a message similar to: jid=X: Replaying journal... This patch adds the tail and block number so that the range of the replayed block is also printed. These values will match the values shown if the journal is dumped with gfs2_edit -p journalX. The resulting output looks something like this: jid=1: Replaying journal...0x28b7 to 0x2beb This will allow us to better debug file system corruption problems. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-27gfs2: eliminate tr_num_revoke_rmBob Peterson
For its journal processing, gfs2 kept track of the number of buffers added and removed on a per-transaction basis. These values are used to calculate space needed in the journal. But while these calculations make sense for the number of buffers, they make no sense for revokes. Revokes are managed in their own list, linked from the superblock. So it's entirely unnecessary to keep separate per-transaction counts for revokes added and removed. A single count will do the same job. Therefore, this patch combines the transaction revokes into a single count. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-27gfs2: kthread and remount improvementsBob Peterson
Before this patch, gfs2 saved the pointers to the two daemon threads (logd and quotad) in the superblock, but they were never cleared, even if the threads were stopped (e.g. on remount -o ro). That meant that certain error conditions (like a withdrawn file system) could race. For example, xfstests generic/361 caused an IO error during remount -o ro, which caused the kthreads to be stopped, then the error flagged. Later, when the test unmounted the file system, it would try to stop the threads a second time with kthread_stop. This patch does two things: First, every time it stops the threads it zeroes out the thread pointer, and also checks whether it's NULL before trying to stop it. Second, in function gfs2_remount_fs, it was returning if an error was logged by either of the two functions for gfs2_make_fs_ro and _rw, which caused it to bypass the online uevent at the bottom of the function. This removes that bypass in favor of just running the whole function, then returning the error. That way, unmounts and remounts won't hang forever. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-27gfs2: Use IS_ERR_OR_NULLKefeng Wang
Use IS_ERR_OR_NULL where appropriate. (Several more places converted by Andreas.) Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-27gfs2: Clean up freeing struct gfs2_sbdAndreas Gruenbacher
Add a free_sbd function for freeing a struct gfs2_sbd. Use that for freeing a super-block descriptor, either directly or via kobject_put. Free sd_lkstats inside the kobject release function: that way, gfs2_put_super will no longer leak sd_lkstats. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-27net: dsa: qca8k: introduce reset via gpio featureChristian Lamparter
The QCA8337(N) has a RESETn signal on Pin B42 that triggers a chip reset if the line is pulled low. The datasheet says that: "The active low duration must be greater than 10 ms". This can hopefully fix some of the issues related to pin strapping in OpenWrt for the EA8500 which suffers from detection issues after a SoC reset. Please note that the qca8k_probe() function does currently require to read the chip's revision register for identification purposes. Signed-off-by: Christian Lamparter <chunkeey@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27dt-bindings: net: dsa: qca8k: document reset-gpios propertyChristian Lamparter
This patch documents the qca8k's reset-gpios property that can be used if the QCA8337N ends up in a bad state during reset. Signed-off-by: Christian Lamparter <chunkeey@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27ipv6: Convert gateway validation to use fib6_infoDavid Ahern
Gateway validation does not need a dst_entry, it only needs the fib entry to validate the gateway resolution and egress device. So, convert ip6_nh_lookup_table from ip6_pol_route to fib6_table_lookup and ip6_route_check_nh to use fib6_lookup over rt6_lookup. ip6_pol_route is a call to fib6_table_lookup and if successful a call to fib6_select_path. From there the exception cache is searched for an entry or a dst_entry is created to return to the caller. The exception entry is not relevant for gateway validation, so what matters are the calls to fib6_table_lookup and then fib6_select_path. Similarly, rt6_lookup can be replaced with a call to fib6_lookup with RT6_LOOKUP_F_IFACE set in flags. Again, the exception cache search is not relevant, only the lookup with path selection. The primary difference in the lookup paths is the use of rt6_select with fib6_lookup versus rt6_device_match with rt6_lookup. When you remove complexities in the rt6_select path, e.g., 1. saddr is not set for gateway validation, so RT6_LOOKUP_F_HAS_SADDR is not relevant 2. rt6_check_neigh is not called so that removes the RT6_NUD_FAIL_DO_RR return and round-robin logic. the code paths are believed to be equivalent for the given use case - validate the gateway and optionally given the device. Furthermore, it aligns the validation with onlink code path and the lookup path actually used for rx and tx. Adjust the users, ip6_route_check_nh_onlink and ip6_route_check_nh to handle a fib6_info vs a rt6_info when performing validation checks. Existing selftests fib-onlink-tests.sh and fib_tests.sh are used to verify the changes. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27Merge branch 'FDB-VLAN-and-PTP-fixes-for-SJA1105-DSA'David S. Miller
Vladimir Oltean says: ==================== FDB, VLAN and PTP fixes for SJA1105 DSA This patchset is an assortment of fixes for the net-next version of the sja1105 DSA driver: - Avoid a kernel panic when the driver fails to probe or unregisters - Finish Arnd Bermann's idea of compiling PTP support as part of the main DSA driver and not separately - Better handling of initial port-based VLAN as well as VLANs for dsa_8021q FDB entries - Fix address learning for the SJA1105 P/Q/R/S family - Make static FDB entries persistent across switch resets - Fix reporting of statically-added FDB entries in 'bridge fdb show' ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27net: dsa: sja1105: Implement is_static for FDB entries on E/TVladimir Oltean
The first generation switches don't tell us through the dynamic config interface whether the dumped FDB entries are static or not (the LOCKEDS bit from P/Q/R/S). However, now that we're keeping a mirror of all 'bridge fdb' commands in the static config, this is an opportunity to compare a dumped FDB entry to the driver's private database. After all, what makes an entry static is that *we* added it. Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27net: dsa: sja1105: Use correct dsa_8021q VIDs for FDB commandsVladimir Oltean
A FDB entry means that "frames that match this VID and DMAC must be forwarded to this port". In the case of dsa_8021q however, the VID is not a single one (and neither two, as my previous patch assumed). The VID can be set either by the CPU port (1 tx_vid), or by any of the other front-panel port (n-1 rx_vid's). Fixes: 93647594d8f5 ("net: dsa: sja1105: Hide the dsa_8021q VLANs from the bridge fdb command") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27net: dsa: sja1105: Populate is_static for FDB entries on P/Q/R/SVladimir Oltean
The reason why this wasn't tackled earlier is that I had hoped I understood the user manual wrong. But unfortunately hacks are required in order to retrieve the static/dynamic nature of FDB entries on SJA1105 P/Q/R/S, since this info is stored in the writeback buffer of the dynamic config command. Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27net: dsa: sja1105: Add a high-level overview of the dynamic config interfaceVladimir Oltean
When trying to add support for LOCKEDS (static FDB entries) on SJA1105 P/Q/R/S, at first I didn't remember how the abstraction I created worked, and actually thought it works by mistake. To avoid other people staring at the code and not making much sense out of it, add some comments at the top of the file. Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27net: dsa: sja1105: Back up static FDB entries in kernel memoryVladimir Oltean
After commit 8456721dd4ec ("net: dsa: sja1105: Add support for configuring address ageing time"), we started to reset the switch rather often (each time the bridge core changes the ageing time on a switch port). The unfortunate reality is that SJA1105 doesn't have any {cold, warm, whatever} reset mode in which it accepts a new configuration stream without flushing the FDB. Instead, in its world, the FDB *is* an optional part of the static configuration. So we play its game, and do what we also do for VLANs: for each 'bridge fdb' command, we add the FDB entry through the dynamic interface, and we append the in-kernel static config memory with info that we're going to use later, when the next reset command is going to be issued. The result is that 'bridge fdb' commands are now persistent (dynamically learned entries are lost, but that's ok). Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27net: dsa: sja1105: Make P/Q/R/S learn MAC addressesVladimir Oltean
At the end of the commit 1da73821343c ("net: dsa: sja1105: Add FDB operations for P/Q/R/S series") message, I said that: At the moment only FDB entries installed statically through 'bridge fdb' are visible in the dump callback - the dynamically learned ones are still under investigation. It looks like the reason why they were not visible in 'bridge fdb' was that they were never learned - always flooded. SJA1105 P/Q/R/S manual says about the MAXADDRP[port] field: Specify the maximum number of MAC address dynamically learned from the respective port. It is used to limit the number of learned MAC addresses per port. It looks like not providing a value in the static config (aka providing zeroes) is enough for it to not store the learned addresses in the FDB. For now we divide the 1024 entry FDB "equally" amongst the 5 ports. This may be revisited if the situation calls for that - for now I'm happy that learning works. Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27net: dsa: sja1105: Actually implement the P/Q/R/S FDB bitsVladimir Oltean
In commit 1da73821343c ("net: dsa: sja1105: Add FDB operations for P/Q/R/S series"), these bits were set in the static config, but apparently they did not do anything. The reason is that the packing accessors for them were part of a patch I forgot to send. Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27net: dsa: sja1105: Make vid 1 the default pvidVladimir Oltean
In SJA1105 there is no concept of 'default values' per se, everything needs to be driver-supplied through the static configuration tables. The issue is that the hardware manual says that 'at least the default untagging VLAN' is mandatory to be provided through the static config. But VLAN 0 isn't a very good initial pvid - its use is reserved for priority-tagged frames, and the layers of the stack that care about those already make sure that this VLAN is installed, as can be seen in the message below: 8021q: adding VLAN 0 to HW filter on device swp2 So change the pvid provided through the static configuration to 1, which matches the bridge core's defaults. Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27net: dsa: sja1105: Cancel PTP delayed work on unregisterVladimir Oltean
Currently when the driver unloads and PTP is enabled, the delayed work that prevents the timecounter from expiring becomes a ticking time bomb. The kernel will schedule the work thread within 60 seconds of driver removal, but the work handler is no longer there, leading to this strange and inconclusive stack trace: [ 64.473112] Unable to handle kernel paging request at virtual address 79746970 [ 64.480340] pgd = 008c4af9 [ 64.483042] [79746970] *pgd=00000000 [ 64.486620] Internal error: Oops: 80000005 [#1] SMP ARM [ 64.491820] Modules linked in: [ 64.494871] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.2.0-rc5-01634-ge3a2773ba9e5 #1246 [ 64.503007] Hardware name: Freescale LS1021A [ 64.507259] PC is at 0x79746970 [ 64.510393] LR is at call_timer_fn+0x3c/0x18c [ 64.514729] pc : [<79746970>] lr : [<c03bd734>] psr: 60010113 [ 64.520965] sp : c1901de0 ip : 00000000 fp : c1903080 [ 64.526163] r10: c1901e38 r9 : ffffe000 r8 : c19064ac [ 64.531363] r7 : 79746972 r6 : e98dd260 r5 : 00000100 r4 : c1a9e4a0 [ 64.537859] r3 : c1900000 r2 : ffffa400 r1 : 79746972 r0 : e98dd260 [ 64.544359] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none [ 64.551460] Control: 10c5387d Table: a8a2806a DAC: 00000051 [ 64.557176] Process swapper/0 (pid: 0, stack limit = 0x1ddb27f0) [ 64.563147] Stack: (0xc1901de0 to 0xc1902000) [ 64.567481] 1de0: eb6a4918 3d60d7c3 c1a9e554 e98dd260 eb6a34c0 c1a9e4a0 ffffa400 c19064ac [ 64.575616] 1e00: ffffe000 c03bd95c c1901e34 c1901e34 eb6a34c0 c1901e30 c1903d00 c186f4c0 [ 64.583751] 1e20: c1906488 29e34000 c1903080 c03bdca4 00000000 eaa6f218 00000000 eb6a45c0 [ 64.591886] 1e40: eb6a45c0 20010193 00000003 c03c0a68 20010193 3f7231be c1903084 00000002 [ 64.600022] 1e60: 00000082 00000001 ffffe000 c1a9e0a4 00000100 c0302298 02b64722 0000000f [ 64.608157] 1e80: c186b3c8 c1877540 c19064ac 0000000a c186b350 ffffa401 c1903d00 c1107348 [ 64.616292] 1ea0: 00200102 c0d87a14 ea823c00 ffffe000 00000012 00000000 00000000 ea810800 [ 64.624427] 1ec0: f0803000 c1876ba8 00000000 c034c784 c18774b8 c039fb50 c1906c90 c1978aac [ 64.632562] 1ee0: f080200c f0802000 c1901f10 c0709ca8 c03091a0 60010013 ffffffff c1901f44 [ 64.640697] 1f00: 00000000 c1900000 c1876ba8 c0301a8c 00000000 000070a0 eb6ac1a0 c031da60 [ 64.648832] 1f20: ffffe000 c19064ac c19064f0 00000001 00000000 c1906488 c1876ba8 00000000 [ 64.656967] 1f40: ffffffff c1901f60 c030919c c03091a0 60010013 ffffffff 00000051 00000000 [ 64.665102] 1f60: ffffe000 c0376aa4 c1a9da37 ffffffff 00000037 3f7231be c1ab20c0 000000cc [ 64.673238] 1f80: c1906488 c1906480 ffffffff 00000037 c1ab20c0 c1ab20c0 00000001 c0376e1c [ 64.681373] 1fa0: c1ab2118 c1700ea8 ffffffff ffffffff 00000000 c1700754 c17dfa40 ebfffd80 [ 64.689509] 1fc0: 00000000 c17dfa40 3f7733be 00000000 00000000 c1700330 00000051 10c0387d [ 64.697644] 1fe0: 00000000 8f000000 410fc075 10c5387d 00000000 00000000 00000000 00000000 [ 64.705788] [<c03bd734>] (call_timer_fn) from [<c03bd95c>] (expire_timers+0xd8/0x144) [ 64.713579] [<c03bd95c>] (expire_timers) from [<c03bdca4>] (run_timer_softirq+0xe4/0x1dc) [ 64.721716] [<c03bdca4>] (run_timer_softirq) from [<c0302298>] (__do_softirq+0x130/0x3c8) [ 64.729854] [<c0302298>] (__do_softirq) from [<c034c784>] (irq_exit+0xbc/0xd8) [ 64.737040] [<c034c784>] (irq_exit) from [<c039fb50>] (__handle_domain_irq+0x60/0xb4) [ 64.744833] [<c039fb50>] (__handle_domain_irq) from [<c0709ca8>] (gic_handle_irq+0x58/0x9c) [ 64.753143] [<c0709ca8>] (gic_handle_irq) from [<c0301a8c>] (__irq_svc+0x6c/0x90) [ 64.760583] Exception stack(0xc1901f10 to 0xc1901f58) [ 64.765605] 1f00: 00000000 000070a0 eb6ac1a0 c031da60 [ 64.773740] 1f20: ffffe000 c19064ac c19064f0 00000001 00000000 c1906488 c1876ba8 00000000 [ 64.781873] 1f40: ffffffff c1901f60 c030919c c03091a0 60010013 ffffffff [ 64.788456] [<c0301a8c>] (__irq_svc) from [<c03091a0>] (arch_cpu_idle+0x38/0x3c) [ 64.795816] [<c03091a0>] (arch_cpu_idle) from [<c0376aa4>] (do_idle+0x1bc/0x298) [ 64.803175] [<c0376aa4>] (do_idle) from [<c0376e1c>] (cpu_startup_entry+0x18/0x1c) [ 64.810707] [<c0376e1c>] (cpu_startup_entry) from [<c1700ea8>] (start_kernel+0x480/0x4ac) [ 64.818839] Code: bad PC value [ 64.821890] ---[ end trace e226ed97b1c584cd ]--- [ 64.826482] Kernel panic - not syncing: Fatal exception in interrupt [ 64.832807] CPU1: stopping [ 64.835501] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G D 5.2.0-rc5-01634-ge3a2773ba9e5 #1246 [ 64.845013] Hardware name: Freescale LS1021A [ 64.849266] [<c0312394>] (unwind_backtrace) from [<c030cc74>] (show_stack+0x10/0x14) [ 64.856972] [<c030cc74>] (show_stack) from [<c0ff4138>] (dump_stack+0xb4/0xc8) [ 64.864159] [<c0ff4138>] (dump_stack) from [<c0310854>] (handle_IPI+0x3bc/0x3dc) [ 64.871519] [<c0310854>] (handle_IPI) from [<c0709ce8>] (gic_handle_irq+0x98/0x9c) [ 64.879050] [<c0709ce8>] (gic_handle_irq) from [<c0301a8c>] (__irq_svc+0x6c/0x90) [ 64.886489] Exception stack(0xea8cbf60 to 0xea8cbfa8) [ 64.891514] bf60: 00000000 0000307c eb6c11a0 c031da60 ffffe000 c19064ac c19064f0 00000002 [ 64.899649] bf80: 00000000 c1906488 c1876ba8 00000000 00000000 ea8cbfb0 c030919c c03091a0 [ 64.907780] bfa0: 600d0013 ffffffff [ 64.911250] [<c0301a8c>] (__irq_svc) from [<c03091a0>] (arch_cpu_idle+0x38/0x3c) [ 64.918609] [<c03091a0>] (arch_cpu_idle) from [<c0376aa4>] (do_idle+0x1bc/0x298) [ 64.925967] [<c0376aa4>] (do_idle) from [<c0376e1c>] (cpu_startup_entry+0x18/0x1c) [ 64.933496] [<c0376e1c>] (cpu_startup_entry) from [<803025cc>] (0x803025cc) [ 64.940422] Rebooting in 3 seconds.. In this case, what happened is that the DSA driver failed to probe at boot time due to a PHY issue during phylink_connect_phy: [ 2.245607] fsl-gianfar soc:ethernet@2d90000 eth2: error -19 setting up slave phy [ 2.258051] sja1105 spi0.1: failed to create slave for port 0.0 Fixes: bb77f36ac21d ("net: dsa: sja1105: Add support for the PTP clock") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27net: dsa: sja1105: Build PTP support in main DSA driverVladimir Oltean
As Arnd Bergmann pointed out in commit 78fe8a28fb96 ("net: dsa: sja1105: fix ptp link error"), there is no point in having PTP support as a separate loadable kernel module. So remove the exported symbols and make sja1105.ko contain PTP support or not based on CONFIG_NET_DSA_SJA1105_PTP. Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27Merge branch 'net-dsa-microchip-Convert-to-regmap'David S. Miller
Marek Vasut says: ==================== net: dsa: microchip: Convert to regmap This patchset converts KSZ9477 switch driver to regmap. This was tested with extra patches on KSZ8795. This was also tested on KSZ9477 on Microchip KSZ9477EVB board, which I now have. ==================== Signed-off-by: Marek Vasut <marex@denx.de>
2019-06-27net: dsa: microchip: Replace ad-hoc bit manipulation with regmapMarek Vasut
Regmap provides bit manipulation functions to set/clear bits, use those insted of reimplementing them. Signed-off-by: Marek Vasut <marex@denx.de> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Tristram Ha <Tristram.Ha@microchip.com> Cc: Woojung Huh <Woojung.Huh@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>