summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2022-02-20net: tcp: add skb drop reasons to tcp_add_backlog()Menglong Dong
Pass the address of drop_reason to tcp_add_backlog() to store the reasons for skb drops when fails. Following drop reasons are introduced: SKB_DROP_REASON_SOCKET_BACKLOG Reviewed-by: Mengen Sun <mengensun@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-20net: tcp: add skb drop reasons to tcp_v{4,6}_inbound_md5_hash()Menglong Dong
Pass the address of drop reason to tcp_v4_inbound_md5_hash() and tcp_v6_inbound_md5_hash() to store the reasons for skb drops when this function fails. Therefore, the drop reason can be passed to kfree_skb_reason() when the skb needs to be freed. Following drop reasons are added: SKB_DROP_REASON_TCP_MD5NOTFOUND SKB_DROP_REASON_TCP_MD5UNEXPECTED SKB_DROP_REASON_TCP_MD5FAILURE SKB_DROP_REASON_TCP_MD5* above correspond to LINUX_MIB_TCPMD5* Reviewed-by: Mengen Sun <mengensun@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-19drm/i915/uapi: document behaviour for DG2 64K supportMatthew Auld
On discrete platforms like DG2, we need to support a minimum page size of 64K when dealing with device local-memory. This is quite tricky for various reasons, so try to document the new implicit uapi for this. v4: Kdoc modification. v3: fix typos and less emphasis v2: Fixed suggestions on formatting [Daniel] Signed-off-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Ramalingam C <ramalingam.c@intel.com> Signed-off-by: Robert Beckett <bob.beckett@collabora.com> Acked-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Ramalingam C <ramalingam.c@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> cc: Simon Ser <contact@emersion.fr> cc: Pekka Paalanen <ppaalanen@gmail.com> Cc: Jordan Justen <jordan.l.justen@intel.com> Cc: Kenneth Graunke <kenneth@whitecape.org> Cc: mesa-dev@lists.freedesktop.org Cc: Tony Ye <tony.ye@intel.com> Cc: Slawomir Milczarek <slawomir.milczarek@intel.com> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220218184752.7524-13-ramalingam.c@intel.com
2022-02-19iosys-map: Add a few more helpersLucas De Marchi
First the simplest ones: - iosys_map_memset(): when abstracting system and I/O memory, just like the memcpy() use case, memset() also has dedicated functions to be called for using IO memory. - iosys_map_memcpy_from(): we may need to copy data from I/O memory, not only to. In certain situations it's useful to be able to read or write to an offset that is calculated by having the memory layout given by a struct declaration. Usually we are going to read/write a u8, u16, u32 or u64. As a pre-requisite for the implementation, add iosys_map_memcpy_from() to be the equivalent of iosys_map_memcpy_to(), but in the other direction. Then add 2 pairs of macros: - iosys_map_rd() / iosys_map_wr() - iosys_map_rd_field() / iosys_map_wr_field() The first pair takes the C-type and offset to read/write. The second pair uses a struct describing the layout of the mapping in order to calculate the offset and size being read/written. We could use readb, readw, readl, readq and the write* counterparts, however due to alignment issues this may not work on all architectures. If alignment needs to be checked to call the right function, it's not possible to decide at compile-time which function to call: so just leave the decision to the memcpy function that will do exactly that. Finally, in order to use the above macros with a map derived from another, add another initializer: IOSYS_MAP_INIT_OFFSET(). v2: - Rework IOSYS_MAP_INIT_OFFSET() so it doesn't rely on aliasing rules within the union - Add offset to both iosys_map_rd_field() and iosys_map_wr_field() to allow the struct itself to be at an offset from the mapping - Add documentation to iosys_map_rd_field() with example and expected memory layout v3: - Drop kernel.h include as it's not needed anymore Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Christian König <christian.koenig@amd.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: dri-devel@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20220216174147.3073235-3-lucas.demarchi@intel.com
2022-02-19iosys-map: Add offset to iosys_map_memcpy_to()Lucas De Marchi
In certain situations it's useful to be able to write to an offset of the mapping. Add a dst_offset to iosys_map_memcpy_to(). Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Christian König <christian.koenig@amd.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: dri-devel@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20220216174147.3073235-2-lucas.demarchi@intel.com
2022-02-20netfilter: nf_tables_offload: incorrect flow offload action array sizePablo Neira Ayuso
immediate verdict expression needs to allocate one slot in the flow offload action array, however, immediate data expression does not need to do so. fwd and dup expression need to allocate one slot, this is missing. Add a new offload_action interface to report if this expression needs to allocate one slot in the flow offload action array. Fixes: be2861dc36d7 ("netfilter: nft_{fwd,dup}_netdev: add offload support") Reported-and-tested-by: Nick Gregory <Nick.Gregory@Sophos.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-20ata: add/use ata_taskfile::{error|status} fieldsSergey Shtylyov
Add the explicit error and status register fields to 'struct ata_taskfile' using the anonymous *union*s ('struct ide_taskfile' had that for ages!) and update the libata taskfile code accordingly. There should be no object code changes resulting from that... Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2022-02-19scsi: libsas: Add sas_abort_task()John Garry
Add a generic implementation of abort task TMF handler, and use in LLDDs. With that, some LLDDs custom TMF functions can now be deleted. Link: https://lore.kernel.org/r/1645112566-115804-18-git-send-email-john.garry@huawei.com Tested-by: Yihang Li <liyihang6@hisilicon.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19scsi: libsas: Add sas_query_task()John Garry
Add a generic implementation of query task TMF handler, and use in LLDDs. Link: https://lore.kernel.org/r/1645112566-115804-17-git-send-email-john.garry@huawei.com Tested-by: Yihang Li <liyihang6@hisilicon.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19scsi: libsas: Add sas_lu_reset()John Garry
Add a generic implementation of LU reset TMF handler, and use in LLDDs. Link: https://lore.kernel.org/r/1645112566-115804-16-git-send-email-john.garry@huawei.com Tested-by: Yihang Li <liyihang6@hisilicon.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19scsi: libsas: Add sas_clear_task_set()John Garry
Add a generic implementation of clear task set TMF handler, and use in LLDDs. Link: https://lore.kernel.org/r/1645112566-115804-15-git-send-email-john.garry@huawei.com Tested-by: Yihang Li <liyihang6@hisilicon.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19scsi: libsas: Add sas_abort_task_set()John Garry
Add a generic implementation of abort task set TMF handler, and use in LLDDs. Link: https://lore.kernel.org/r/1645112566-115804-14-git-send-email-john.garry@huawei.com Tested-by: Yihang Li <liyihang6@hisilicon.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19scsi: libsas: Add TMF handler aborted callbackJohn Garry
The hisi_sas and pm8001 TMF handlers have some special processing for when the TMF is aborted, so add a callback and fill it in for those drivers. Link: https://lore.kernel.org/r/1645112566-115804-13-git-send-email-john.garry@huawei.com Tested-by: Yihang Li <liyihang6@hisilicon.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19scsi: libsas: Add TMF handler exec complete callbackJohn Garry
The pm8001 TMF handler has some special processing when the TMF completes, so add a callback and fill it in for the pm8001 driver. Link: https://lore.kernel.org/r/1645112566-115804-12-git-send-email-john.garry@huawei.com Tested-by: Yihang Li <liyihang6@hisilicon.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19scsi: libsas: Add sas_task.tmfJohn Garry
Add a pointer to a sas_tmf_task to the sas_task struct, as this will be used when the common LLDD TMF code is factored out. Also set it for the LLDDs to store per-sas_task TMF info. Link: https://lore.kernel.org/r/1645112566-115804-9-git-send-email-john.garry@huawei.com Tested-by: Yihang Li <liyihang6@hisilicon.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19scsi: libsas: Add struct sas_tmf_taskJohn Garry
Some of the LLDDs which use libsas have their own definition of a struct to hold TMF info, so add a common struct for libsas. Also add an interim force phy id field for hisi_sas driver, which will be removed once the STP "TMF" code is factored out. Even though some LLDDs (pm8001) use a u32 for the tag, u16 will be adequate, as that named driver only uses tags in range [0, 1024). Link: https://lore.kernel.org/r/1645112566-115804-8-git-send-email-john.garry@huawei.com Tested-by: Yihang Li <liyihang6@hisilicon.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19scsi: libsas: Delete SAS_SG_ERRJohn Garry
No LLDD sets exec status as SAS_SG_ERR, so remove support. Link: https://lore.kernel.org/r/1645112566-115804-5-git-send-email-john.garry@huawei.com Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19scsi: libsas: Delete lldd_clear_aca callbackJohn Garry
This callback is never called, so remove support. Link: https://lore.kernel.org/r/1645112566-115804-4-git-send-email-john.garry@huawei.com Tested-by: Yihang Li <liyihang6@hisilicon.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Xiang Chen <chenxiang66@hisilicon.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19scsi: libsas: Use enum for response frame DATAPRES fieldJohn Garry
As defined in table 126 of the SAS spec 1.1, use an enum for the DATAPRES field, which makes reading the code easier. Also change sas_ssp_task_response() to use a switch statement, which is more suitable (than if-else), as suggested by Christoph. Link: https://lore.kernel.org/r/1645112566-115804-3-git-send-email-john.garry@huawei.com Suggested-by: Xiang Chen <chenxiang66@hisilicon.com> Tested-by: Yihang Li <liyihang6@hisilicon.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19net: phylink: remove phylink_config's pcs_pollRussell King (Oracle)
phylink_config's pcs_poll is no longer used, let's get rid of it. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-19net: dsa: remove pcs_pollRussell King (Oracle)
With drivers converted over to using phylink PCS, there is no need for the struct dsa_switch member "pcs_poll" to exist anymore - there is a flag in the struct phylink_pcs which indicates whether this PCS needs to be polled which supersedes this. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-19net: Force inlining of checksum functions in net/checksum.hChristophe Leroy
All functions defined as static inline in net/checksum.h are meant to be inlined for performance reason. But since commit ac7c3e4ff401 ("compiler: enable CONFIG_OPTIMIZE_INLINING forcibly") the compiler is allowed to uninline functions when it wants. Fair enough in the general case, but for tiny performance critical checksum helpers that's counter-productive. The problem mainly arises when selecting CONFIG_CC_OPTIMISE_FOR_SIZE, Those helpers being 'static inline' in header files you suddenly find them duplicated many times in the resulting vmlinux. Here is a typical exemple when building powerpc pmac32_defconfig with CONFIG_CC_OPTIMISE_FOR_SIZE. csum_sub() appears 4 times: c04a23cc <csum_sub>: c04a23cc: 7c 84 20 f8 not r4,r4 c04a23d0: 7c 63 20 14 addc r3,r3,r4 c04a23d4: 7c 63 01 94 addze r3,r3 c04a23d8: 4e 80 00 20 blr ... c04a2ce8: 4b ff f6 e5 bl c04a23cc <csum_sub> ... c04a2d2c: 4b ff f6 a1 bl c04a23cc <csum_sub> ... c04a2d54: 4b ff f6 79 bl c04a23cc <csum_sub> ... c04a754c <csum_sub>: c04a754c: 7c 84 20 f8 not r4,r4 c04a7550: 7c 63 20 14 addc r3,r3,r4 c04a7554: 7c 63 01 94 addze r3,r3 c04a7558: 4e 80 00 20 blr ... c04ac930: 4b ff ac 1d bl c04a754c <csum_sub> ... c04ad264: 4b ff a2 e9 bl c04a754c <csum_sub> ... c04e3b08 <csum_sub>: c04e3b08: 7c 84 20 f8 not r4,r4 c04e3b0c: 7c 63 20 14 addc r3,r3,r4 c04e3b10: 7c 63 01 94 addze r3,r3 c04e3b14: 4e 80 00 20 blr ... c04e5788: 4b ff e3 81 bl c04e3b08 <csum_sub> ... c04e65c8: 4b ff d5 41 bl c04e3b08 <csum_sub> ... c0512d34 <csum_sub>: c0512d34: 7c 84 20 f8 not r4,r4 c0512d38: 7c 63 20 14 addc r3,r3,r4 c0512d3c: 7c 63 01 94 addze r3,r3 c0512d40: 4e 80 00 20 blr ... c0512dfc: 4b ff ff 39 bl c0512d34 <csum_sub> ... c05138bc: 4b ff f4 79 bl c0512d34 <csum_sub> ... Restore the expected behaviour by using __always_inline for all functions defined in net/checksum.h vmlinux size is even reduced by 256 bytes with this patch: text data bss dec hex filename 6980022 2515362 194384 9689768 93daa8 vmlinux.before 6979862 2515266 194384 9689512 93d9a8 vmlinux.now Fixes: ac7c3e4ff401 ("compiler: enable CONFIG_OPTIMIZE_INLINING forcibly") Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-19net: ip6mr: add support for passing full packet on wrong mifMobashshera Rasool
This patch adds support for MRT6MSG_WRMIFWHOLE which is used to pass full packet and real vif id when the incoming interface is wrong. While the RP and FHR are setting up state we need to be sending the registers encapsulated with all the data inside otherwise we lose it. The RP then decapsulates it and forwards it to the interested parties. Currently with WRONGMIF we can only be sending empty register packets and will lose that data. This behaviour can be enabled by using MRT_PIM with val == MRT6MSG_WRMIFWHOLE. This doesn't prevent MRT6MSG_WRONGMIF from happening, it happens in addition to it, also it is controlled by the same throttling parameters as WRONGMIF (i.e. 1 packet per 3 seconds currently). Both messages are generated to keep backwards compatibily and avoid breaking someone who was enabling MRT_PIM with val == 4, since any positive val is accepted and treated the same. Signed-off-by: Mobashshera Rasool <mobash.rasool.linux@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-19sched/preempt: Add PREEMPT_DYNAMIC using static keysMark Rutland
Where an architecture selects HAVE_STATIC_CALL but not HAVE_STATIC_CALL_INLINE, each static call has an out-of-line trampoline which will either branch to a callee or return to the caller. On such architectures, a number of constraints can conspire to make those trampolines more complicated and potentially less useful than we'd like. For example: * Hardware and software control flow integrity schemes can require the addition of "landing pad" instructions (e.g. `BTI` for arm64), which will also be present at the "real" callee. * Limited branch ranges can require that trampolines generate or load an address into a register and perform an indirect branch (or at least have a slow path that does so). This loses some of the benefits of having a direct branch. * Interaction with SW CFI schemes can be complicated and fragile, e.g. requiring that we can recognise idiomatic codegen and remove indirections understand, at least until clang proves more helpful mechanisms for dealing with this. For PREEMPT_DYNAMIC, we don't need the full power of static calls, as we really only need to enable/disable specific preemption functions. We can achieve the same effect without a number of the pain points above by using static keys to fold early returns into the preemption functions themselves rather than in an out-of-line trampoline, effectively inlining the trampoline into the start of the function. For arm64, this results in good code generation. For example, the dynamic_cond_resched() wrapper looks as follows when enabled. When disabled, the first `B` is replaced with a `NOP`, resulting in an early return. | <dynamic_cond_resched>: | bti c | b <dynamic_cond_resched+0x10> // or `nop` | mov w0, #0x0 | ret | mrs x0, sp_el0 | ldr x0, [x0, #8] | cbnz x0, <dynamic_cond_resched+0x8> | paciasp | stp x29, x30, [sp, #-16]! | mov x29, sp | bl <preempt_schedule_common> | mov w0, #0x1 | ldp x29, x30, [sp], #16 | autiasp | ret ... compared to the regular form of the function: | <__cond_resched>: | bti c | mrs x0, sp_el0 | ldr x1, [x0, #8] | cbz x1, <__cond_resched+0x18> | mov w0, #0x0 | ret | paciasp | stp x29, x30, [sp, #-16]! | mov x29, sp | bl <preempt_schedule_common> | mov w0, #0x1 | ldp x29, x30, [sp], #16 | autiasp | ret Any architecture which implements static keys should be able to use this to implement PREEMPT_DYNAMIC with similar cost to non-inlined static calls. Since this is likely to have greater overhead than (inlined) static calls, PREEMPT_DYNAMIC is only defaulted to enabled when HAVE_PREEMPT_DYNAMIC_CALL is selected. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20220214165216.2231574-6-mark.rutland@arm.com
2022-02-19sched/preempt: Simplify irqentry_exit_cond_resched() callersMark Rutland
Currently callers of irqentry_exit_cond_resched() need to be aware of whether the function should be indirected via a static call, leading to ugly ifdeffery in callers. Save them the hassle with a static inline wrapper that does the right thing. The raw_irqentry_exit_cond_resched() will also be useful in subsequent patches which will add conditional wrappers for preemption functions. Note: in arch/x86/entry/common.c, xen_pv_evtchn_do_upcall() always calls irqentry_exit_cond_resched() directly, even when PREEMPT_DYNAMIC is in use. I believe this is a latent bug (which this patch corrects), but I'm not entirely certain this wasn't deliberate. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20220214165216.2231574-4-mark.rutland@arm.com
2022-02-19sched/preempt: Refactor sched_dynamic_update()Mark Rutland
Currently sched_dynamic_update needs to open-code the enabled/disabled function names for each preemption model it supports, when in practice this is a boolean enabled/disabled state for each function. Make this clearer and avoid repetition by defining the enabled/disabled states at the function definition, and using helper macros to perform the static_call_update(). Where x86 currently overrides the enabled function, it is made to provide both the enabled and disabled states for consistency, with defaults provided by the core code otherwise. In subsequent patches this will allow us to support PREEMPT_DYNAMIC without static calls. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20220214165216.2231574-3-mark.rutland@arm.com
2022-02-19sched: Fix yet more sched_fork() racesPeter Zijlstra
Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") fixed a fork race vs cgroup, it opened up a race vs syscalls by not placing the task on the runqueue before it gets exposed through the pidhash. Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is trying to fix a single instance of this, instead fix the whole class of issues, effectively reverting this commit. Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org> Tested-by: Zhang Qiao <zhangqiao22@huawei.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net
2022-02-18mctp: replace mctp_address_ok with more fine-grained helpersJeremy Kerr
Currently, we have mctp_address_ok(), which checks if an EID is in the "valid" range of 8-254 inclusive. However, 0 and 255 may also be valid addresses, depending on context. 0 is the NULL EID, which may be set when physical addressing is used. 255 is valid as a destination address for broadcasts. This change renames mctp_address_ok to mctp_address_unicast, and adds similar helpers for broadcast and null EIDs, which will be used in an upcoming commit. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-18net: Add new protocol attribute to IP addressesJacques de Laval
This patch adds a new protocol attribute to IPv4 and IPv6 addresses. Inspiration was taken from the protocol attribute of routes. User space applications like iproute2 can set/get the protocol with the Netlink API. The attribute is stored as an 8-bit unsigned integer. The protocol attribute is set by kernel for these categories: - IPv4 and IPv6 loopback addresses - IPv6 addresses generated from router announcements - IPv6 link local addresses User space may pass custom protocols, not defined by the kernel. Grouping addresses on their origin is useful in scenarios where you want to distinguish between addresses based on who added them, e.g. kernel vs. user space. Tagging addresses with a string label is an existing feature that could be used as a solution. Unfortunately the max length of a label is 15 characters, and for compatibility reasons the label must be prefixed with the name of the device followed by a colon. Since device names also have a max length of 15 characters, only -1 characters is guaranteed to be available for any origin tag, which is not that much. A reference implementation of user space setting and getting protocols is available for iproute2: https://github.com/westermo/iproute2/commit/9a6ea18bd79f47f293e5edc7780f315ea42ff540 Signed-off-by: Jacques de Laval <Jacques.De.Laval@westermo.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220217150202.80802-1-Jacques.De.Laval@westermo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-19ata: libata: make ata_host_suspend() *void*Sergey Shtylyov
ata_host_suspend() always returns 0, so the result checks in many drivers look pointless. Let's make this function return *void* instead of *int*. Found by Linux Verification Center (linuxtesting.org) with the SVACE static analysis tool. Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Acked-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2022-02-18soc: fsl: Replace kernel.h with the necessary inclusionsAndy Shevchenko
When kernel.h is used in the headers it adds a lot into dependency hell, especially when there are circular dependencies are involved. Replace kernel.h inclusion with the list of what is really being used. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Li Yang <leoyang.li@nxp.com>
2022-02-18Merge tag 'block-5.17-2022-02-17' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Surprise removal fix (Christoph) - Ensure that pages are zeroed before submitted for userspace IO (Haimin) - Fix blk-wbt accounting issue with BFQ (Laibin) - Use bsize for discard granularity in loop (Ming) - Fix missing zone handling in blk_complete_request() (Pankaj) * tag 'block-5.17-2022-02-17' of git://git.kernel.dk/linux-block: block/wbt: fix negative inflight counter when remove scsi device block: fix surprise removal for drivers calling blk_set_queue_dying block-map: add __GFP_ZERO flag for alloc_page in function bio_copy_kern block: loop:use kstatfs.f_bsize of backing file to set discard granularity block: Add handling for zone append command in blk_complete_request
2022-02-18Merge tag 'mtd/spi-mem-ecc-for-5.18' into mtd/nextMiquel Raynal
Topic branch bringing-in changes related to the support of ECC engines that can be used by SPI controllers to manage SPI NANDs as well as possibly by parallel NAND controllers. In particular, it brings support for Macronix ECC engine that can be used with Macronix SPI controller. The changes touch the NAND core, the NAND ECC core, the spi-mem layer, a SPI controller driver and add a new NAND ECC driver, as well as a number of binding updates. Binding changes: * Vendor prefixes: Clarify Macronix prefix * SPI NAND: Convert spi-nand description file to yaml * Raw NAND chip: Create a NAND chip description * Raw NAND controller: - Harmonize the property types - Fix a comment in the examples - Fix the reg property description * Describe Macronix NAND ECC engine * Macronix SPI controller: - Document the nand-ecc-engine property - Convert to yaml - The interrupt property is not mandatory NAND core changes: * ECC: - Add infrastructure to support hardware engines - Add a new helper to retrieve the ECC context - Provide a helper to retrieve a pilelined engine device NAND-ECC changes: * Macronix ECC engine: - Add Macronix external ECC engine support - Support SPI pipelined mode SPI-NAND core changes: * Delay a little bit the dirmap creation * Create direct mapping descriptors for ECC operations SPI-NAND driver changes: * macronix: Use random program load SPI changes: * Macronix SPI controller: - Fix the transmit path - Create a helper to configure the controller before an operation - Create a helper to ease the start of an operation - Add support for direct mapping - Add support for pipelined ECC operations * spi-mem: - Introduce a capability structure - Check the controller extra capabilities - cadence-quadspi/mxic: Provide capability structures - Kill the spi_mem_dtr_supports_op() helper - Add an ecc parameter to the spi_mem_op structure
2022-02-18hv_utils: Add comment about max VMbus packet size in VSS driverMichael Kelley
The VSS driver allocates a VMbus receive buffer significantly larger than sizeof(hv_vss_msg), with no explanation. To help prevent future mistakes, add a #define and comment about why this is done. No functional change. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/1644423070-75125-1-git-send-email-mikelley@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-02-18net: dsa: add support for phylink mac_select_pcs()Russell King (Oracle)
Add DSA support for the phylink mac_select_pcs() method so DSA drivers can return provide phylink with the appropriate PCS for the PHY interface mode. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-18net-timestamp: convert sk->sk_tskey to atomic_tEric Dumazet
UDP sendmsg() can be lockless, this is causing all kinds of data races. This patch converts sk->sk_tskey to remove one of these races. BUG: KCSAN: data-race in __ip_append_data / __ip_append_data read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1: __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0: __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000054d -> 0x0000054e Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85fa6f-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 09c2d251b707 ("net-timestamp: add key to disambiguate concurrent datagrams") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-18net: gro: Fix a 'directive in macro's argument list' sparse warningGal Pressman
Following the cited commit, sparse started complaining about: ../include/net/gro.h:58:1: warning: directive in macro's argument list ../include/net/gro.h:59:1: warning: directive in macro's argument list Fix that by moving the defines out of the struct_group() macro. Fixes: de5a1f3ce4c8 ("net: gro: minor optimization for dev_gro_receive()") Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Acked-by: Alexander Lobakin <alexandr.lobakin@intel.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-18iwlwifi: mvm: refactor setting PPE thresholds in STA_HE_CTXT_CMDMiri Korenblit
We are setting the PPE Thresholds in STA_HE_CTXT_CMD according to HE PHY Capabilities IE. As EHT is introduced, we will have to set this thresholds according to EHT PHY Capabilities IE if we're in an EHT connection. Some parts of the code can be used for both HE and EHT. Put this parts in functions which will be used in the patch which adds support for EHT PPE. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20220205112029.48a508dfffef.If392e44d88f96ebed7fadf827e327194d4bd97b1@changeid Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
2022-02-17net: dsa: delete unused exported symbols for ethtool PHY statsVladimir Oltean
Introduced in commit cf963573039a ("net: dsa: Allow providing PHY statistics from CPU port"), it appears these were never used. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20220216193726.2926320-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-17PCI: Add defines for normal and subtractive PCI bridgesPali Rohár
Add these PCI class codes to pci_ids.h: PCI_CLASS_BRIDGE_PCI_NORMAL PCI_CLASS_BRIDGE_PCI_SUBTRACTIVE Use these defines in all kernel code for describing PCI class codes for normal and subtractive PCI bridges. [bhelgaas: similar change in pci-mvebu.c] Link: https://lore.kernel.org/r/20220214114109.26809-1-pali@kernel.org Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2022-02-17drm/amdgpu: add gc 10.3.6 supportYifan Zhang
this patch adds gc 10.3.6 support. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Fast path bpf marge for some -next work. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-17Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Alexei Starovoitov says: ==================== pull-request: bpf 2022-02-17 We've added 8 non-merge commits during the last 7 day(s) which contain a total of 8 files changed, 119 insertions(+), 15 deletions(-). The main changes are: 1) Add schedule points in map batch ops, from Eric. 2) Fix bpf_msg_push_data with len 0, from Felix. 3) Fix crash due to incorrect copy_map_value, from Kumar. 4) Fix crash due to out of bounds access into reg2btf_ids, from Kumar. 5) Fix a bpf_timer initialization issue with clang, from Yonghong. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Add schedule points in batch ops bpf: Fix crash due to out of bounds access into reg2btf_ids. selftests: bpf: Check bpf_msg_push_data return value bpf: Fix a bpf_timer initialization issue bpf: Emit bpf_timer in vmlinux BTF selftests/bpf: Add test for bpf_timer overwriting crash bpf: Fix crash due to incorrect copy_map_value bpf: Do not try bpf_msg_push_data with len 0 ==================== Link: https://lore.kernel.org/r/20220217190000.37925-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-17Merge tag 'net-5.17-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from wireless and netfilter. Current release - regressions: - dsa: lantiq_gswip: fix use after free in gswip_remove() - smc: avoid overwriting the copies of clcsock callback functions Current release - new code bugs: - iwlwifi: - fix use-after-free when no FW is present - mei: fix the pskb_may_pull check in ipv4 - mei: retry mapping the shared area - mvm: don't feed the hardware RFKILL into iwlmei Previous releases - regressions: - ipv6: mcast: use rcu-safe version of ipv6_get_lladdr() - tipc: fix wrong publisher node address in link publications - iwlwifi: mvm: don't send SAR GEO command for 3160 devices, avoid FW assertion - bgmac: make idm and nicpm resource optional again - atl1c: fix tx timeout after link flap Previous releases - always broken: - vsock: remove vsock from connected table when connect is interrupted by a signal - ping: change destination interface checks to match raw sockets - crypto: af_alg - get rid of alg_memory_allocated to avoid confusing semantics (and null-deref) after SO_RESERVE_MEM was added - ipv6: make exclusive flowlabel checks per-netns - bonding: force carrier update when releasing slave - sched: limit TC_ACT_REPEAT loops - bridge: multicast: notify switchdev driver whenever MC processing gets disabled because of max entries reached - wifi: brcmfmac: fix crash in brcm_alt_fw_path when WLAN not found - iwlwifi: fix locking when "HW not ready" - phy: mediatek: remove PHY mode check on MT7531 - dsa: mv88e6xxx: flush switchdev FDB workqueue before removing VLAN - dsa: lan9303: - fix polarity of reset during probe - fix accelerated VLAN handling" * tag 'net-5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (65 commits) bonding: force carrier update when releasing slave nfp: flower: netdev offload check for ip6gretap ipv6: fix data-race in fib6_info_hw_flags_set / fib6_purge_rt ipv4: fix data races in fib_alias_hw_flags_set net: dsa: lan9303: add VLAN IDs to master device net: dsa: lan9303: handle hwaccel VLAN tags vsock: remove vsock from connected table when connect is interrupted by a signal Revert "net: ethernet: bgmac: Use devm_platform_ioremap_resource_byname" ping: fix the dif and sdif check in ping_lookup net: usb: cdc_mbim: avoid altsetting toggling for Telit FN990 net: sched: limit TC_ACT_REPEAT loops tipc: fix wrong notification node addresses net: dsa: lantiq_gswip: fix use after free in gswip_remove() ipv6: per-netns exclusive flowlabel checks net: bridge: multicast: notify switchdev driver whenever MC processing gets disabled CDC-NCM: avoid overflow in sanity checking mctp: fix use after free net: mscc: ocelot: fix use-after-free in ocelot_vlan_del() bonding: fix data-races around agg_select_timer dpaa2-eth: Initialize mutex used in one step timestamping path ...
2022-02-17ipv6: fix data-race in fib6_info_hw_flags_set / fib6_purge_rtEric Dumazet
Because fib6_info_hw_flags_set() is called without any synchronization, all accesses to gi6->offload, fi->trap and fi->offload_failed need some basic protection like READ_ONCE()/WRITE_ONCE(). BUG: KCSAN: data-race in fib6_info_hw_flags_set / fib6_purge_rt read to 0xffff8881087d5886 of 1 bytes by task 13953 on cpu 0: fib6_drop_pcpu_from net/ipv6/ip6_fib.c:1007 [inline] fib6_purge_rt+0x4f/0x580 net/ipv6/ip6_fib.c:1033 fib6_del_route net/ipv6/ip6_fib.c:1983 [inline] fib6_del+0x696/0x890 net/ipv6/ip6_fib.c:2028 __ip6_del_rt net/ipv6/route.c:3876 [inline] ip6_del_rt+0x83/0x140 net/ipv6/route.c:3891 __ipv6_dev_ac_dec+0x2b5/0x370 net/ipv6/anycast.c:374 ipv6_dev_ac_dec net/ipv6/anycast.c:387 [inline] __ipv6_sock_ac_close+0x141/0x200 net/ipv6/anycast.c:207 ipv6_sock_ac_close+0x79/0x90 net/ipv6/anycast.c:220 inet6_release+0x32/0x50 net/ipv6/af_inet6.c:476 __sock_release net/socket.c:650 [inline] sock_close+0x6c/0x150 net/socket.c:1318 __fput+0x295/0x520 fs/file_table.c:280 ____fput+0x11/0x20 fs/file_table.c:313 task_work_run+0x8e/0x110 kernel/task_work.c:164 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:175 [inline] exit_to_user_mode_prepare+0x160/0x190 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:300 do_syscall_64+0x50/0xd0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae write to 0xffff8881087d5886 of 1 bytes by task 1912 on cpu 1: fib6_info_hw_flags_set+0x155/0x3b0 net/ipv6/route.c:6230 nsim_fib6_rt_hw_flags_set drivers/net/netdevsim/fib.c:668 [inline] nsim_fib6_rt_add drivers/net/netdevsim/fib.c:691 [inline] nsim_fib6_rt_insert drivers/net/netdevsim/fib.c:756 [inline] nsim_fib6_event drivers/net/netdevsim/fib.c:853 [inline] nsim_fib_event drivers/net/netdevsim/fib.c:886 [inline] nsim_fib_event_work+0x284f/0x2cf0 drivers/net/netdevsim/fib.c:1477 process_one_work+0x3f6/0x960 kernel/workqueue.c:2307 worker_thread+0x616/0xa70 kernel/workqueue.c:2454 kthread+0x2c7/0x2e0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 value changed: 0x22 -> 0x2a Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 1912 Comm: kworker/1:3 Not tainted 5.16.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events nsim_fib_event_work Fixes: 0c5fcf9e249e ("IPv6: Add "offload failed" indication to routes") Fixes: bb3c4ab93e44 ("ipv6: Add "offload" and "trap" indications to routes") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Amit Cohen <amcohen@nvidia.com> Cc: Ido Schimmel <idosch@nvidia.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220216173217.3792411-2-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-17ASoC: SOF: Replace zero-length array with flexible-array memberStephen Kitt
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use "flexible array members"[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. This helps with the ongoing efforts to globally enable -Warray-bounds and get us closer to being able to tighten the FORTIFY_SOURCE routines on memcpy(). [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays Link: https://github.com/KSPP/linux/issues/78 Link: https://github.com/KSPP/linux/issues/180 Suggested-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Stephen Kitt <steve@sk2.org> Acked-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20220217132755.1786130-1-steve@sk2.org Signed-off-by: Mark Brown <broonie@kernel.org>
2022-02-17mm/munlock: maintain page->mlock_count while unevictableHugh Dickins
Previous patches have been preparatory: now implement page->mlock_count. The ordering of the "Unevictable LRU" is of no significance, and there is no point holding unevictable pages on a list: place page->mlock_count to overlay page->lru.prev (since page->lru.next is overlaid by compound_head, which needs to be even so as not to satisfy PageTail - though 2 could be added instead of 1 for each mlock, if that's ever an improvement). But it's only safe to rely on or modify page->mlock_count while lruvec lock is held and page is on unevictable "LRU" - we can save lots of edits by continuing to pretend that there's an imaginary LRU here (there is an unevictable count which still needs to be maintained, but not a list). The mlock_count technique suffers from an unreliability much like with page_mlock(): while someone else has the page off LRU, not much can be done. As before, err on the safe side (behave as if mlock_count 0), and let try_to_unlock_one() move the page to unevictable if reclaim finds out later on - a few misplaced pages don't matter, what we want to avoid is imbalancing reclaim by flooding evictable lists with unevictable pages. I am not a fan of "if (!isolate_lru_page(page)) putback_lru_page(page);": if we have taken lruvec lock to get the page off its present list, then we save everyone trouble (and however many extra atomic ops) by putting it on its destination list immediately. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17mm/munlock: rmap call mlock_vma_page() munlock_vma_page()Hugh Dickins
Add vma argument to mlock_vma_page() and munlock_vma_page(), make them inline functions which check (vma->vm_flags & VM_LOCKED) before calling mlock_page() and munlock_page() in mm/mlock.c. Add bool compound to mlock_vma_page() and munlock_vma_page(): this is because we have understandable difficulty in accounting pte maps of THPs, and if passed a PageHead page, mlock_page() and munlock_page() cannot tell whether it's a pmd map to be counted or a pte map to be ignored. Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the others, and use that to call mlock_vma_page() at the end of the page adds, and munlock_vma_page() at the end of page_remove_rmap() (end or beginning? unimportant, but end was easier for assertions in testing). No page lock is required (although almost all adds happen to hold it): delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s. Certainly page lock did serialize with page migration, but I'm having difficulty explaining why that was ever important. Mlock accounting on THPs has been hard to define, differed between anon and file, involved PageDoubleMap in some places and not others, required clear_page_mlock() at some points. Keep it simple now: just count the pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks. page_add_new_anon_rmap() callers unchanged: they have long been calling lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED handling (it also checks for not VM_SPECIAL: I think that's overcautious, and inconsistent with other checks, that mmap_region() already prevents VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17mm/munlock: delete FOLL_MLOCK and FOLL_POPULATEHugh Dickins
If counting page mlocks, we must not double-count: follow_page_pte() can tell if a page has already been Mlocked or not, but cannot tell if a pte has already been counted or not: that will have to be done when the pte is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks for new anon pages, but there's no such tracking yet for others). Delete all the FOLL_MLOCK code - faulting in the missing pages will do all that is necessary, without special mlock_vma_page() calls from here. But then FOLL_POPULATE turns out to serve no purpose - it was there so that its absence would tell faultin_page() not to faultin page when setting up VM_LOCKONFAULT areas; but if there's no special work needed here for mlock, then there's no work at all here for VM_LOCKONFAULT. Have I got that right? I've not looked into the history, but see that FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different purpose before? Ah, yes, it was used to skip the old stack guard page. And is it intentional that COW is not broken on existing pages when setting up a VM_LOCKONFAULT area? I can see that being argued either way, and have no reason to disagree with current behaviour. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>