summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-08-30blk-throttle: consider 'carryover_ios/bytes' in throtl_trim_slice()Yu Kuai
Currently, 'carryover_ios/bytes' is not handled in throtl_trim_slice(), for consequence, 'carryover_ios/bytes' will be used to throttle bio multiple times, for example: 1) set iops limit to 100, and slice start is 0, slice end is 100ms; 2) current time is 0, and 10 ios are dispatched, those io won't be throttled and io_disp is 10; 3) still at current time 0, update iops limit to 1000, carryover_ios is updated to (0 - 10) = -10; 4) in this slice(0 - 100ms), io_allowed = 100 + (-10) = 90, which means only 90 ios can be dispatched without waiting; 5) assume that io is throttled in slice(0 - 100ms), and throtl_trim_slice() update silce to (100ms - 200ms). In this case, 'carryover_ios/bytes' is not cleared and still only 90 ios can be dispatched between 100ms - 200ms. Fix this problem by updating 'carryover_ios/bytes' in throtl_trim_slice(). Fixes: a880ae93e5b5 ("blk-throttle: fix io hung due to configuration updates") Reported-by: zhuxiaohui <zhuxiaohui.400@bytedance.com> Link: https://lore.kernel.org/all/20230812072116.42321-1-zhuxiaohui.400@bytedance.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-30blk-throttle: use calculate_io/bytes_allowed() for throtl_trim_slice()Yu Kuai
There are no functional changes, just make the code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-30blk-throttle: fix wrong comparation while 'carryover_ios/bytes' is negativeYu Kuai
carryover_ios/bytes[] can be negative in the case that ios are dispatched in the slice in advance, and then configuration is updated. For example: 1) set iops limit to 1000, and slice start is 0, slice end is 100ms; 2) current time is 0, and 100 ios are dispatched, those ios will not be throttled, hence io_disp is 100; 3) still at current time 0, update iops limit to 100, then carryover_ios is (0 - 100) = -100; 4) then, dispatch a new io at time 0, the expected result is that this io will wait for 1s. The calculation in tg_within_iops_limit: io_disp = 0; io_allowed = calculate_io_allowed + carryover_ios = 10 + (-100) = -90; io won't be throttled if (io_disp + 1 < io_allowed) passed. Before this patch, in step 4) (io_disp + 1 < io_allowed) is passed, because -90 for unsigned value is very huge, and such io won't be throttled. Fix this problem by checking if 'io/bytes_allowed' is negative first. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-30blk-throttle: print signed value 'carryover_bytes/ios' for userYu Kuai
'carryover_bytes/ios' can be negative, indicate that some bio is dispatched in advance within slice while configuration is updated. Print a huge value is not user-friendly. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-30Merge tag 'lsm-pr-20230829' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull LSM updates from Paul Moore: - Add proper multi-LSM support for xattrs in the security_inode_init_security() hook Historically the LSM layer has only allowed a single LSM to add an xattr to an inode, with IMA/EVM measuring that and adding its own as well. As we work towards promoting IMA/EVM to a "proper LSM" instead of the special case that it is now, we need to better support the case of multiple LSMs each adding xattrs to an inode and after several attempts we now appear to have something that is working well. It is worth noting that in the process of making this change we uncovered a problem with Smack's SMACK64TRANSMUTE xattr which is also fixed in this pull request. - Additional LSM hook constification Two patches to constify parameters to security_capget() and security_binder_transfer_file(). While I generally don't make a special note of who submitted these patches, these were the work of an Outreachy intern, Khadija Kamran, and that makes me happy; hopefully it does the same for all of you reading this. - LSM hook comment header fixes One patch to add a missing hook comment header, one to fix a minor typo. - Remove an old, unused credential function declaration It wasn't clear to me who should pick this up, but it was trivial, obviously correct, and arguably the LSM layer has a vested interest in credentials so I merged it. Sadly I'm now noticing that despite my subject line cleanup I didn't cleanup the "unsued" misspelling, sigh * tag 'lsm-pr-20230829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: lsm: constify the 'file' parameter in security_binder_transfer_file() lsm: constify the 'target' parameter in security_capget() lsm: add comment block for security_sk_classify_flow LSM hook security: Fix ret values doc for security_inode_init_security() cred: remove unsued extern declaration change_create_files_as() evm: Support multiple LSMs providing an xattr evm: Align evm_inode_init_security() definition with LSM infrastructure smack: Set the SMACK64TRANSMUTE xattr in smack_inode_init_security() security: Allow all LSMs to provide xattrs for inode_init_security hook lsm: fix typo in security_file_lock() comment header
2023-08-30io_uring: Don't set affinity on a dying sqpoll threadGabriel Krisman Bertazi
Syzbot reported a null-ptr-deref of sqd->thread inside io_sqpoll_wq_cpu_affinity. It turns out the sqd->thread can go away from under us during io_uring_register, in case the process gets a fatal signal during io_uring_register. It is not particularly hard to hit the race, and while I am not sure this is the exact case hit by syzbot, it solves it. Finally, checking ->thread is enough to close the race because we locked sqd while "parking" the thread, thus preventing it from going away. I reproduced it fairly consistently with a program that does: int main(void) { ... io_uring_queue_init(RING_LEN, &ring1, IORING_SETUP_SQPOLL); while (1) { io_uring_register_iowq_aff(ring, 1, &mask); } } Executed in a loop with timeout to trigger SIGTERM: while true; do timeout 1 /a.out ; done This will hit the following BUG() in very few attempts. BUG: kernel NULL pointer dereference, address: 00000000000007a8 PGD 800000010e949067 P4D 800000010e949067 PUD 10e46e067 PMD 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 15715 Comm: dead-sqpoll Not tainted 6.5.0-rc7-next-20230825-g193296236fa0-dirty #23 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:io_sqpoll_wq_cpu_affinity+0x27/0x70 Code: 90 90 90 0f 1f 44 00 00 55 53 48 8b 9f 98 03 00 00 48 85 db 74 4f 48 89 df 48 89 f5 e8 e2 f8 ff ff 48 8b 43 38 48 85 c0 74 22 <48> 8b b8 a8 07 00 00 48 89 ee e8 ba b1 00 00 48 89 df 89 c5 e8 70 RSP: 0018:ffffb04040ea7e70 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff93c010749e40 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffffffffa7653331 RDI: 00000000ffffffff RBP: ffffb04040ea7eb8 R08: 0000000000000000 R09: c0000000ffffdfff R10: ffff93c01141b600 R11: ffffb04040ea7d18 R12: ffff93c00ea74840 R13: 0000000000000011 R14: 0000000000000000 R15: ffff93c00ea74800 FS: 00007fb7c276ab80(0000) GS:ffff93c36f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007a8 CR3: 0000000111634003 CR4: 0000000000370ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die_body+0x1a/0x60 ? page_fault_oops+0x154/0x440 ? do_user_addr_fault+0x174/0x7b0 ? exc_page_fault+0x63/0x140 ? asm_exc_page_fault+0x22/0x30 ? io_sqpoll_wq_cpu_affinity+0x27/0x70 __io_register_iowq_aff+0x2b/0x60 __io_uring_register+0x614/0xa70 __x64_sys_io_uring_register+0xaa/0x1a0 do_syscall_64+0x3a/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7fb7c226fec9 Code: 2e 00 b8 ca 00 00 00 0f 05 eb a5 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 7f 2d 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe2c0674f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb7c226fec9 RDX: 00007ffe2c067530 RSI: 0000000000000011 RDI: 0000000000000003 RBP: 00007ffe2c0675d0 R08: 00007ffe2c067550 R09: 00007ffe2c067550 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffe2c067750 R14: 0000000000000000 R15: 0000000000000000 </TASK> Modules linked in: CR2: 00000000000007a8 ---[ end trace 0000000000000000 ]--- Reported-by: syzbot+c74fea926a78b8a91042@syzkaller.appspotmail.com Fixes: ebdfefc09c6d ("io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/87v8cybuo6.fsf@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-30Merge tag 'selinux-pr-20230829' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux updates from Paul Moore: "Thirty three SELinux patches, which is a pretty big number for us, but there isn't really anything scary in here; in fact we actually manage to remove 10 lines of code with this :) - Promote the SELinux DEBUG_HASHES macro to CONFIG_SECURITY_SELINUX_DEBUG The DEBUG_HASHES macro was a buried SELinux specific preprocessor debug macro that was a problem waiting to happen. Promoting the debug macro to a proper Kconfig setting should help both improve the visibility of the feature as well enable improved test coverage. We've moved some additional debug functions under the CONFIG_SECURITY_SELINUX_DEBUG flag and we may see more work in the future. - Emit a pr_notice() message if virtual memory is executable by default As this impacts the SELinux access control policy enforcement, if the system's configuration is such that virtual memory is executable by default we print a single line notice to the console. - Drop avtab_search() in favor of avtab_search_node() Both functions are nearly identical so we removed avtab_search() and converted the callers to avtab_search_node(). - Add some SELinux network auditing helpers The helpers not only reduce a small amount of code duplication, but they provide an opportunity to improve UDP flood performance slightly by delaying initialization of the audit data in some cases. - Convert GFP_ATOMIC allocators to GFP_KERNEL when reading SELinux policy There were two SELinux policy load helper functions that were allocating memory using GFP_ATOMIC, they have been converted to GFP_KERNEL. - Quiet a KMSAN warning in selinux_inet_conn_request() A one-line error path (re)set patch that resolves a KMSAN warning. It is important to note that this doesn't represent a real bug in the current code, but it quiets KMSAN and arguably hardens the code against future changes. - Cleanup the policy capability accessor functions This is a follow-up to the patch which reverted SELinux to using a global selinux_state pointer. This patch cleans up some artifacts of that change and turns each accessor into a one-line READ_ONCE() call into the policy capabilities array. - A number of patches from Christian Göttsche Christian submitted almost two-thirds of the patches in this pull request as he worked to harden the SELinux code against type differences, variable overflows, etc. - Support for separating early userspace from the kernel in policy, with a later revert We did have a patch that added a new userspace initial SID which would allow SELinux to distinguish between early user processes created before the initial policy load and the kernel itself. Unfortunately additional post-merge testing revealed a problematic interaction with an old SELinux userspace on an old version of Ubuntu so we've reverted the patch until we can resolve the compatibility issue. - Remove some outdated comments dealing with LSM hook registration When we removed the runtime disable functionality we forgot to remove some old comments discussing the importance of LSM hook registration ordering. - Minor administrative changes Stephen Smalley updated his email address and "debranded" SELinux from "NSA SELinux" to simply "SELinux". We've come a long way from the original NSA submission and I would consider SELinux a true community project at this point so removing the NSA branding just makes sense" * tag 'selinux-pr-20230829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: (33 commits) selinux: prevent KMSAN warning in selinux_inet_conn_request() selinux: use unsigned iterator in nlmsgtab code selinux: avoid implicit conversions in policydb code selinux: avoid implicit conversions in selinuxfs code selinux: make left shifts well defined selinux: update type for number of class permissions in services code selinux: avoid implicit conversions in avtab code selinux: revert SECINITSID_INIT support selinux: use GFP_KERNEL while reading binary policy selinux: update comment on selinux_hooks[] selinux: avoid implicit conversions in services code selinux: avoid implicit conversions in mls code selinux: use identical iterator type in hashtab_duplicate() selinux: move debug functions into debug configuration selinux: log about VM being executable by default selinux: fix a 0/NULL mistmatch in ad_net_init_from_iif() selinux: introduce SECURITY_SELINUX_DEBUG configuration selinux: introduce and use lsm_ad_net_init*() helpers selinux: update my email address selinux: add missing newlines in pr_err() statements ...
2023-08-30netfilter: xt_u32: validate user space inputWander Lairson Costa
The xt_u32 module doesn't validate the fields in the xt_u32 structure. An attacker may take advantage of this to trigger an OOB read by setting the size fields with a value beyond the arrays boundaries. Add a checkentry function to validate the structure. This was originally reported by the ZDI project (ZDI-CAN-18408). Fixes: 1b50b8a371e9 ("[NETFILTER]: Add u32 match") Cc: stable@vger.kernel.org Signed-off-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-30netfilter: xt_sctp: validate the flag_info countWander Lairson Costa
sctp_mt_check doesn't validate the flag_count field. An attacker can take advantage of that to trigger a OOB read and leak memory information. Add the field validation in the checkentry function. Fixes: 2e4e6a17af35 ("[NETFILTER] x_tables: Abstraction layer for {ip,ip6,arp}_tables") Cc: stable@vger.kernel.org Reported-by: Lucas Leong <wmliang@infosec.exchange> Signed-off-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-30netfilter: nft_exthdr: Fix non-linear header modificationXiao Liang
Fix skb_ensure_writable() size. Don't use nft_tcp_header_pointer() to make it explicit that pointers point to the packet (not local buffer). Fixes: 99d1712bc41c ("netfilter: exthdr: tcp option set support") Fixes: 7890cbea66e7 ("netfilter: exthdr: add support for tcp option removal") Cc: stable@vger.kernel.org Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-30drm/i915: mark requests for GuC virtual engines to avoid use-after-freeAndrzej Hajda
References to i915_requests may be trapped by userspace inside a sync_file or dmabuf (dma-resv) and held indefinitely across different proceses. To counter-act the memory leaks, we try to not to keep references from the request past their completion. On the other side on fence release we need to know if rq->engine is valid and points to hw engine (true for non-virtual requests). To make it possible extra bit has been added to rq->execution_mask, for marking virtual engines. Fixes: bcb9aa45d5a0 ("Revert "drm/i915: Hold reference to intel_context over life of i915_request"") Signed-off-by: Chris Wilson <chris.p.wilson@linux.intel.com> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230821153035.3903006-1-andrzej.hajda@intel.com (cherry picked from commit 280410677af763f3871b93e794a199cfcf6fb580) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-08-30Merge tag 'audit-pr-20230829' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "Six audit patches, the highlights are: - Add an explicit cond_resched() call when generating PATH records Certain tracefs/debugfs operations can generate a *lot* of audit PATH entries and if one has an aggressive system configuration (not the default) this can cause a soft lockup in the audit code as it works to process all of these new entries. This is in sharp contrast to the common case where only one or two PATH entries are logged. In order to fix this corner case without excessively impacting the common case we're adding a single cond_rescued() call between two of the most intensive loops in the __audit_inode_child() function. - Various minor cleanups We removed a conditional header file as the included header already had the necessary logic in place, fixed a dummy function's return value, and the usual collection of checkpatch.pl noise (whitespace, brace, and trailing statement tweaks)" * tag 'audit-pr-20230829' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: move trailing statements to next line audit: cleanup function braces and assignment-in-if-condition audit: add space before parenthesis and around '=', "==", and '<' audit: fix possible soft lockup in __audit_inode_child() audit: correct audit_filter_inodes() definition audit: include security.h unconditionally
2023-08-30NFSv4.2: fix handling of COPY ERR_OFFLOAD_NO_REQOlga Kornievskaia
If the client sent a synchronous copy and the server replied with ERR_OFFLOAD_NO_REQ indicating that it wants an asynchronous copy instead, the client should retry with asynchronous copy. Fixes: 539f57b3e0fd ("NFS handle COPY ERR_OFFLOAD_NO_REQS") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-30NFS: Guard against READDIR loop when entry names exceed MAXNAMELENBenjamin Coddington
Commit 64cfca85bacd asserts the only valid return values for nfs2/3_decode_dirent should not include -ENAMETOOLONG, but for a server that sends a filename3 which exceeds MAXNAMELEN in a READDIR response the client's behavior will be to endlessly retry the operation. We could map -ENAMETOOLONG into -EBADCOOKIE, but that would produce truncated listings without any error. The client should return an error for this case to clearly assert that the server implementation must be corrected. Fixes: 64cfca85bacd ("NFS: Return valid errors from nfs2/3_decode_dirent()") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-30bpf, docs: s/eBPF/BPF in standards documentsDavid Vernet
There isn't really anything other than just "BPF" at this point, so referring to it as "eBPF" in our standards document just causes unnecessary confusion. Let's just be consistent and use "BPF". Suggested-by: Will Hawkins <hawkinsw@obs.cr> Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230828155948.123405-4-void@manifault.com
2023-08-30bpf, docs: Add abi.rst document to standardization subdirectoryDavid Vernet
As specified in the IETF BPF charter, the BPF working group has plans to add one or more informational documents that recommend conventions and guidelines for producing portable BPF program binaries. The instruction-set.rst document currently contains a "Registers and calling convention" subsection which dictates a calling convention that belongs in an ABI document, rather than an instruction set document. Let's move it to a new abi.rst document so we can clean it up. The abi.rst document will of course be significantly changed and expanded upon over time. For now, it's really just a placeholder which will contain ABI-specific language that doesn't belong in other documents. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230828155948.123405-3-void@manifault.com
2023-08-30bpf, docs: Move linux-notes.rst to root bpf docs treeDavid Vernet
In commit 4d496be9ca05 ("bpf,docs: Create new standardization subdirectory"), I added a standardization/ directory to the BPF documentation, which will contain the docs that will be standardized as part of the effort with the IETF. I included linux-notes.rst in that directory, but I shouldn't have. It doesn't contain anything that will be standardized. Let's move it back to Documentation/bpf. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230828155948.123405-2-void@manifault.com
2023-08-30fs/jfs: Use common ucs2 upper case tableDr. David Alan Gilbert
Use the UCS-2 upper case tables from nls, that are shared with smb. This code in JFS is hard to test, so we're only reusing the same tables (which are identical), not trying to reuse the rest of the helper functions. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-30fs/smb/client: Use common code in clientDr. David Alan Gilbert
Now we've got the common code, use it for the client as well. Note there's a change here where we're using the server version of UniStrcat now which had different types (__le16 vs wchar_t) but it's not interpreting the value other than checking for 0, however we do need casts to keep sparse happy. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-30fs/smb: Swing unicode common code from smb->NLSDr. David Alan Gilbert
Swing most of the inline functions and unicode tables into nls from the copy in smb/server. This is UCS-2 rather than most of the rest of the code in NLS, but it currently seems like the best place for it. The actual unicode.c implementations vary much more between server and client so they're unmoved. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-30fs/smb: Remove unicode 'lower' tablesDr. David Alan Gilbert
The unicode glue in smb/*/..uniupr.h has a section guarded by 'ifndef UNIUPR_NOLOWER' - but that's always defined in smb/*/..unicode.h. Nuke those tables. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-30SMB3: rename macro CIFS_SERVER_IS_CHAN to avoid confusionSteve French
Since older dialects such as CIFS do not support multichannel the macro CIFS_SERVER_IS_CHAN can be confusing (it requires SMB 3 or later) so shorten its name to "SERVER_IS_CHAN" Suggested-by: Tom Talpey <tom@talpey.com> Acked-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-30platform/x86: asus-wmi: corrections to egpu safety checkLuke D. Jones
An incorrect if statement was preventing the enablement of the egpu. Fixes: d49f4d1a30ac ("platform/x86: asus-wmi: don't allow eGPU switching if eGPU not connected") Signed-off-by: Luke D. Jones <luke@ljones.dev> Link: https://lore.kernel.org/r/20230830022908.36264-2-luke@ljones.dev Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2023-08-30platform/x86: mlx-platform: Add dependency on PCI to KconfigVadim Pasternak
Add dependency on PCI to avoid 'mlx-platform' compilation error in case CONFIG_PCI is not set. Failed on i386: CONFIG_ACPI=y CONFIG_ISA=y Error In function 'mlxplat_pci_fpga_device_init': implicit declaration of function 'pci_request_region': 6204 | err = pci_request_region(pci_dev, 0, res_name); | ^~~~~~~~~~~~~~~~~~ | pci_request_regions Fixes: 1316e0af2dc0 ("platform: mellanox: mlx-platform: Introduce ACPI init flow") Signed-off-by: Vadim Pasternak <vadimp@nvidia.com> Reviewed-by: Michael Shych <michaelsh@nvidia.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20230829133748.58208-2-vadimp@nvidia.com Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2023-08-30dma-contiguous: fix the Kconfig entry for CONFIG_DMA_NUMA_CMAChristoph Hellwig
It makes no sense to expose CONFIG_DMA_NUMA_CMA if CONFIG_NUMA is not enabled, and random config options shouldn't be default unless there is a good reason. Replace the default NUMA with a depends on to fix both issues. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <roin.murphy@arm.com>
2023-08-30cpu/hotplug: Prevent self deadlock on CPU hot-unplugThomas Gleixner
Xiongfeng reported and debugged a self deadlock of the task which initiates and controls a CPU hot-unplug operation vs. the CFS bandwidth timer. CPU1 CPU2 T1 sets cfs_quota starts hrtimer cfs_bandwidth 'period_timer' T1 is migrated to CPU2 T1 initiates offlining of CPU1 Hotplug operation starts ... 'period_timer' expires and is re-enqueued on CPU1 ... take_cpu_down() CPU1 shuts down and does not handle timers anymore. They have to be migrated in the post dead hotplug steps by the control task. T1 runs the post dead offline operation T1 is scheduled out T1 waits for 'period_timer' to expire T1 waits there forever if it is scheduled out before it can execute the hrtimer offline callback hrtimers_dead_cpu(). Cure this by delegating the hotplug control operation to a worker thread on an online CPU. This takes the initiating user space task, which might be affected by the bandwidth timer, completely out of the picture. Reported-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Yu Liao <liaoyu15@huawei.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/lkml/8e785777-03aa-99e1-d20e-e956f5685be6@huawei.com Link: https://lore.kernel.org/r/87h6oqdq0i.ffs@tglx
2023-08-30tick/rcu: Fix false positive "softirq work is pending" messagesPaul Gortmaker
In commit 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the new function report_idle_softirq() was created by breaking code out of the existing can_stop_idle_tick() for kernels v5.18 and newer. In doing so, the code essentially went from a one conditional: if (a && b && c) warn(); to a three conditional: if (!a) return; if (!b) return; if (!c) return; warn(); But that conversion got the condition for the RT specific local_bh_blocked() wrong. The original condition was: !local_bh_blocked() but the conversion failed to negate it so it ended up as: if (!local_bh_blocked()) return false; This issue lay dormant until another fixup for the same commit was added in commit a7e282c77785 ("tick/rcu: Fix bogus ratelimit condition"). This commit realized the ratelimit was essentially set to zero instead of ten, and hence *no* softirq pending messages would ever be issued. Once this commit was backported via linux-stable, both the v6.1 and v6.4 preempt-rt kernels started printing out 10 instances of this at boot: NOHZ tick-stop error: local softirq work is pending, handler #80!!! Remove the negation and return when local_bh_blocked() evaluates to true to bring the correct behaviour back. Fixes: 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Wen Yang <wenyang.linux@foxmail.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230818200757.1808398-1-paul.gortmaker@windriver.com
2023-08-30csky: Fixup compile errorGuo Ren
Add header file for asmlinkage macro. Error log: In file included from arch/csky/include/asm/ptrace.h:7, from arch/csky/include/asm/elf.h:6, from include/linux/elf.h:6, from kernel/extable.c:6: arch/csky/include/asm/traps.h:43:11: error: expected ';' before 'void' 43 | asmlinkage void do_trap_unknown(struct pt_regs *regs); | ^~~~~ Fixes: c8171a86b274 ("csky: Fixup -Wmissing-prototypes warning") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org>
2023-08-30rbd: use list_for_each_entry() helperJinjie Ruan
Convert list_for_each() to list_for_each_entry() so that the tmp list_head pointer and list_entry() call are no longer needed, which can reduce a few lines of code. No functional changed. Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-30Merge tag 'md-next-20230814-resend' into loongarch-nextHuacai Chen
LoongArch architecture changes for 6.5 (raid5/6 optimization) depend on the md changes to fix build and work, so merge them to create a base.
2023-08-30dma-debug: don't call __dma_entry_alloc_check_leak() under free_entries_lockSergey Senozhatsky
__dma_entry_alloc_check_leak() calls into printk -> serial console output (qcom geni) and grabs port->lock under free_entries_lock spin lock, which is a reverse locking dependency chain as qcom_geni IRQ handler can call into dma-debug code and grab free_entries_lock under port->lock. Move __dma_entry_alloc_check_leak() call out of free_entries_lock scope so that we don't acquire serial console's port->lock under it. Trimmed-down lockdep splat: The existing dependency chain (in reverse order) is: -> #2 (free_entries_lock){-.-.}-{2:2}: _raw_spin_lock_irqsave+0x60/0x80 dma_entry_alloc+0x38/0x110 debug_dma_map_page+0x60/0xf8 dma_map_page_attrs+0x1e0/0x230 dma_map_single_attrs.constprop.0+0x6c/0xc8 geni_se_rx_dma_prep+0x40/0xcc qcom_geni_serial_isr+0x310/0x510 __handle_irq_event_percpu+0x110/0x244 handle_irq_event_percpu+0x20/0x54 handle_irq_event+0x50/0x88 handle_fasteoi_irq+0xa4/0xcc handle_irq_desc+0x28/0x40 generic_handle_domain_irq+0x24/0x30 gic_handle_irq+0xc4/0x148 do_interrupt_handler+0xa4/0xb0 el1_interrupt+0x34/0x64 el1h_64_irq_handler+0x18/0x24 el1h_64_irq+0x64/0x68 arch_local_irq_enable+0x4/0x8 ____do_softirq+0x18/0x24 ... -> #1 (&port_lock_key){-.-.}-{2:2}: _raw_spin_lock_irqsave+0x60/0x80 qcom_geni_serial_console_write+0x184/0x1dc console_flush_all+0x344/0x454 console_unlock+0x94/0xf0 vprintk_emit+0x238/0x24c vprintk_default+0x3c/0x48 vprintk+0xb4/0xbc _printk+0x68/0x90 register_console+0x230/0x38c uart_add_one_port+0x338/0x494 qcom_geni_serial_probe+0x390/0x424 platform_probe+0x70/0xc0 really_probe+0x148/0x280 __driver_probe_device+0xfc/0x114 driver_probe_device+0x44/0x100 __device_attach_driver+0x64/0xdc bus_for_each_drv+0xb0/0xd8 __device_attach+0xe4/0x140 device_initial_probe+0x1c/0x28 bus_probe_device+0x44/0xb0 device_add+0x538/0x668 of_device_add+0x44/0x50 of_platform_device_create_pdata+0x94/0xc8 of_platform_bus_create+0x270/0x304 of_platform_populate+0xac/0xc4 devm_of_platform_populate+0x60/0xac geni_se_probe+0x154/0x160 platform_probe+0x70/0xc0 ... -> #0 (console_owner){-...}-{0:0}: __lock_acquire+0xdf8/0x109c lock_acquire+0x234/0x284 console_flush_all+0x330/0x454 console_unlock+0x94/0xf0 vprintk_emit+0x238/0x24c vprintk_default+0x3c/0x48 vprintk+0xb4/0xbc _printk+0x68/0x90 dma_entry_alloc+0xb4/0x110 debug_dma_map_sg+0xdc/0x2f8 __dma_map_sg_attrs+0xac/0xe4 dma_map_sgtable+0x30/0x4c get_pages+0x1d4/0x1e4 [msm] msm_gem_pin_pages_locked+0x38/0xac [msm] msm_gem_pin_vma_locked+0x58/0x88 [msm] msm_ioctl_gem_submit+0xde4/0x13ac [msm] drm_ioctl_kernel+0xe0/0x15c drm_ioctl+0x2e8/0x3f4 vfs_ioctl+0x30/0x50 ... Chain exists of: console_owner --> &port_lock_key --> free_entries_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(free_entries_lock); lock(&port_lock_key); lock(free_entries_lock); lock(console_owner); *** DEADLOCK *** Call trace: dump_backtrace+0xb4/0xf0 show_stack+0x20/0x30 dump_stack_lvl+0x60/0x84 dump_stack+0x18/0x24 print_circular_bug+0x1cc/0x234 check_noncircular+0x78/0xac __lock_acquire+0xdf8/0x109c lock_acquire+0x234/0x284 console_flush_all+0x330/0x454 console_unlock+0x94/0xf0 vprintk_emit+0x238/0x24c vprintk_default+0x3c/0x48 vprintk+0xb4/0xbc _printk+0x68/0x90 dma_entry_alloc+0xb4/0x110 debug_dma_map_sg+0xdc/0x2f8 __dma_map_sg_attrs+0xac/0xe4 dma_map_sgtable+0x30/0x4c get_pages+0x1d4/0x1e4 [msm] msm_gem_pin_pages_locked+0x38/0xac [msm] msm_gem_pin_vma_locked+0x58/0x88 [msm] msm_ioctl_gem_submit+0xde4/0x13ac [msm] drm_ioctl_kernel+0xe0/0x15c drm_ioctl+0x2e8/0x3f4 vfs_ioctl+0x30/0x50 ... Reported-by: Rob Clark <robdclark@chromium.org> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-30s390/airq: remove lsi_mask from airq_structBenjamin Block
Remove the field `lsi_mask` from `struct airq_struct` as it is not utilized for any adapter interrupt, other than setting it to the default value of 0xff. Because nobody is using this functionality, all it does is cost a little bit of time with each delivered adapter interrupt. Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-30s390/mm: use __set_memory() variants where usefulHeiko Carstens
Use the __set_memory_yy() variants instead of set_memory_yy() where useful. This allows to make the code a bit more readable. This also fixes the debug pagealloc case, where set_memory_4k() might be called for an area larger than 8TB which would lead to an overflow of the num_pages parameter of set_memory_4k(). However RELOC_HIDE() has to be used for the __set_memory_4k() case for the time being, to avoid compiler warnings because of performing pointer arithmetic on a NULL pointer, which has undefined behavior. This happens because __va(0) always translates to NULL. However this will change, and as soon as this happens the RELOC_HIDE() hack can be removed again. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-30s390/set_memory: add __set_memory() variantHeiko Carstens
Add a __set_memory_yy() variant for all set_memory_yy() implementations. The new variant takes start and end void pointers, which allows them to be used without the usual unsigned long cast. However more important: the new variant can be used for areas larger than 8TB. The old variant comes with an "int numpages" parameter, which overflows with more than 8TB. Given that for debug_pagealloc set_memory_4k() is used on the whole kernel mapping this is not only a theoretical problem, but must be fixed. Changing all set_memory_yy() variants only on s390 to take an "unsigned long numpages" parameter is not possible, since the common module code requires an int parameter from all architectures on these functions. See module_set_memory(). Therefore change/fix this on s390 only with a new interface, and address common code later. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-30s390/set_memory: generate all set_memory() functionsHeiko Carstens
The set_memory() functions all follow the same pattern. Use a macro to generate them, and in result remove a bit of code. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-30s390/mm: improve description of mapping permissions of prefix pagesHeiko Carstens
Slightly improve the description which explains why the first prefix page must be mapped executable when the BEAR-enhancement facility is not installed. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-30s390/amode31: change type of __samode31, __eamode31, etcHeiko Carstens
For consistencs reasons change the type of __samode31, __eamode31, __stext_amode31, and __etext_amode31 to a char pointer so they (nearly) match the type of all other sections. This allows for code simplifications with follow-on patches. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-30s390/mm: simplify kernel mapping setupHeiko Carstens
The kernel mapping is setup in two stages: in the decompressor map all pages with RWX permissions, and within the kernel change all mappings to their final permissions, where most of the mappings are changed from RWX to RWNX. Change this and map all pages RWNX from the beginning, however without enabling noexec via control register modification. This means that effectively all pages are used with RWX permissions like before. When the final permissions have been applied to the kernel mapping enable noexec via control register modification. This allows to remove quite a bit of non-obvious code. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-30s390: remove "noexec" optionHeiko Carstens
Do the same like x86 with commit 76ea0025a214 ("x86/cpu: Remove "noexec"") and remove the "noexec" kernel command line option. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-30s390/vmem: fix virtual vs physical address confusionAlexander Gordeev
Fix virtual vs physical address confusion (which currently are the same). Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-30s390/dcssblk: fix lockdep warningGerald Schaefer
dcssblk_remove_store() holds the dcssblk_devices_sem semaphore while calling del_gendisk(dev_info->gd), which in turn tries to acquire disk->open_mutex. Then there is dcssblk_release(), which is called with disk->open_mutex held, and tries to acquire dcssblk_devices_sem. Lockdep reports this as possible circular locking dependency (CPU0 = dcssblk_remove_store, CPU1 = dcssblk_release): [ 44.948865] Possible unsafe locking scenario: [ 44.948866] CPU0 CPU1 [ 44.948867] ---- ---- [ 44.948868] lock(&dcssblk_devices_sem); [ 44.948870] lock(&disk->open_mutex); [ 44.948872] lock(&dcssblk_devices_sem); [ 44.948874] lock(&disk->open_mutex); [ 44.948876] *** DEADLOCK *** In practice, this deadlock should not happen, since dcssblk_remove_store() checks for dev_info->use_count != 0 after acquiring dcssblk_devices_sem, and breaks out before calling del_gendisk(). dev_info->use_count will be decremented in dcssblk_release(), protected by dcssblk_devices_sem. Still there is no need for dcssblk_remove_store() to hold the dcssblk_devices_sem until after calling del_gendisk(), as this only protects dcssblk internal data. So fix the lockdep warning by releasing dcssblk_devices_sem earlier. Also move the segment_unload() loop up, similar to dcssblk_shared_store() error path, no need to do that after calling del_gendisk(). Also change dcssblk_shared_store() error path, where dcssblk_devices_sem was also released only after calling del_gendisk(), and a similar lockdep warning could be triggered (but also deadlock prevented by check for dev_info->use_count). Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-30s390/monreader: fix virtual vs physical address confusionGerald Schaefer
Fix virtual vs physical address confusion (which currently are the same). Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-30net: ipv4, ipv6: fix IPSTATS_MIB_OUTOCTETS increment duplicatedHeng Guo
commit edf391ff1723 ("snmp: add missing counters for RFC 4293") had already added OutOctets for RFC 4293. In commit 2d8dbb04c63e ("snmp: fix OutOctets counter to include forwarded datagrams"), OutOctets was counted again, but not removed from ip_output(). According to RFC 4293 "3.2.3. IP Statistics Tables", ipipIfStatsOutTransmits is not equal to ipIfStatsOutForwDatagrams. So "IPSTATS_MIB_OUTOCTETS must be incremented when incrementing" is not accurate. And IPSTATS_MIB_OUTOCTETS should be counted after fragment. This patch reverts commit 2d8dbb04c63e ("snmp: fix OutOctets counter to include forwarded datagrams") and move IPSTATS_MIB_OUTOCTETS to ip_finish_output2 for ipv4. Reviewed-by: Filip Pudak <filip.pudak@windriver.com> Signed-off-by: Heng Guo <heng.guo@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-30x86/audit: Fix -Wmissing-variable-declarations warning for ia32_xyz_classJustin Stitt
When building x86 defconfig with Clang-18 I get the following warnings: | arch/x86/ia32/audit.c:6:10: warning: no previous extern declaration for non-static variable 'ia32_dir_class' [-Wmissing-variable-declarations] | 6 | unsigned ia32_dir_class[] = { | arch/x86/ia32/audit.c:11:10: warning: no previous extern declaration for non-static variable 'ia32_chattr_class' [-Wmissing-variable-declarations] | 11 | unsigned ia32_chattr_class[] = { | arch/x86/ia32/audit.c:16:10: warning: no previous extern declaration for non-static variable 'ia32_write_class' [-Wmissing-variable-declarations] | 16 | unsigned ia32_write_class[] = { | arch/x86/ia32/audit.c:21:10: warning: no previous extern declaration for non-static variable 'ia32_read_class' [-Wmissing-variable-declarations] | 21 | unsigned ia32_read_class[] = { | arch/x86/ia32/audit.c:26:10: warning: no previous extern declaration for non-static variable 'ia32_signal_class' [-Wmissing-variable-declarations] | 26 | unsigned ia32_signal_class[] = { These warnings occur due to their respective extern declarations being scoped inside of audit_classes_init as well as only being enabled with `CONFIG_IA32_EMULATION=y`: | static int __init audit_classes_init(void) | { | #ifdef CONFIG_IA32_EMULATION | extern __u32 ia32_dir_class[]; | extern __u32 ia32_write_class[]; | extern __u32 ia32_read_class[]; | extern __u32 ia32_chattr_class[]; | audit_register_class(AUDIT_CLASS_WRITE_32, ia32_write_class); | audit_register_class(AUDIT_CLASS_READ_32, ia32_read_class); | audit_register_class(AUDIT_CLASS_DIR_WRITE_32, ia32_dir_class); | audit_register_class(AUDIT_CLASS_CHATTR_32, ia32_chattr_class); | #endif | audit_register_class(AUDIT_CLASS_WRITE, write_class); | audit_register_class(AUDIT_CLASS_READ, read_class); | audit_register_class(AUDIT_CLASS_DIR_WRITE, dir_class); | audit_register_class(AUDIT_CLASS_CHATTR, chattr_class); | return 0; | } Lift the extern declarations to their own header and resolve scoping issues (and thus fix the warnings). Moreover, change __u32 to unsigned so that we match the definitions: | unsigned ia32_dir_class[] = { | #include <asm-generic/audit_dir_write.h> | ~0U | }; | | unsigned ia32_chattr_class[] = { | #include <asm-generic/audit_change_attr.h> | ~0U | }; | ... This patch is similar to commit: 0e5e3d4461a22d73 ("x86/audit: Fix a -Wmissing-prototypes warning for ia32_classify_syscall()") [1] Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/all/20200516123816.2680-1-b.thiel@posteo.de/ [1] Link: https://github.com/ClangBuiltLinux/linux/issues/1920 Link: https://lore.kernel.org/r/20230829-missingvardecl-audit-v1-1-34efeb7f3539@google.com
2023-08-30sched/core: Report correct state for TASK_IDLE | TASK_FREEZABLENeilBrown
task_state_index() ignores uninteresting state flags (such as TASK_FREEZABLE) for most states, but for TASK_IDLE and TASK_RTLOCK_WAIT it does not. So if a task is waiting TASK_IDLE|TASK_FREEZABLE it gets incorrectly reported as TASK_UNINTERRUPTIBLE or "D". (it is planned for nfsd to change to use this state). Fix this by only testing the interesting bits and not the irrelevant bits in __task_state_index() Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/169335025927.5133.4781141800413736103@noble.neil.brown.name
2023-08-30bpf, sockmap: Fix preempt_rt splat when using raw_spin_lock_tJohn Fastabend
Sockmap and sockhash maps are a collection of psocks that are objects representing a socket plus a set of metadata needed to manage the BPF programs associated with the socket. These maps use the stab->lock to protect from concurrent operations on the maps, e.g. trying to insert to objects into the array at the same time in the same slot. Additionally, a sockhash map has a bucket lock to protect iteration and insert/delete into the hash entry. Each psock has a psock->link which is a linked list of all the maps that a psock is attached to. This allows a psock (socket) to be included in multiple sockmap and sockhash maps. This linked list is protected the psock->link_lock. They _must_ be nested correctly to avoid deadlock: lock(stab->lock) : do BPF map operations and psock insert/delete lock(psock->link_lock) : add map to psock linked list of maps unlock(psock->link_lock) unlock(stab->lock) For non PREEMPT_RT kernels both raw_spin_lock_t and spin_lock_t are guaranteed to not sleep. But, with PREEMPT_RT kernels the spin_lock_t variants may sleep. In the current code we have many patterns like this: rcu_critical_section: raw_spin_lock(stab->lock) spin_lock(psock->link_lock) <- may sleep ouch spin_unlock(psock->link_lock) raw_spin_unlock(stab->lock) rcu_critical_section Nesting spin_lock() inside a raw_spin_lock() violates locking rules for PREEMPT_RT kernels. And additionally we do alloc(GFP_ATOMICS) inside the stab->lock, but those might sleep on PREEMPT_RT kernels. The result is splats like this: ./test_progs -t sockmap_basic [ 33.344330] bpf_testmod: loading out-of-tree module taints kernel. [ 33.441933] [ 33.442089] ============================= [ 33.442421] [ BUG: Invalid wait context ] [ 33.442763] 6.5.0-rc5-01731-gec0ded2e0282 #4958 Tainted: G O [ 33.443320] ----------------------------- [ 33.443624] test_progs/2073 is trying to lock: [ 33.443960] ffff888102a1c290 (&psock->link_lock){....}-{3:3}, at: sock_map_update_common+0x2c2/0x3d0 [ 33.444636] other info that might help us debug this: [ 33.444991] context-{5:5} [ 33.445183] 3 locks held by test_progs/2073: [ 33.445498] #0: ffff88811a208d30 (sk_lock-AF_INET){+.+.}-{0:0}, at: sock_map_update_elem_sys+0xff/0x330 [ 33.446159] #1: ffffffff842539e0 (rcu_read_lock){....}-{1:3}, at: sock_map_update_elem_sys+0xf5/0x330 [ 33.446809] #2: ffff88810d687240 (&stab->lock){+...}-{2:2}, at: sock_map_update_common+0x177/0x3d0 [ 33.447445] stack backtrace: [ 33.447655] CPU: 10 PID To fix observe we can't readily remove the allocations (for that we would need to use/create something similar to bpf_map_alloc). So convert raw_spin_lock_t to spin_lock_t. We note that sock_map_update that would trigger the allocate and potential sleep is only allowed through sys_bpf ops and via sock_ops which precludes hw interrupts and low level atomic sections in RT preempt kernel. On non RT preempt kernel there are no changes here and spin locks sections and alloc(GFP_ATOMIC) are still not sleepable. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230830053517.166611-1-john.fastabend@gmail.com
2023-08-30docs/bpf: Add description for CO-RE relocationsEduard Zingerman
Add a section on CO-RE relocations to llvm_relo.rst. Describe relevant .BTF.ext structure, `enum bpf_core_relo_kind` and `struct bpf_core_relo` in some detail. Description is based on doc-strings from: - include/uapi/linux/bpf.h:struct bpf_core_relo - tools/lib/bpf/relo_core.c:__bpf_core_types_match() Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20230826222912.2560865-2-eddyz87@gmail.com
2023-08-30bpf, docs: Correct source of offset for program-local callWill Hawkins
The offset to use when calculating the target of a program-local call is in the instruction's imm field, not its offset field. Signed-off-by: Will Hawkins <hawkinsw@obs.cr> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20230826053258.1860167-1-hawkinsw@obs.cr
2023-08-30selftests/bpf: Fix flaky cgroup_iter_sleepable subtestYonghong Song
Occasionally, with './test_progs -j' on my vm, I will hit the following failure: test_cgrp_local_storage:PASS:join_cgroup /cgrp_local_storage 0 nsec test_cgroup_iter_sleepable:PASS:skel_open 0 nsec test_cgroup_iter_sleepable:PASS:skel_load 0 nsec test_cgroup_iter_sleepable:PASS:attach_iter 0 nsec test_cgroup_iter_sleepable:PASS:iter_create 0 nsec test_cgroup_iter_sleepable:FAIL:cgroup_id unexpected cgroup_id: actual 1 != expected 2812 #48/5 cgrp_local_storage/cgroup_iter_sleepable:FAIL #48 cgrp_local_storage:FAIL Finally, I decided to do some investigation since the test is introduced by myself. It turns out the reason is due to cgroup_fd with value 0. In cgroup_iter, a cgroup_fd of value 0 means the root cgroup. /* from cgroup_iter.c */ if (fd) cgrp = cgroup_v1v2_get_from_fd(fd); else if (id) cgrp = cgroup_get_from_id(id); else /* walk the entire hierarchy by default. */ cgrp = cgroup_get_from_path("/"); That is why we got cgroup_id 1 instead of expected 2812. Why we got a cgroup_fd 0? Nobody should really touch 'stdin' (fd 0) in test_progs. I traced 'close' syscall with stack trace and found the root cause, which is a bug in bpf_obj_pinning.c. Basically, the code closed fd 0 although it should not. Fixing the bug in bpf_obj_pinning.c also resolved the above cgroup_iter_sleepable subtest failure. Fixes: 3b22f98e5a05 ("selftests/bpf: Add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230827150551.1743497-1-yonghong.song@linux.dev
2023-08-30xsk: Fix xsk_build_skb() error: 'skb' dereferencing possible ERR_PTR()Tirthendu Sarkar
Currently, xsk_build_skb() is a function that builds skb in two possible ways and then is ended with common error handling. We can distinguish four possible error paths and handling in xsk_build_skb(): 1. sock_alloc_send_skb fails: Retry (skb is NULL). 2. skb_store_bits fails : Free skb and retry. 3. MAX_SKB_FRAGS exceeded: Free skb, cleanup and drop packet. 4. alloc_page fails for frag: Retry page allocation w/o freeing skb 1] and 3] can happen in xsk_build_skb_zerocopy(), which is one of the two code paths responsible for building skb. Common error path in xsk_build_skb() assumes that in case errno != -EAGAIN, skb is a valid pointer, which is wrong as kernel test robot reports that in xsk_build_skb_zerocopy() other errno values are returned for skb being NULL. To fix this, set -EOVERFLOW as error when MAX_SKB_FRAGS are exceeded and packet needs to be dropped in both xsk_build_skb() and xsk_build_skb_zerocopy() and use this to distinguish against all other error cases. Also, add explicit kfree_skb() for 3] so that handling of 1], 2], and 3] becomes identical where allocation needs to be retried. Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Closes: https://lore.kernel.org/r/202307210434.OjgqFcbB-lkp@intel.com Link: https://lore.kernel.org/bpf/20230823144713.2231808-1-tirthendu.sarkar@intel.com