summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-08-23scsi: qla2xxx: Open-code qla2xxx_eh_target_reset()Hannes Reinecke
Device reset and target reset will be using different calling sequences, so open-code __qla2xxx_eh_generic_reset() in qla2xxx_eh_target_reset(). No functional changes. Link: https://lore.kernel.org/r/20210819091913.94436-3-hare@suse.de Cc: Nilesh Javali <njavali@marvell.com> Reviewed-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23scsi: qla2xxx: Do not call fc_block_scsi_eh() during bus resetHannes Reinecke
When calling bus reset the driver will be doing a full SAN resync, so there is no need to wait for any pending RSCNs; they'll be re-issued during resync anyway. Link: https://lore.kernel.org/r/20210819091913.94436-2-hare@suse.de Cc: Nilesh Javali <njavali@marvell.com> Reviewed-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23scsi: qla2xxx: Update version to 10.02.06.200-kNilesh Javali
Link: https://lore.kernel.org/r/20210817051315.2477-13-njavali@marvell.com Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23scsi: qla2xxx: edif: Fix returnvar.cocci warningskernel test robot
drivers/scsi/qla2xxx/qla_edif.c:213:25-29: Unneeded variable: "rval". Return "0" on line 264 Remove unneeded variable used to store return value. Generated by: scripts/coccinelle/misc/returnvar.cocci Link: https://lore.kernel.org/r/20210817051315.2477-12-njavali@marvell.com Fixes: 7ebb336e45ef ("scsi: qla2xxx: edif: Add start + stop bsgs") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: kernel test robot <lkp@intel.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23scsi: qla2xxx: Fix NVMe session down detectionQuinn Tran
When Target port transitions personality from one to another (NVMe <--> FCP), there could be some overlap of the two where one layer is going down while the other layer is coming up. This overlap can cause temporary I/O error. Detect those errors/transitions and recover from them. Triggers session tear down and allow relogin to re-drive the connection under the following conditions: - NVMe command error - On PRLO + N2N (rida format 2) Link: https://lore.kernel.org/r/20210817051315.2477-11-njavali@marvell.com Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23scsi: qla2xxx: Fix NVMe retryQuinn Tran
For target port that register itself as both FCP + NVMe, initiator driver will try to login one mode at a time. If the last mode did not succeed, then driver will try the other mode. When error is encountered, current code only flip to other mode one time (NVMe->FCP) and remain on the last mode. Driver wrongly assumed target port does not support PRLI NVMe, instead it was not ready to receive PRLI. This patch will alternate back and forth on every PRLI failure until login retry count has depleted or it is succeeded. Link: https://lore.kernel.org/r/20210817051315.2477-10-njavali@marvell.com Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23scsi: qla2xxx: Fix hang on NVMe command timeoutsArun Easi
The abort callback gets called only when it gets posted to firmware. The refcounting is done properly in the callback. On internal errors, the callback is not invoked leading to a hung I/O. Fix this by having separate error code when command gets returned from firmware. Link: https://lore.kernel.org/r/20210817051315.2477-9-njavali@marvell.com Signed-off-by: Arun Easi <aeasi@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23scsi: qla2xxx: Fix NVMe | FCP personality changeQuinn Tran
Currently driver saves the personality type (FCP|NVMe) at the start of first discovery of the remote device. If the remote device personality do change over time, then qla driver needs to present that to user to decide. Link: https://lore.kernel.org/r/20210817051315.2477-8-njavali@marvell.com Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23scsi: qla2xxx: edif: Do secure PLOGI when auth app is presentQuinn Tran
For initiator mode, always do secure login when authentication app started. Also remove redundant flags to indicate secure connection. Link: https://lore.kernel.org/r/20210817051315.2477-7-njavali@marvell.com Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23scsi: qla2xxx: edif: Add N2N support for EDIFQuinn Tran
For EDIF + N2N to work, firmware 9.8 or later is required. The driver will pause after PLOGI to allow app to authenticate. Once authentication completes, app will tell driver to do PRLI. Link: https://lore.kernel.org/r/20210817051315.2477-6-njavali@marvell.com Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23scsi: qla2xxx: Fix hang during NVMe session tear downArun Easi
The following hung task call trace was seen: [ 1230.183294] INFO: task qla2xxx_wq:523 blocked for more than 120 seconds. [ 1230.197749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1230.205585] qla2xxx_wq D 0 523 2 0x80004000 [ 1230.205636] Workqueue: qla2xxx_wq qlt_free_session_done [qla2xxx] [ 1230.205639] Call Trace: [ 1230.208100] __schedule+0x2c4/0x700 [ 1230.211607] schedule+0x38/0xa0 [ 1230.214769] schedule_timeout+0x246/0x2f0 [ 1230.222651] wait_for_completion+0x97/0x100 [ 1230.226921] qlt_free_session_done+0x6a0/0x6f0 [qla2xxx] [ 1230.232254] process_one_work+0x1a7/0x360 ...when device side port resets were done. Abort threads were getting out without processing due to the "deleted" flag check. The delete thread, meanwhile, could not proceed with a logout (that would have cleared out pending requests) as the logout IOCB work was not progressing. It appears like the hung qlt_free_session_done() thread is causing the ha->wq works on hold. The qlt_free_session_done() was hung waiting for nvme_fc_unregister_remoteport() + localport_delete cb to be complete, which would only happen when all I/Os are released. Fix this by allowing abort to progress until device delete is completely done. This should make the qlt_free_session_done() proceed without hang and thus clear up the deadlock. Link: https://lore.kernel.org/r/20210817051315.2477-5-njavali@marvell.com Signed-off-by: Arun Easi <aeasi@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23scsi: qla2xxx: edif: Fix EDIF enable flagQuinn Tran
edif_enabled is prematurely turned on if hardware is capable of handling the feature. However, firmware also needs to support EDIF before enabling this bit. Link: https://lore.kernel.org/r/20210817051315.2477-4-njavali@marvell.com Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23scsi: qla2xxx: edif: Reject AUTH ELS on session downQuinn Tran
Reject inflight AUTH ELS if driver is going through session recovery. Link: https://lore.kernel.org/r/20210817051315.2477-3-njavali@marvell.com Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23scsi: qla2xxx: edif: Fix stale sessionQuinn Tran
When firmware indicates session has been torn down via UPDATE SA IOCB or ELS Passthrough IOCB, the driver needs to also tear down the session. Link: https://lore.kernel.org/r/20210817051315.2477-2-njavali@marvell.com Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-23selftests/bpf: Reduce flakyness in timer_mimYucong Sun
This patch extends wait time in timer_mim. As observed in slow CI environment, it is possible to have interrupt/preemption long enough to cause the test to fail, almost 1 failure in 5 runs. Signed-off-by: Yucong Sun <fallentree@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210823213629.3519641-1-fallentree@fb.com
2021-08-23Merge branch 'Refactor cgroup_bpf internals to use more specific attach_type'Alexei Starovoitov
Dave Marchevsky says: ==================== The cgroup_bpf struct has a few arrays (effective, progs, and flags) of size MAX_BPF_ATTACH_TYPE. These are meant to separate progs by their attach type, currently represented by the bpf_attach_type enum. There are some bpf_attach_type values which are not valid attach types for cgroup bpf programs. Programs with these attach types will never be handled by cgroup_bpf_{attach,detach} and thus will never be held in cgroup_bpf structs. Even if such programs did make it into their reserved slot in those arrays, they would never be executed. Accordingly we can migrate to a new internal cgroup_bpf-specific enum for these arrays, saving some bytes per cgroup and making it more obvious which BPF programs belong there. netns_bpf_attach_type is an existing example of this pattern, let's do similar for cgroup_bpf. v1->v2: Address Daniel's comments * Reverse xmas tree ordering for def changes * Helper macro to reduce to_cgroup_bpf_attach_type boilerplate * checkpatch.pl complains: "ERROR: Macros with complex values should be enclosed in parentheses". Found some existing macros (do 'git grep "define case"') which get same complaint. Think it's fine to keep as-is since it's immediately undef'd. * Remove CG_BPF_ prefix from cgroup_bpf_attach_type * Although I agree that the prefix is redundant, the de-prefixed names feel a bit too 'general' given the internal use of the enum. e.g. when someone sees CGROUP_INET6_BIND it's not obvious that it should only be used in certain ways internally. * Don't feel strongly about this, just my thoughts as a noob to the internals. * Rebase onto latest bpf-next/master * No significant conflicts, some small boilerplate adjustments needed to catch up to Andrii's "bpf: Refactor BPF_PROG_RUN_ARRAY family of macros into functions" change ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-08-23bpf: Migrate cgroup_bpf to internal cgroup_bpf_attach_type enumDave Marchevsky
Add an enum (cgroup_bpf_attach_type) containing only valid cgroup_bpf attach types and a function to map bpf_attach_type values to the new enum. Inspired by netns_bpf_attach_type. Then, migrate cgroup_bpf to use cgroup_bpf_attach_type wherever possible. Functionality is unchanged as attach_type_to_prog_type switches in bpf/syscall.c were preventing non-cgroup programs from making use of the invalid cgroup_bpf array slots. As a result struct cgroup_bpf uses 504 fewer bytes relative to when its arrays were sized using MAX_BPF_ATTACH_TYPE. bpf_cgroup_storage is notably not migrated as struct bpf_cgroup_storage_key is part of uapi and contains a bpf_attach_type member which is not meant to be opaque. Similarly, bpf_cgroup_link continues to report its bpf_attach_type member to userspace via fdinfo and bpf_link_info. To ease disambiguation, bpf_attach_type variables are renamed from 'type' to 'atype' when changed to cgroup_bpf_attach_type. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210819092420.1984861-2-davemarchevsky@fb.com
2021-08-23xfs: only set IOMAP_F_SHARED when providing a srcmap to a writeDarrick J. Wong
While prototyping a free space defragmentation tool, I observed an unexpected IO error while running a sequence of commands that can be recreated by the following sequence of commands: # xfs_io -f -c "pwrite -S 0x58 -b 10m 0 10m" file1 # cp --reflink=always file1 file2 # punch-alternating -o 1 file2 # xfs_io -c "funshare 0 10m" file2 fallocate: Input/output error I then scraped this (abbreviated) stack trace from dmesg: WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450 CPU: 0 PID: 30788 Comm: xfs_io Not tainted 5.14.0-rc6-xfsx #rc6 5ef57b62a900814b3e4d885c755e9014541c8732 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:iomap_write_begin+0x376/0x450 RSP: 0018:ffffc90000c0fc20 EFLAGS: 00010297 RAX: 0000000000000001 RBX: ffffc90000c0fd10 RCX: 0000000000001000 RDX: ffffc90000c0fc54 RSI: 000000000000000c RDI: 000000000000000c RBP: ffff888005d5dbd8 R08: 0000000000102000 R09: ffffc90000c0fc50 R10: 0000000000b00000 R11: 0000000000101000 R12: ffffea0000336c40 R13: 0000000000001000 R14: ffffc90000c0fd10 R15: 0000000000101000 FS: 00007f4b8f62fe40(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056361c554108 CR3: 000000000524e004 CR4: 00000000001706f0 Call Trace: iomap_unshare_actor+0x95/0x140 iomap_apply+0xfa/0x300 iomap_file_unshare+0x44/0x60 xfs_reflink_unshare+0x50/0x140 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195] xfs_file_fallocate+0x27c/0x610 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195] vfs_fallocate+0x133/0x330 __x64_sys_fallocate+0x3e/0x70 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f4b8f79140a Looking at the iomap tracepoints, I saw this: iomap_iter: dev 8:64 ino 0x100 pos 0 length 0 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 0 length 131072 type DELALLOC flags SHARED iomap_iter_srcmap: dev 8:64 ino 0x100 bdev 8:64 addr 147456 offset 0 length 4096 type MAPPED flags iomap_iter: dev 8:64 ino 0x100 pos 0 length 4096 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 4096 length 4096 type DELALLOC flags SHARED console: WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450 The first time funshare calls ->iomap_begin, xfs sees that the first block is shared and creates a 128k delalloc reservation in the COW fork. The delalloc reservation is returned as dstmap, and the shared block is returned as srcmap. So far so good. funshare calls ->iomap_begin to try the second block. This time there's no srcmap (punch-alternating punched it out!) but we still have the delalloc reservation in the COW fork. Therefore, we again return the reservation as dstmap and the hole as srcmap. iomap_unshare_iter incorrectly tries to unshare the hole, which __iomap_write_begin rejects because shared regions must be fully written and therefore cannot require zeroing. Therefore, change the buffered write iomap_begin function not to set IOMAP_F_SHARED when there isn't a source mapping to read from for the unsharing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-08-23IMA: reject unknown hash algorithms in ima_get_hash_algoTHOBY Simon
The new function validate_hash_algo() assumed that ima_get_hash_algo() always return a valid 'enum hash_algo', but it returned the user-supplied value present in the digital signature without any bounds checks. Update ima_get_hash_algo() to always return a valid hash algorithm, defaulting on 'ima_hash_algo' when the user-supplied value inside the xattr is invalid. Signed-off-by: THOBY Simon <Simon.THOBY@viveris.fr> Reported-by: syzbot+e8bafe7b82c739eaf153@syzkaller.appspotmail.com Fixes: 50f742dd9147 ("IMA: block writes of the security.ima xattr with unsupported algorithms") Reviewed-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2021-08-23Keep read and write fds with each nlm_fileJ. Bruce Fields
We shouldn't really be using a read-only file descriptor to take a write lock. Most filesystems will put up with it. But NFS, for example, won't. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-23Merge tag 'aspeed-5.15-defconfig' of ↵Arnd Bergmann
git://git.kernel.org/pub/scm/linux/kernel/git/joel/bmc into arm/defconfig ASPEED defconfig updates for 5.15 - Enable new KCS SerIO driver - Enable SGPIO and EDAC for AST2400 now they are supported there - Switch to SLUB and enable SLAB_FREELIST_HARDENED - Regenerate defconfigs atop v5.14-rc2 * tag 'aspeed-5.15-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/joel/bmc: ARM: config: aspeed: Regenerate defconfigs ARM: config: aspeed_g4: Enable EDAC and SPGIO ARM: config: aspeed: Enable KCS adapter for raw SerIO ARM: config: aspeed: Enable hardened allocator feature Link: https://lore.kernel.org/r/CACPK8XdzKdnyrpjKukGWieBhLgQnBs+y=LuSr_weot=Ovy3+9A@mail.gmail.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-08-23ucounts: Increase ucounts reference counter before the security hookAlexey Gladkov
We need to increment the ucounts reference counter befor security_prepare_creds() because this function may fail and abort_creds() will try to decrement this reference. [ 96.465056][ T8641] FAULT_INJECTION: forcing a failure. [ 96.465056][ T8641] name fail_page_alloc, interval 1, probability 0, space 0, times 0 [ 96.478453][ T8641] CPU: 1 PID: 8641 Comm: syz-executor668 Not tainted 5.14.0-rc6-syzkaller #0 [ 96.487215][ T8641] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 [ 96.497254][ T8641] Call Trace: [ 96.500517][ T8641] dump_stack_lvl+0x1d3/0x29f [ 96.505758][ T8641] ? show_regs_print_info+0x12/0x12 [ 96.510944][ T8641] ? log_buf_vmcoreinfo_setup+0x498/0x498 [ 96.516652][ T8641] should_fail+0x384/0x4b0 [ 96.521141][ T8641] prepare_alloc_pages+0x1d1/0x5a0 [ 96.526236][ T8641] __alloc_pages+0x14d/0x5f0 [ 96.530808][ T8641] ? __rmqueue_pcplist+0x2030/0x2030 [ 96.536073][ T8641] ? lockdep_hardirqs_on_prepare+0x3e2/0x750 [ 96.542056][ T8641] ? alloc_pages+0x3f3/0x500 [ 96.546635][ T8641] allocate_slab+0xf1/0x540 [ 96.551120][ T8641] ___slab_alloc+0x1cf/0x350 [ 96.555689][ T8641] ? kzalloc+0x1d/0x30 [ 96.559740][ T8641] __kmalloc+0x2e7/0x390 [ 96.563980][ T8641] ? kzalloc+0x1d/0x30 [ 96.568029][ T8641] kzalloc+0x1d/0x30 [ 96.571903][ T8641] security_prepare_creds+0x46/0x220 [ 96.577174][ T8641] prepare_creds+0x411/0x640 [ 96.581747][ T8641] __sys_setfsuid+0xe2/0x3a0 [ 96.586333][ T8641] do_syscall_64+0x3d/0xb0 [ 96.590739][ T8641] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 96.596611][ T8641] RIP: 0033:0x445a69 [ 96.600483][ T8641] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 [ 96.620152][ T8641] RSP: 002b:00007f1054173318 EFLAGS: 00000246 ORIG_RAX: 000000000000007a [ 96.628543][ T8641] RAX: ffffffffffffffda RBX: 00000000004ca4c8 RCX: 0000000000445a69 [ 96.636600][ T8641] RDX: 0000000000000010 RSI: 00007f10541732f0 RDI: 0000000000000000 [ 96.644550][ T8641] RBP: 00000000004ca4c0 R08: 0000000000000001 R09: 0000000000000000 [ 96.652500][ T8641] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004ca4cc [ 96.660631][ T8641] R13: 00007fffffe0b62f R14: 00007f1054173400 R15: 0000000000022000 Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred") Reported-by: syzbot+01985d7909f9468f013c@syzkaller.appspotmail.com Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/97433b1742c3331f02ad92de5a4f07d673c90613.1629735352.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-08-23ucounts: Fix regression preventing increasing of rlimits in init_user_nsEric W. Biederman
"Ma, XinjianX" <xinjianx.ma@intel.com> reported: > When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests > in kselftest failed with following message. > > # selftests: mqueue: mq_perf_tests > # > # Initial system state: > # Using queue path: /mq_perf_tests > # RLIMIT_MSGQUEUE(soft): 819200 > # RLIMIT_MSGQUEUE(hard): 819200 > # Maximum Message Size: 8192 > # Maximum Queue Size: 10 > # Nice value: 0 > # > # Adjusted system state for testing: > # RLIMIT_MSGQUEUE(soft): (unlimited) > # RLIMIT_MSGQUEUE(hard): (unlimited) > # Maximum Message Size: 16777216 > # Maximum Queue Size: 65530 > # Nice value: -20 > # Continuous mode: (disabled) > # CPUs to pin: 3 > # ./mq_perf_tests: mq_open() at 296: Too many open files > not ok 2 selftests: mqueue: mq_perf_tests # exit=1 > ``` > > Test env: > rootfs: debian-10 > gcc version: 9 After investigation the problem turned out to be that ucount_max for the rlimits in init_user_ns was being set to the initial rlimit value. The practical problem is that ucount_max provides a limit that applications inside the user namespace can not exceed. Which means in practice that rlimits that have been converted to use the ucount infrastructure were not able to exceend their initial rlimits. Solve this by setting the relevant values of ucount_max to RLIM_INIFINITY. A limit in init_user_ns is pointless so the code should allow the values to grow as large as possible without riscking an underflow or an overflow. As the ltp test case was a bit of a pain I have reproduced the rlimit failure and tested the fix with the following little C program: > #include <stdio.h> > #include <fcntl.h> > #include <sys/stat.h> > #include <mqueue.h> > #include <sys/time.h> > #include <sys/resource.h> > #include <errno.h> > #include <string.h> > #include <stdlib.h> > #include <limits.h> > #include <unistd.h> > > int main(int argc, char **argv) > { > struct mq_attr mq_attr; > struct rlimit rlim; > mqd_t mqd; > int ret; > > ret = getrlimit(RLIMIT_MSGQUEUE, &rlim); > if (ret != 0) { > fprintf(stderr, "getrlimit(RLIMIT_MSGQUEUE) failed: %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > printf("RLIMIT_MSGQUEUE %lu %lu\n", > rlim.rlim_cur, rlim.rlim_max); > rlim.rlim_cur = RLIM_INFINITY; > rlim.rlim_max = RLIM_INFINITY; > ret = setrlimit(RLIMIT_MSGQUEUE, &rlim); > if (ret != 0) { > fprintf(stderr, "setrlimit(RLIMIT_MSGQUEUE, RLIM_INFINITY) failed: %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > > memset(&mq_attr, 0, sizeof(struct mq_attr)); > mq_attr.mq_maxmsg = 65536 - 1; > mq_attr.mq_msgsize = 16*1024*1024 - 1; > > mqd = mq_open("/mq_rlimit_test", O_RDONLY|O_CREAT, 0600, &mq_attr); > if (mqd == (mqd_t)-1) { > fprintf(stderr, "mq_open failed: %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > ret = mq_close(mqd); > if (ret) { > fprintf(stderr, "mq_close failed; %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > > return EXIT_SUCCESS; > } Fixes: 6e52a9f0532f ("Reimplement RLIMIT_MSGQUEUE on top of ucounts") Fixes: d7c9e99aee48 ("Reimplement RLIMIT_MEMLOCK on top of ucounts") Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Reported-by: kernel test robot lkp@intel.com Acked-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/87eeajswfc.fsf_-_@disp2133 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-08-23bpf: Fix ringbuf helper function compatibilityDaniel Borkmann
Commit 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") extended check_map_func_compatibility() by enforcing map -> helper function match, but not helper -> map type match. Due to this all of the bpf_ringbuf_*() helper functions could be used with a wrong map type such as array or hash map, leading to invalid access due to type confusion. Also, both BPF_FUNC_ringbuf_{submit,discard} have ARG_PTR_TO_ALLOC_MEM as argument and not a BPF map. Therefore, their check_map_func_compatibility() presence is incorrect since it's only for map type checking. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: Ryota Shiga (Flatt Security) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-08-23io_uring: add support for IORING_OP_LINKATDmitry Kadashev
IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and arguments. In some internal places 'hardlink' is used instead of 'link' to avoid confusion with the SQE links. Name 'link' conflicts with the existing 'link' member of io_kiocb. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23io_uring: add support for IORING_OP_SYMLINKATDmitry Kadashev
IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags and arguments. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-11-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23bio: improve kerneldoc documentation for bio_alloc_kiocb()Jens Axboe
We're missing a description for the 'nr_vecs' parameter. While in there, clarify that freeing a bio allocated through this function must be done from process context. Fixes: 1cbbd31c4ada ("bio: add allocation cache abstraction") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23block: provide bio_clear_hipri() helperJens Axboe
Any case that turns off REQ_HIPRI must also clear BIO_PERCPU_CACHE, as non-polled IO may complete through hard/soft IRQ and hence isn't safe for our polled bio alloc cache. Provide a helper that does just that, and use it in the merging code as well if we split a bio and turn off polling. Fixes: be863b9e4348 ("block: clear BIO_PERCPU_CACHE flag if polling isn't supported") Reported-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23block: use the percpu bio cache in __blkdev_direct_IOChristoph Hellwig
Use bio_alloc_kiocb to dip into the percpu cache of bios when the caller asks for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23io_uring: enable use of bio alloc cacheJens Axboe
Mark polled IO as being safe for dipping into the bio allocation cache, in case the targeted bio_set has it enabled. This brings an IOPOLL gen2 Optane QD=128 workload from ~3.2M IOPS to ~3.5M IOPS. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23block: clear BIO_PERCPU_CACHE flag if polling isn't supportedJens Axboe
The bio alloc cache relies on the fact that a polled bio will complete in process context, clear the cacheable flag if we disable polling for a given bio. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23bio: add allocation cache abstractionJens Axboe
Add a per-cpu bio_set cache for bio allocations, enabling us to quickly recycle them instead of going through the slab allocator. This cache isn't IRQ safe, and hence is only really suitable for polled IO. Very simple - keeps a count of bio's in the cache, and maintains a max of 512 with a slack of 64. If we get above max + slack, we drop slack number of bio's. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23fs: add kiocb alloc cache flagJens Axboe
If this kiocb can safely use the polled bio allocation cache, then this flag must be set. Generally this can be set for polled IO, where we will not see IRQ completions of the request. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23bio: optimize initialization of a bioJens Axboe
The memset() used is measurably slower in targeted benchmarks, wasting about 1% of the total runtime, or 50% of the (later) hot path cached bio alloc. Get rid of it and fill in the bio manually. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23io_uring: fix io_try_cancel_userdata race for iowqPavel Begunkov
WARNING: CPU: 1 PID: 5870 at fs/io_uring.c:5975 io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975 CPU: 0 PID: 5870 Comm: iou-wrk-5860 Not tainted 5.14.0-rc6-next-20210820-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975 Call Trace: io_async_cancel fs/io_uring.c:6014 [inline] io_issue_sqe+0x22d5/0x65a0 fs/io_uring.c:6407 io_wq_submit_work+0x1dc/0x300 fs/io_uring.c:6511 io_worker_handle_work+0xa45/0x1840 fs/io-wq.c:533 io_wqe_worker+0x2cc/0xbb0 fs/io-wq.c:582 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 io_try_cancel_userdata() can be called from io_async_cancel() executing in the io-wq context, so the warning fires, which is there to alert anyone accessing task->io_uring->io_wq in a racy way. However, io_wq_put_and_exit() always first waits for all threads to complete, so the only detail left is to zero tctx->io_wq after the context is removed. note: one little assumption is that when IO_WQ_WORK_CANCEL, the executor won't touch ->io_wq, because io_wq_destroy() might cancel left pending requests in such a way. Cc: stable@vger.kernel.org Reported-by: syzbot+b0c9d1588ae92866515f@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/dfdd37a80cfa9ffd3e59538929c99cdd55d8699e.1629721757.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23io_uring: add support for IORING_OP_MKDIRATDmitry Kadashev
IORING_OP_MKDIRAT behaves like mkdirat(2) and takes the same flags and arguments. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-10-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: update do_*() helpers to return intsDmitry Kadashev
Update the following to return int rather than long, for uniformity with the rest of the do_* helpers in namei.c: * do_rmdir() * do_unlinkat() * do_mkdirat() * do_mknodat() * do_symlinkat() Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210514143202.dmzfcgz5hnauy7ze@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-9-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: make do_linkat() take struct filenameDmitry Kadashev
Pass in the struct filename pointers instead of the user string, for uniformity with do_renameat2, do_unlinkat, do_mknodat, etc. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-8-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: add getname_uflags()Dmitry Kadashev
There are a couple of places where we already open-code the (flags & AT_EMPTY_PATH) check and io_uring will likely add another one in the future. Let's just add a simple helper getname_uflags() that handles this directly and use it. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210415100815.edrn4a7cy26wkowe@wittgenstein/ Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-7-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: make do_symlinkat() take struct filenameDmitry Kadashev
Pass in the struct filename pointers instead of the user string, for uniformity with the recently converted do_mkdnodat(), do_unlinkat(), do_renameat(), do_mkdirat(). Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-6-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: make do_mknodat() take struct filenameDmitry Kadashev
Pass in the struct filename pointers instead of the user string, for uniformity with the recently converted do_unlinkat(), do_renameat(), do_mkdirat(). Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-5-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: make do_mkdirat() take struct filenameDmitry Kadashev
Pass in the struct filename pointers instead of the user string, and update the three callers to do the same. This is heavily based on commit dbea8d345177 ("fs: make do_renameat2() take struct filename"). This behaves like do_unlinkat() and do_renameat2(). Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-4-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: change filename_parentat() calling conventionsDmitry Kadashev
Since commit 5c31b6cedb675 ("namei: saner calling conventions for filename_parentat()") filename_parentat() had the following behavior WRT the passed in struct filename *: * On error the name is consumed (putname() is called on it); * On success the name is returned back as the return value; Now there is a need for filename_create() and filename_lookup() variants that do not consume the passed filename, and following the same "consume the name only on error" semantics is proven to be hard to reason about and result in confusing code. Hence this preparation change splits filename_parentat() into two: one that always consumes the name and another that never consumes the name. This will allow to implement two filename_create() variants in the same way, and is a consistent and hopefully easier to reason about approach. Link: https://lore.kernel.org/io-uring/CAOKbgA7MiqZAq3t-HDCpSGUFfco4hMA9ArAE-74fTpU+EkvKPw@mail.gmail.com/ Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Link: https://lore.kernel.org/r/20210708063447.3556403-3-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23namei: ignore ERR/NULL names in putname()Dmitry Kadashev
Supporting ERR/NULL names in putname() makes callers code cleaner, and is what some other path walking functions already support for the same reason. This also removes a few existing IS_ERR checks before putname(). Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/CAHk-=wgCac9hBsYzKMpHk0EbLgQaXR=OUAjHaBtaY+G8A9KhFg@mail.gmail.com/ Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Link: https://lore.kernel.org/r/20210708063447.3556403-2-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23io_uring: IRQ rw completion batchingPavel Begunkov
Employ inline completion logic for read/write completions done via io_req_task_complete(). If ->uring_lock is contended, just do normal request completion, but if not, make tctx_task_work() to grab the lock and do batched inline completions in io_req_task_complete(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/94589c3ce69eaed86a21bb1ec696407a54fab1aa.1629286357.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23io_uring: batch task work lockingPavel Begunkov
Many task_work handlers either grab ->uring_lock, or may benefit from having it. Move locking logic out of individual handlers to a lazy approach controlled by tctx_task_work(), so we don't keep doing tons of mutex lock/unlock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d6a34e147f2507a2f3e2fa1e38a9c541dcad3929.1629286357.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23io_uring: flush completions for fallbacksPavel Begunkov
io_fallback_req_func() doesn't expect anyone creating inline completions, and no one currently does that. Teach the function to flush completions preparing for further changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8b941516921f72e1a64d58932d671736892d7fff.1629286357.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23io_uring: add ->splice_fd_in checksPavel Begunkov
->splice_fd_in is used only by splice/tee, but no other request checks it for validity. Add the check for most of request types excluding reads/writes/sends/recvs, we don't want overhead for them and can leave them be as is until the field is actually used. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f44bc2acd6777d932de3d71a5692235b5b2b7397.1629451684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23io_uring: add clarifying comment for io_cqring_ev_posted()Jens Axboe
We've previously had an issue where overflow flush unconditionally calls io_cqring_ev_posted() even if it didn't flush any events to the ring, causing wake and eventfd increment where no new events are available. Some applications don't like that, see commit b18032bb0a88 for details. This came up in discussion for another patch recently, hence add a comment detailing what the relationship between calling the events posted helper and CQ ring entries is. Link: https://lore.kernel.org/io-uring/77a44fce-c831-16a6-8e80-9aee77f496a2@kernel.dk/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23io_uring: place fixed tables under memcg limitsPavel Begunkov
Fixed tables may be large enough, place all of them together with allocated tags under memcg limits. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b3ac9f5da9821bb59837b5fe25e8ef4be982218c.1629451684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>