linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2020-10-07	s390/sie: fix typo in SIGP code description	Julian Wiedmann
	s/ait address/at address Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-10-07	s390/lib: fix kernel doc for memcmp()	Julian Wiedmann
	s/count/n Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-10-07	s390/zcrypt: Introduce Failure Injection feature	Harald Freudenberger
	Introduce a way to specify additional debug flags with an crpyto request to be able to trigger certain failures within the zcrypt device drivers and/or ap core code. This failure injection possibility is only enabled with a kernel debug build CONFIG_ZCRYPT_DEBUG) and should never be available on a regular kernel running in production environment. Details: * The ioctl(ICARSAMODEXPO) get's a struct ica_rsa_modexpo. If the leftmost bit of the 32 bit unsigned int inputdatalength field is set, the uppermost 16 bits are separated and used as debug flag value. The process is checked to have the CAP_SYS_ADMIN capability enabled or EPERM is returned. * The ioctl(ICARSACRT) get's a struct ica_rsa_modexpo_crt. If the leftmost bit of the 32 bit unsigned int inputdatalength field is set, the uppermost 16 bits are separated and used als debug flag value. The process is checked to have the CAP_SYS_ADMIN capability enabled or EPERM is returned. * The ioctl(ZSECSENDCPRB) used to send CCA CPRBs get's a struct ica_xcRB. If the leftmost bit of the 32 bit unsigned int status field is set, the uppermost 16 bits of this field are used as debug flag value. The process is checked to have the CAP_SYS_ADMIN capability enabled or EPERM is returned. * The ioctl(ZSENDEP11CPRB) used to send EP11 CPRBs get's a struct ep11_urb. If the leftmost bit of the 64 bit unsigned int req_len field is set, the uppermost 16 bits of this field are used as debug flag value. The process is checked to have the CAP_SYS_ADMIN capability enabled or EPERM is returned. So it is possible to send an additional 16 bit value to the zcrypt API to be used to carry a failure injection command which may trigger special behavior within the zcrypt API and layers below. This 16 bit value is for the rest of the test referred as 'fi command' for Failure Injection. The lower 8 bits of the fi command construct a numerical argument in the range of 1-255 and is the 'fi action' to be performed with the request or the resulting reply: * 0x00 (all requests): No failure injection action but flags may be provided which may affect the processing of the request or reply. * 0x01 (only CCA CPRBs): The CPRB's agent_ID field is set to 'FF'. This results in an reply code 0x90 (Transport-Protocol Failure). * 0x02 (only CCA CPRBs): After the APQN to send to has been chosen, the domain field within the CPRB is overwritten with value 99 to enforce an reply with RY 0x8A. * 0x03 (all requests): At NQAP invocation the invalid qid value 0xFF00 is used causing an response code of 0x01 (AP queue not valid). The upper 8 bits of the fi command may carry bit flags which may influence the processing of an request or response: * 0x01: No retry. If this bit is set, the usual loop in the zcrypt API which retries an CPRB up to 10 times when the lower layers return with EAGAIN is abandoned after the first attempt to send the CPRB. * 0x02: Toggle special. Toggles the special bit on this request. This should result in an reply code RY~0x41 and result in an ioctl failure with errno EINVAL. This failure injection possibilities may get some further extensions in the future. As of now this is a starting point for Continuous Test and Integration to trigger some failures and watch for the reaction of the ap bus and zcrypt device driver code. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-10-07	s390/zcrypt: move ap_msg param one level up the call chain	Harald Freudenberger
	Move the creating and disposal of the struct ap_message one level up the call chain. The ap message was constructed in the calling functions in msgtype50 and msgtype6 but only for the ica rsa messages. For CCA and EP11 CPRBs the ap message struct is created in the zcrypt api functions. This patch moves the construction of the ap message struct into the functions zcrypt_rsa_modexpo and zcrypt_rsa_crt. So now all the 4 zcrypt api functions zcrypt_rsa_modexpo, zcrypt_rsa_crt, zcrypt_send_cprb and zcrypt_send_ep11_cprb appear and act similar. There are no functional changes coming with this patch. However, the availability of the ap_message struct has advantages which will be needed by a follow up patch. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-10-07	s390/ap/zcrypt: revisit ap and zcrypt error handling	Harald Freudenberger
	Revisit the ap queue error handling: Based on discussions and evaluatios with the firmware folk here is now a rework of the response code handling for all the AP instructions. The idea is to distinguish between failures because of some kind of invalid request where a retry does not make any sense and a failure where another attempt to send the very same request may succeed. The first case is handled by returning EINVAL to the userspace application. The second case results in retries within the zcrypt API controlled by a per message retry counter. Revisit the zcrpyt error handling: Similar here, based on discussions with the firmware people here comes a rework of the handling of all the reply codes. Main point here is that there are only very few cases left, where a zcrypt device queue is switched to offline. It should never be the case that an AP reply message is 'unknown' to the device driver as it indicates a total mismatch between device driver and crypto card firmware. In all other cases, the code distinguishes between failure because of invalid message (see above - EINVAL) or failures of the infrastructure (see above - EAGAIN). Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-10-07	s390/ap: Support AP card SCLP config and deconfig operations	Harald Freudenberger
	Support SCLP AP adapter config and deconfig operations: The sysfs deconfig attribute /sys/devices/ap/cardxx/deconfig for each AP card is now read-write. Writing in a '1' triggers a synchronous SCLP request to configure the adapter, writing in a '0' sends a synchronous SCLP deconfigure request. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-10-07	s390/sclp: Add support for SCLP AP adapter config/deconfig	Harald Freudenberger
	Add support for AP bus adapter config and deconfig to the sclp core code. The code is statically build into the kernel when ZCRYPT is configured either as module or with static support. This is the base functionality for having configure/deconfigure support in the AP bus and card code. Another patch will exploit this soon. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Suggested-by: Pierre Morel <pmorel@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-10-07	s390/ap: add card/queue deconfig state	Harald Freudenberger
	This patch adds a new config state to the ap card and queue devices. This state reflects the response code 0x03 "AP deconfigured" on TQAP invocation and is tracked with every ap bus scan. Together with this new state now a card/queue device which is 'deconfigured' is not disposed any more. However, for backward compatibility the online state now needs to take this state into account. So a card/queue is offline when the device is not configured. Furthermore a device can't get switched from offline to online state when not configured. The config state is shown in sysfs at /sys/devices/ap/cardxx/config for the card and /sys/devices/ap/cardxx/xx.yyyy/config for each queue within each card. It is a read-only attribute reflecting the negation of the 'AP deconfig' state as it is noted in the AP documents. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-10-07	s390/ap: add error response code field for ap queue devices	Harald Freudenberger
	On AP instruction failures the last response code is now kept in the struct ap_queue. There is also a new sysfs attribute showing this field (enabled only on debug kernels). Also slight rework of the AP_DBF macros to get some more content into one debug feature message line. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-10-07	s390/ap: split ap queue state machine state from device state	Harald Freudenberger
	The state machine for each ap queue covered a mixture of device states and state machine (firmware queue state) states. This patch splits the device states and the state machine states into two different enums and variables. The major state is the device state with currently these values: AP_DEV_STATE_UNINITIATED - fresh and virgin, not touched AP_DEV_STATE_OPERATING - queue dev is working normal AP_DEV_STATE_SHUTDOWN - remove/unbind/shutdown in progress AP_DEV_STATE_ERROR - device is in error state only when the device state is > UNINITIATED the state machine is run. The state machine represents the states of the firmware queue: AP_SM_STATE_RESET_START - starting point, reset (RAPQ) ap queue AP_SM_STATE_RESET_WAIT - reset triggered, waiting to be finished if irqs enabled, set up irq (AQIC) AP_SM_STATE_SETIRQ_WAIT - enable irq triggered, waiting to be finished, then go to IDLE AP_SM_STATE_IDLE - queue is operational but empty AP_SM_STATE_WORKING - queue is operational, requests are stored and replies may wait for getting fetched AP_SM_STATE_QUEUE_FULL - firmware queue is full, so only replies can get fetched For debugging each ap queue shows a sysfs attribute 'states' which displays the device and state machine state and is only available when the kernel is build with CONFIG_ZCRYPT_DEBUG enabled. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-10-07	s390/zcrypt: New config switch CONFIG_ZCRYPT_DEBUG	Harald Freudenberger
	Introduce a new config switch CONFIG_ZCRYPT_DEBUG which will be used to enable some features for debugging the zcrypt device driver and ap bus system: Another patch will use this for displaying ap card and ap queue state information via sysfs attribute. A furher patch will use this to enable some special treatment for some fields of an crypto request to be able to inject failures and so help debugging with regards to handling of failures. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-10-07	s390/zcrypt: introduce msg tracking in zcrypt functions	Harald Freudenberger
	Introduce a new internal struct zcrypt_track with an retry counter field and a last return code field. Fill and update these fields at certain points during processing of an request/reply. This tracking info is then used to - avoid trying to resend the message forever. Now each message is tried to be send TRACK_AGAIN_MAX (currently 10) times and then the ioctl returns to userspace with errno EAGAIN. - avoid trying to resend the message on the very same card/domain. If possible (more than one APQN with same quality) don't use the very same qid as the previous attempt when again scheduling the request. This is done by adding penalty weight values when the dispatching takes place. There is a penalty TRACK_AGAIN_CARD_WEIGHT_PENALTY for using the same card as previously and another penalty define TRACK_AGAIN_QUEUE_WEIGHT_PENALTY to be considered when the same qid as the previous sent attempt is calculated. Both values make it harder to choose the very same card/domain but not impossible. For example when only one APQN is available a resend can only address the very same APQN. There are some more ideas for the future to extend the use of this tracking information. For example the last response code at NQAP and DQAP could be stored there, giving the possibility to extended tracing and debugging about requests failing to get processed properly. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-10-07	io_uring: batch account ->req_issue and task struct references	Jens Axboe
	Identical to how we handle the ctx reference counts, increase by the batch we're expecting to submit, and handle any slow path residual, if any. The request alloc-and-issue path is very hot, and this makes a noticeable difference by avoiding an two atomic incs for each individual request. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-07	NFS: Decode a full READ_PLUS reply	Anna Schumaker
	Decode multiple hole and data segments sent by the server, placing everything directly where they need to go in the xdr pages. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07	SUNRPC: Add an xdr_align_data() function	Anna Schumaker
	For now, this function simply aligns the data at the beginning of the pages. This can eventually be expanded to shift data to the correct offsets when we're ready. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07	NFS: Add READ_PLUS hole segment decoding	Anna Schumaker
	We keep things simple for now by only decoding a single hole or data segment returned by the server, even if they returned more to us. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07	SUNRPC: Add the ability to expand holes in data pages	Anna Schumaker
	This patch adds the ability to "read a hole" into a set of XDR data pages by taking the following steps: 1) Shift all data after the current xdr->p to the right, possibly into the tail, 2) Zero the specified range, and 3) Update xdr->p to point beyond the hole. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07	SUNRPC: Split out _shift_data_right_tail()	Anna Schumaker
	xdr_shrink_pagelen() is very similar to what we need for hole expansion, so split out the common code into its own function that can be used by both functions. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07	SUNRPC: Split out xdr_realign_pages() from xdr_align_pages()	Anna Schumaker
	I don't need the entire align pages code for READ_PLUS, so split out the part I do need so I don't need to reimplement anything. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07	NFS: Add READ_PLUS data segment support	Anna Schumaker
	This patch adds client support for decoding a single NFS4_CONTENT_DATA segment returned by the server. This is the simplest implementation possible, since it does not account for any hole segments in the reply. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07	NFS: Use xdr_page_pos() in NFSv4 decode_getacl()	Anna Schumaker
	Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07	SUNRPC: Implement a xdr_page_pos() function	Anna Schumaker
	I'll need this for READ_PLUS to help figure out the offset where page data is stored at, but it might also be useful for other things. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07	SUNRPC: Split out a function for setting current page	Anna Schumaker
	I'm going to need this bit of code in a few places for READ_PLUS decoding, so let's make it a helper function. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07	bpf: Fix typo in uapi/linux/bpf.h	Jakub Wilk
	Reported-by: Samanta Navarro <ferivoz@riseup.net> Signed-off-by: Jakub Wilk <jwilk@jwilk.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201007055717.7319-1-jwilk@jwilk.net
2020-10-07	bpf: Fix build failure for kernel/trace/bpf_trace.c with CONFIG_NET=n	Yonghong Song
	When CONFIG_NET is not defined, I hit the following build error: kernel/trace/bpf_trace.o:(.rodata+0x110): undefined reference to `bpf_prog_test_run_raw_tp' Commit 1b4d60ec162f ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint") added test_run support for raw_tracepoint in /kernel/trace/bpf_trace.c. But the test_run function bpf_prog_test_run_raw_tp is defined in net/bpf/test_run.c, only available with CONFIG_NET=y. Adding a CONFIG_NET guard for .test_run = bpf_prog_test_run_raw_tp; fixed the above build issue. Fixes: 1b4d60ec162f ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201007062933.3425899-1-yhs@fb.com
2020-10-07	kernel/bpf/verifier: Fix build when NET is not enabled	Randy Dunlap
	Fix build errors in kernel/bpf/verifier.c when CONFIG_NET is not enabled. ../kernel/bpf/verifier.c:3995:13: error: ‘btf_sock_ids’ undeclared here (not in a function); did you mean ‘bpf_sock_ops’? .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON], ../kernel/bpf/verifier.c:3995:26: error: ‘BTF_SOCK_TYPE_SOCK_COMMON’ undeclared here (not in a function); did you mean ‘PTR_TO_SOCK_COMMON’? .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON], Fixes: 1df8f55a37bd ("bpf: Enable bpf_skc_to_* sock casting helper to networking prog type") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201007021613.13646-1-rdunlap@infradead.org
2020-10-07	dt-bindings: Explicitly allow additional properties in common schemas	Rob Herring
	In order to add meta-schema checks for additional/unevaluatedProperties being present, all schema need to make this explicit. As common/shared schema are included by other schemas, they should always allow for additionalProperties. Acked-by: Mark Brown <broonie@kernel.org> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Sebastian Reichel <sre@kernel.org> Acked-by: Chanwoo Choi <cw00.choi@samsung.com> Acked-By: Vinod Koul <vkoul@kernel.org> Acked-by: Lee Jones <lee.jones@linaro.org> Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20201005183830.486085-5-robh@kernel.org Signed-off-by: Rob Herring <robh@kernel.org>
2020-10-07	dt-bindings: Use 'additionalProperties' instead of 'unevaluatedProperties'	Rob Herring
	In cases where we don't reference another schema, 'additionalProperties' can be used instead. This is preferred for now as 'unevaluatedProperties' support isn't implemented yet. In a few cases, this means adding some missing property definitions of which most are for SPI bus properties. 'unevaluatedProperties' is not going to work for the SPI bus properties anyways as they are evaluated from the parent node, not the SPI child node. Acked-by: Mark Brown <broonie@kernel.org> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Lee Jones <lee.jones@linaro.org> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20201005183830.486085-3-robh@kernel.org Signed-off-by: Rob Herring <robh@kernel.org>
2020-10-07	dt-bindings: Add missing 'unevaluatedProperties'	Rob Herring
	This doesn't yet do anything in the tools, but make it explicit so we can check either 'unevaluatedProperties' or 'additionalProperties' is present in schemas. 'unevaluatedProperties' is appropriate when including another schema (via '$ref') and all possible properties and/or child nodes are not explicitly listed in the schema with the '$ref'. This is in preparation to add a meta-schema to check for missing 'unevaluatedProperties' or 'additionalProperties'. This has been a constant source of review issues. Acked-by: Mark Brown <broonie@kernel.org> Acked-by: Wolfram Sang <wsa@kernel.org> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-By: Vinod Koul <vkoul@kernel.org> Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org> Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Link: https://lore.kernel.org/r/20201005183830.486085-2-robh@kernel.org Signed-off-by: Rob Herring <robh@kernel.org>
2020-10-07	locking/atomics: Check atomic-arch-fallback.h too	Paul Bolle
	The sha1sum of include/linux/atomic-arch-fallback.h isn't checked by check-atomics.sh. It's not clear why it's skipped so let's check it too. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Link: https://lkml.kernel.org/r/20201001202028.1048418-1-pebolle@tiscali.nl
2020-10-07	locking/seqlock: Tweak DEFINE_SEQLOCK() kernel doc	Sebastian Andrzej Siewior
	ctags creates a warning: \|ctags: Warning: include/linux/seqlock.h:738: null expansion of name pattern "\2" The DEFINE_SEQLOCK() macro is passed to ctags and being told to expect an argument. Add a dummy argument to keep ctags quiet. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lkml.kernel.org/r/20200924154851.skmswuyj322yuz4g@linutronix.de
2020-10-07	Docs: Fixing spelling errors in Documentation/devicetree/bindings/	Marlon Rac Cambasis
	Revised patch fixing six spelling errors within Documentation/devicetree/bindings/. "specfied" replaced with "specified" in all three files modified. "atleast" seperated into "at least" three times in samsung-pinctrl.txt. This should remove any confusion that a reader might have. Signed-off-by: Marlon Rac Cambasis <marlonrc08@gmail.com> Link: https://lore.kernel.org/r/20201007071705.GA11381@marlonpc-debian Signed-off-by: Rob Herring <robh@kernel.org>
2020-10-07	x86/asm: Add an enqcmds() wrapper for the ENQCMDS instruction	Dave Jiang
	Currently, the MOVDIR64B instruction is used to atomically submit 64-byte work descriptors to devices. Although it can encounter errors like device queue full, command not accepted, device not ready, etc when writing to a device MMIO, MOVDIR64B can not report back on errors from the device itself. This means that MOVDIR64B users need to separately interact with a device to see if a descriptor was successfully queued, which slows down device interactions. ENQCMD and ENQCMDS also atomically submit 64-byte work descriptors to devices. But, they can report back errors directly from the device, such as if the device was busy, or device not enabled or does not support the command. This immediate feedback from the submission instruction itself reduces the number of interactions with the device and can greatly increase efficiency. ENQCMD can be used at any privilege level, but can effectively only submit work on behalf of the current process. ENQCMDS is a ring0-only instruction and can explicitly specify a process context instead of being tied to the current process or needing to reprogram the IA32_PASID MSR. Use ENQCMDS for work submission within the kernel because a Process Address ID (PASID) is setup to translate the kernel virtual address space. This PASID is provided to ENQCMDS from the descriptor structure submitted to the device and not retrieved from IA32_PASID MSR, which is setup for the current user address space. See Intel Software Developer’s Manual for more information on the instructions. [ bp: - Make operand constraints like movdir64b() because both insns are basically doing the same thing, more or less. - Fixup comments and cleanup. ] Link: https://lkml.kernel.org/r/20200924180041.34056-3-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20201005151126.657029-3-dave.jiang@intel.com
2020-10-07	x86/asm: Carve out a generic movdir64b() helper for general usage	Dave Jiang
	Carve out the MOVDIR64B inline asm primitive into a generic helper so that it can be used by other functions. Move it to special_insns.h and have iosubmit_cmds512() call it. [ bp: Massage commit message. ] Suggested-by: Michael Matz <matz@suse.de> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201005151126.657029-2-dave.jiang@intel.com
2020-10-07	xfs: fix the indent in xfs_trans_mod_dquot	Kaixu Xia
	The formatting is strange in xfs_trans_mod_dquot, so do a reindent. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-10-07	xfs: do the ASSERT for the arguments O_{u,g,p}dqpp	Kaixu Xia
	If we pass in XFS_QMOPT_{U,G,P}QUOTA flags and different uid/gid/prid than them currently associated with the inode, the arguments O_{u,g,p}dqpp shouldn't be NULL, so add the ASSERT for them. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-10-07	xfs: fix deadlock and streamline xfs_getfsmap performance	Darrick J. Wong
	Refactor xfs_getfsmap to improve its performance: instead of indirectly calling a function that copies one record to userspace at a time, create a shadow buffer in the kernel and copy the whole array once at the end. On the author's computer, this reduces the runtime on his /home by ~20%. This also eliminates a deadlock when running GETFSMAP against the realtime device. The current code locks the rtbitmap to create fsmappings and copies them into userspace, having not released the rtbitmap lock. If the userspace buffer is an mmap of a sparse file that itself resides on the realtime device, the write page fault will recurse into the fs for allocation, which will deadlock on the rtbitmap lock. Fixes: 4c934c7dd60c ("xfs: report realtime space information via the rtbitmap") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-10-07	xfs: limit entries returned when counting fsmap records	Darrick J. Wong
	If userspace asked fsmap to count the number of entries, we cannot return more than UINT_MAX entries because fmh_entries is u32. Therefore, stop counting if we hit this limit or else we will waste time to return truncated results. Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-10-07	xfs: only relog deferred intent items if free space in the log gets low	Darrick J. Wong
	Now that we have the ability to ask the log how far the tail needs to be pushed to maintain its free space targets, augment the decision to relog an intent item so that we only do it if the log has hit the 75% full threshold. There's no point in relogging an intent into the same checkpoint, and there's no need to relog if there's plenty of free space in the log. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07	xfs: expose the log push threshold	Darrick J. Wong
	Separate the computation of the log push threshold and the push logic in xlog_grant_push_ail. This enables higher level code to determine (for example) that it is holding on to a logged intent item and the log is so busy that it is more than 75% full. In that case, it would be desirable to move the log item towards the head to release the tail, which we will cover in the next patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07	xfs: periodically relog deferred intent items	Darrick J. Wong
	There's a subtle design flaw in the deferred log item code that can lead to pinning the log tail. Taking up the defer ops chain examples from the previous commit, we can get trapped in sequences like this: Caller hands us a transaction t0 with D0-D3 attached. The defer ops chain will look like the following if the transaction rolls succeed: t1: D0(t0), D1(t0), D2(t0), D3(t0) t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0) t3: d5(t1), D1(t0), D2(t0), D3(t0) ... t9: d9(t7), D3(t0) t10: D3(t0) t11: d10(t10), d11(t10) t12: d11(t10) In transaction 9, we finish d9 and try to roll to t10 while holding onto an intent item for D3 that we logged in t0. The previous commit changed the order in which we place new defer ops in the defer ops processing chain to reduce the maximum chain length. Now make xfs_defer_finish_noroll capable of relogging the entire chain periodically so that we can always move the log tail forward. Most chains will never get relogged, except for operations that generate very long chains (large extents containing many blocks with different sharing levels) or are on filesystems with small logs and a lot of ongoing metadata updates. Callers are now required to ensure that the transaction reservation is large enough to handle logging done items and new intent items for the maximum possible chain length. Most callers are careful to keep the chain lengths low, so the overhead should be minimal. The decision to relog an intent item is made based on whether the intent was logged in a previous checkpoint, since there's no point in relogging an intent into the same checkpoint. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07	xfs: change the order in which child and parent defer ops are finished	Darrick J. Wong
	The defer ops code has been finishing items in the wrong order -- if a top level defer op creates items A and B, and finishing item A creates more defer ops A1 and A2, we'll put the new items on the end of the chain and process them in the order A B A1 A2. This is kind of weird, since it's convenient for programmers to be able to think of A and B as an ordered sequence where all the sub-tasks for A must finish before we move on to B, e.g. A A1 A2 D. Right now, our log intent items are not so complex that this matters, but this will become important for the atomic extent swapping patchset. In order to maintain correct reference counting of extents, we have to unmap and remap extents in that order, and we want to complete that work before moving on to the next range that the user wants to swap. This patch fixes defer ops to satsify that requirement. The primary symptom of the incorrect order was noticed in an early performance analysis of the atomic extent swap code. An astonishingly large number of deferred work items accumulated when userspace requested an atomic update of two very fragmented files. The cause of this was traced to the same ordering bug in the inner loop of xfs_defer_finish_noroll. If the ->finish_item method of a deferred operation queues new deferred operations, those new deferred ops are appended to the tail of the pending work list. To illustrate, say that a caller creates a transaction t0 with four deferred operations D0-D3. The first thing defer ops does is roll the transaction to t1, leaving us with: t1: D0(t0), D1(t0), D2(t0), D3(t0) Let's say that finishing each of D0-D3 will create two new deferred ops. After finish D0 and roll, we'll have the following chain: t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1) d4 and d5 were logged to t1. Notice that while we're about to start work on D1, we haven't actually completed all the work implied by D0 being finished. So far we've been careful (or lucky) to structure the dfops callers such that D1 doesn't depend on d4 or d5 being finished, but this is a potential logic bomb. There's a second problem lurking. Let's see what happens as we finish D1-D3: t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2) t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3) t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) Let's say that d4-d11 are simple work items that don't queue any other operations, which means that we can complete each d4 and roll to t6: t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) ... t11: d10(t4), d11(t4) t12: d11(t4) <done> When we try to roll to transaction #12, we're holding defer op d11, which we logged way back in t4. This means that the tail of the log is pinned at t4. If the log is very small or there are a lot of other threads updating metadata, this means that we might have wrapped the log and cannot get roll to t11 because there isn't enough space left before we'd run into t4. Let's shift back to the original failure. I mentioned before that I discovered this flaw while developing the atomic file update code. In that scenario, we have a defer op (D0) that finds a range of file blocks to remap, creates a handful of new defer ops to do that, and then asks to be continued with however much work remains. So, D0 is the original swapext deferred op. The first thing defer ops does is rolls to t1: t1: D0(t0) We try to finish D0, logging d1 and d2 in the process, but can't get all the work done. We log a done item and a new intent item for the work that D0 still has to do, and roll to t2: t2: D0'(t1), d1(t1), d2(t1) We roll and try to finish D0', but still can't get all the work done, so we log a done item and a new intent item for it, requeue D0 a second time, and roll to t3: t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2) If it takes 48 more rolls to complete D0, then we'll finally dispense with D0 in t50: t50: D<fifty primes>(t49), d1(t1), ..., d102(t50) We then try to roll again to get a chain like this: t51: d1(t1), d2(t1), ..., d101(t50), d102(t50) ... t152: d102(t50) <done> Notice that in rolling to transaction #51, we're holding on to a log intent item for d1 that was logged in transaction #1. This means that the tail of the log is pinned at t1. If the log is very small or there are a lot of other threads updating metadata, this means that we might have wrapped the log and cannot roll to t51 because there isn't enough space left before we'd run into t1. This is of course problem #2 again. But notice the third problem with this scenario: we have 102 defer ops tied to this transaction! Each of these items are backed by pinned kernel memory, which means that we risk OOM if the chains get too long. Yikes. Problem #1 is a subtle logic bomb that could hit someone in the future; problem #2 applies (rarely) to the current upstream, and problem #3 applies to work under development. This is not how incremental deferred operations were supposed to work. The dfops design of logging in the same transaction an intent-done item and a new intent item for the work remaining was to make it so that we only have to juggle enough deferred work items to finish that one small piece of work. Deferred log item recovery will find that first unfinished work item and restart it, no matter how many other intent items might follow it in the log. Therefore, it's ok to put the new intents at the start of the dfops chain. For the first example, the chains look like this: t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0) t3: d5(t1), D1(t0), D2(t0), D3(t0) ... t9: d9(t7), D3(t0) t10: D3(t0) t11: d10(t10), d11(t10) t12: d11(t10) For the second example, the chains look like this: t1: D0(t0) t2: d1(t1), d2(t1), D0'(t1) t3: d2(t1), D0'(t1) t4: D0'(t1) t5: d1(t4), d2(t4), D0''(t4) ... t148: D0<50 primes>(t147) t149: d101(t148), d102(t148) t150: d102(t148) <done> This actually sucks more for pinning the log tail (we try to roll to t10 while holding an intent item that was logged in t1) but we've solved problem #1. We've also reduced the maximum chain length from: sum(all the new items) + nr_original_items to: max(new items that each original item creates) + nr_original_items This solves problem #3 by sharply reducing the number of defer ops that can be attached to a transaction at any given time. The change makes the problem of log tail pinning worse, but is improvement we need to solve problem #2. Actually solving #2, however, is left to the next patch. Note that a subsequent analysis of some hard-to-trigger reflink and COW livelocks on extremely fragmented filesystems (or systems running a lot of IO threads) showed the same symptoms -- uncomfortably large numbers of incore deferred work items and occasional stalls in the transaction grant code while waiting for log reservations. I think this patch and the next one will also solve these problems. As originally written, the code used list_splice_tail_init instead of list_splice_init, so change that, and leave a short comment explaining our actions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07	xfs: fix an incore inode UAF in xfs_bui_recover	Darrick J. Wong
	In xfs_bui_item_recover, there exists a use-after-free bug with regards to the inode that is involved in the bmap replay operation. If the mapping operation does not complete, we call xfs_bmap_unmap_extent to create a deferred op to finish the unmapping work, and we retain a pointer to the incore inode. Unfortunately, the very next thing we do is commit the transaction and drop the inode. If reclaim tears down the inode before we try to finish the defer ops, we dereference garbage and blow up. Therefore, create a way to join inodes to the defer ops freezer so that we can maintain the xfs_inode reference until we're done with the inode. Note: This imposes the requirement that there be enough memory to keep every incore inode in memory throughout recovery. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07	xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering	Darrick J. Wong
	In most places in XFS, we have a specific order in which we gather resources: grab the inode, allocate a transaction, then lock the inode. xfs_bui_item_recover doesn't do it in that order, so fix it to be more consistent. This also makes the error bailout code a bit less weird. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07	xfs: clean up bmap intent item recovery checking	Darrick J. Wong
	The bmap intent item checking code in xfs_bui_item_recover is spread all over the function. We should check the recovered log item at the top before we allocate any resources or do anything else, so do that. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07	xfs: xfs_defer_capture should absorb remaining transaction reservation	Darrick J. Wong
	When xfs_defer_capture extracts the deferred ops and transaction state from a transaction, it should record the transaction reservation type from the old transaction so that when we continue the dfops chain, we still use the same reservation parameters. Doing this means that the log item recovery functions get to determine the transaction reservation instead of abusing tr_itruncate in yet another part of xfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07	xfs: xfs_defer_capture should absorb remaining block reservations	Darrick J. Wong
	When xfs_defer_capture extracts the deferred ops and transaction state from a transaction, it should record the remaining block reservations so that when we continue the dfops chain, we can reserve the same number of blocks to use. We capture the reservations for both data and realtime volumes. This adds the requirement that every log intent item recovery function must be careful to reserve enough blocks to handle both itself and all defer ops that it can queue. On the other hand, this enables us to do away with the handwaving block estimation nonsense that was going on in xlog_finish_defer_ops. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07	xfs: proper replay of deferred ops queued during log recovery	Darrick J. Wong
	When we replay unfinished intent items that have been recovered from the log, it's possible that the replay will cause the creation of more deferred work items. As outlined in commit 509955823cc9c ("xfs: log recovery should replay deferred ops in order"), later work items have an implicit ordering dependency on earlier work items. Therefore, recovery must replay the items (both recovered and created) in the same order that they would have been during normal operation. For log recovery, we enforce this ordering by using an empty transaction to collect deferred ops that get created in the process of recovering a log intent item to prevent them from being committed before the rest of the recovered intent items. After we finish committing all the recovered log items, we allocate a transaction with an enormous block reservation, splice our huge list of created deferred ops into that transaction, and commit it, thereby finishing all those ops. This is /really/ hokey -- it's the one place in XFS where we allow nested transactions; the splicing of the defer ops list is is inelegant and has to be done twice per recovery function; and the broken way we handle inode pointers and block reservations cause subtle use-after-free and allocator problems that will be fixed by this patch and the two patches after it. Therefore, replace the hokey empty transaction with a structure designed to capture each chain of deferred ops that are created as part of recovering a single unfinished log intent. Finally, refactor the loop that replays those chains to do so using one transaction per chain. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07	xfs: remove XFS_LI_RECOVERED	Darrick J. Wong
	The ->iop_recover method of a log intent item removes the recovered intent item from the AIL by logging an intent done item and committing the transaction, so it's superfluous to have this flag check. Nothing else uses it, so get rid of the flag entirely. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07	xfs: remove xfs_defer_reset	Darrick J. Wong
	Remove this one-line helper since the assert is trivially true in one call site and the rest obscures a bitmask operation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>