summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2021-01-26objtool: Combine UNWIND_HINT_RET_OFFSET and UNWIND_HINT_FUNCJosh Poimboeuf
The ORC metadata generated for UNWIND_HINT_FUNC isn't actually very func-like. With certain usages it can cause stack state mismatches because it doesn't set the return address (CFI_RA). Also, users of UNWIND_HINT_RET_OFFSET no longer need to set a custom return stack offset. Instead they just need to specify a func-like situation, so the current ret_offset code is hacky for no good reason. Solve both problems by simplifying the RET_OFFSET handling and converting it into a more useful UNWIND_HINT_FUNC. If we end up needing the old 'ret_offset' functionality again in the future, we should be able to support it pretty easily with the addition of a custom 'sp_offset' in UNWIND_HINT_FUNC. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/db9d1f5d79dddfbb3725ef6d8ec3477ad199948d.1611263462.git.jpoimboe@redhat.com
2021-01-26objtool: Add asm version of STACK_FRAME_NON_STANDARDJosh Poimboeuf
To be used for adding asm functions to the ignore list. The "aw" is needed to help the ELF section metadata match GCC-created sections. Otherwise the linker creates duplicate sections instead of combining them. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/8faa476f9a5ac89af27944ec184c89f95f3c6c49.1611263462.git.jpoimboe@redhat.com
2021-01-26Merge tag 'v5.11-rc2' into develLinus Walleij
Linux 5.11-rc2
2021-01-26HID: correct kernel-doc notation in <linux/hid*.h>Randy Dunlap
Correct kernel-doc notation in HID header files (include/linux/hid*.h). Add notation (comments) where it is missing. Use the documented "Return:" notation for function return values. Fix a few typos/spellos. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com> Cc: linux-input@vger.kernel.org Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2021-01-26isa: Make the remove callback for isa drivers return voidUwe Kleine-König
The driver core ignores the return value of the remove callback, so don't give isa drivers the chance to provide a value. Adapt all isa_drivers with a remove callbacks accordingly; they all return 0 unconditionally anyhow. Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for drivers/net/can/sja1000/tscan1.c Acked-by: William Breathitt Gray <vilhelm.gray@gmail.com> Acked-by: Wolfram Sang <wsa@kernel.org> # for drivers/i2c/ Reviewed-by: Takashi Iway <tiwai@suse.de> # for sound/ Reviewed-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> # for drivers/media/ Signed-off-by: Uwe Kleine-König <uwe@kleine-koenig.org> Link: https://lore.kernel.org/r/20210122092449.426097-4-uwe@kleine-koenig.org Signed-off-by: Takashi Iwai <tiwai@suse.de>
2021-01-25netfilter: ctnetlink: remove get_ct indirectionFlorian Westphal
Use nf_ct_get() directly, its a small inline helper without dependencies. Add CONFIG_NF_CONNTRACK guards to elide the relevant part when conntrack isn't available at all. v2: add ifdef guard around nf_ct_get call (kernel test robot) Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-25SUNRPC: Move simple_get_bytes and simple_get_netobj into private headerDave Wysochanski
Remove duplicated helper functions to parse opaque XDR objects and place inside new file net/sunrpc/auth_gss/auth_gss_internal.h. In the new file carry the license and copyright from the source file net/sunrpc/auth_gss/auth_gss.c. Finally, update the comment inside include/linux/sunrpc/xdr.h since lockd is not the only user of struct xdr_netobj. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-25Commit 9bb48c82aced ("tty: implement write_iter") converted the ttySami Tolvanen
layer to use write_iter. Fix the redirected_tty_write declaration also in n_tty and change the comparisons to use write_iter instead of write. [ Also moved the declaration of redirected_tty_write() to the proper location in a header file. The reason for the bug was the bogus extern declaration in n_tty.c silently not matching the changed definition in tty_io.c, and because it wasn't in a shared header file, there was no cross-checking of the declaration. Sami noticed because Clang's Control Flow Integrity checking ended up incidentally noticing the inconsistent declaration. - Linus ] Fixes: 9bb48c82aced ("tty: implement write_iter") Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-25reset: Add devm_reset_control_get_optional_exclusive_released()Dmitry Osipenko
NVIDIA Tegra DRM and media drivers will need a resource-managed-optional variant of reset_control_get_exclusive_released() in order to switch away from a legacy Tegra-specific PD API to a GENPD API without much hassle. Add the new reset helper to the reset API. Tested-by: Peter Geis <pgwipeout@gmail.com> # Ouya T30 Tested-by: Nicolas Chauvet <kwizart@gmail.com> # PAZ00 T20 Tested-by: Matt Merhar <mattmerhar@protonmail.com> # Ouya T30 Signed-off-by: Dmitry Osipenko <digetx@gmail.com> Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
2021-01-25ACPI: platform-profile: Drop const qualifier for cur_profileJiaxun Yang
Drop the const qualifier from the static global cur_profile pointer declaration. This is a preparation patch for passing the cur_profile pointer as parameter to the profile_get() and profile_set() callbacks so that drivers dynamically allocating their driver-data struct, with their platform_profile_handler struct embedded, can use this pointer to get to their driver-data. Note this also requires dropping the const from the pprof platform_profile_register() function argument. Dropping this const is not a problem, non of the queued up consumers of platform_profile_register() actually pass in a const pointer. Link: https://lore.kernel.org/linux-acpi/5e7a4d87-52ef-e487-9cc2-8e7094beaa08@redhat.com/ Link: https://lore.kernel.org/r/20210114073429.176462-2-jiaxun.yang@flygoat.com Suggested-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com> [ hdegoede@redhat.com: Also remove const from platform_profile_register() ] Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-01-25bio: don't copy bvec for direct IOPavel Begunkov
The block layer spends quite a while in blkdev_direct_IO() to copy and initialise bio's bvec. However, if we've already got a bvec in the input iterator it might be reused in some cases, i.e. when new ITER_BVEC_FLAG_FIXED flag is set. Simple tests show considerable performance boost, and it also reduces memory footprint. Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25bio: add a helper calculating nr segments to allocPavel Begunkov
Add a helper function calculating the number of bvec segments we need to allocate to construct a bio. It doesn't change anything functionally, but will be used to not duplicate special cases in the future. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25svcrdma: Reduce Receive doorbell rateChuck Lever
This is similar to commit e340c2d6ef2a ("xprtrdma: Reduce the doorbell rate (Receive)") which added Receive batching to the client. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25svcrdma: Deprecate stat variables that are no longer usedChuck Lever
Clean up. We are not permitted to remove old proc files. Instead, convert these variables to stubs that are only ever allowed to display a value of zero. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25svcrdma: Restore read and write statsChuck Lever
Now that we have an efficient mechanism to update these two stats, let's start maintaining them again. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25svcrdma: Convert rdma_stat_sq_starve to a per-CPU counterChuck Lever
Avoid the overhead of a memory bus lock cycle for counting a value that is hardly every used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25svcrdma: Convert rdma_stat_recv to a per-CPU counterChuck Lever
Receives are frequent events. Avoid the overhead of a memory bus lock cycle for counting a value that is hardly every used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25NFSD: Add an xdr_stream-based decoder for NFSv2/3 ACLsChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25SUNRPC: Move definition of XDR_UNITChuck Lever
Clean up: The unit of XDR alignment is defined by RFC 4506, not as part of the RPC message header. Thus it belongs in include/linux/sunrpc/xdr.h. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25SUNRPC: Make trace_svc_process() display the RPC procedure symbolicallyChuck Lever
The next few patches will employ these strings to help make server- side trace logs more human-readable. A similar technique is already in use in kernel RPC client code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25usb: typec: tcpm: Create legacy PDOs for PD2 connectionKyle Tso
If the port partner is PD2, the PDOs of the local port should follow the format defined in PD2 Spec. Dynamically modify the pre-defined PD3 PDOs and transform them into PD2 format before sending them to the PD2 port partner. Reviewed-by: Guenter Roeck <linux@roeckus.net> Signed-off-by: Kyle Tso <kyletso@google.com> Link: https://lore.kernel.org/r/20210115163311.391332-1-kyletso@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-25Merge v5.11-rc5 into usb-nextGreg Kroah-Hartman
We need the fixes in here and this resolves a merge issue with drivers/usb/gadget/udc/bdc/Kconfig. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-25Merge 5.11-rc5 into tty-nextGreg Kroah-Hartman
We need the fixes in here and this resolves a merge issue in drivers/tty/tty_io.c Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-25Merge 5.11-rc5 into staging-nextGreg Kroah-Hartman
We need the IIO/Staging fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-24block: remove unnecessary argument from blk_execute_rqGuoqing Jiang
We can remove 'q' from blk_execute_rq as well after the previous change in blk_execute_rq_nowait. And more importantly it never really was needed to start with given that we can trivial derive it from struct request. Cc: linux-scsi@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Cc: linux-ide@vger.kernel.org Cc: linux-mmc@vger.kernel.org Cc: linux-nvme@lists.infradead.org Cc: linux-nfs@vger.kernel.org Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: remove unnecessary argument from blk_execute_rq_nowaitGuoqing Jiang
The 'q' is not used since commit a1ce35fa4985 ("block: remove dead elevator code"), also update the comment of the function. And more importantly it never really was needed to start with given that we can trivial derive it from struct request. Cc: target-devel@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Cc: linux-ide@vger.kernel.org Cc: linux-mmc@vger.kernel.org Cc: linux-nvme@lists.infradead.org Cc: linux-nfs@vger.kernel.org Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25Merge tag 'v5.11-rc5' of ↵Dave Airlie
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into drm-next Backmerge v5.11-rc5 into drm-next to clean up a bunch of conflicts we are dragging around. Signed-off-by: Dave Airlie <airlied@redhat.com>
2021-01-24block: move three bvec helpers declaration into private helperMing Lei
bvec_alloc(), bvec_free() and bvec_nr_vecs() are only used inside block layer core functions, no need to declare them in public header. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: don't allocate inline bvecs if this bioset needn't bvecsMing Lei
The inline bvecs won't be used if user needn't bvecs by not passing BIOSET_NEED_BVECS, so don't allocate bvecs in this situation. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24blk-mq: Improve performance of non-mq IO schedulers with multiple HW queuesJan Kara
Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for a queue with multiple HW queues, the performance it rather bad. The problem is that these IO schedulers use queue-wide locking and their dispatch function does not respect the hctx it is passed in and returns any request it finds appropriate. Thus locality of request access is broken and dispatch from multiple CPUs just contends on IO scheduler locks. For these IO schedulers there's little point in dispatching from multiple CPUs. Instead dispatch always only from a single CPU to limit contention. Below is a comparison of dbench runs on XFS filesystem where the storage is a raid card with 64 HW queues and to it attached a single rotating disk. BFQ is used as IO scheduler: clients MQ SQ MQ-Patched Amean 1 39.12 (0.00%) 43.29 * -10.67%* 36.09 * 7.74%* Amean 2 128.58 (0.00%) 101.30 * 21.22%* 96.14 * 25.23%* Amean 4 577.42 (0.00%) 494.47 * 14.37%* 508.49 * 11.94%* Amean 8 610.95 (0.00%) 363.86 * 40.44%* 362.12 * 40.73%* Amean 16 391.78 (0.00%) 261.49 * 33.25%* 282.94 * 27.78%* Amean 32 324.64 (0.00%) 267.71 * 17.54%* 233.00 * 28.23%* Amean 64 295.04 (0.00%) 253.02 * 14.24%* 242.37 * 17.85%* Amean 512 10281.61 (0.00%) 10211.16 * 0.69%* 10447.53 * -1.61%* Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is the same kernel with megaraid_sas.host_tagset_enable=0 so that the card advertises just a single HW queue. MQ-Patched is a kernel with this patch applied. You can see multiple hardware queues heavily hurt performance in combination with BFQ. The patch restores the performance. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24Revert "blk-mq, elevator: Count requests per hctx to improve performance"Jan Kara
This reverts commit b445547ec1bbd3e7bf4b1c142550942f70527d95. Since both mq-deadline and BFQ completely ignore hctx they are passed to their dispatch function and dispatch whatever request they deem fit checking whether any request for a particular hctx is queued is just pointless since we'll very likely get a request from a different hctx anyway. In the following commit we'll deal with lock contention in these IO schedulers in presence of multiple HW queues in a different way. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: use an xarray for disk->part_tblChristoph Hellwig
Now that no fast path lookups in the partition table are left, there is no point in micro-optimizing the data structure for it. Just use a bog standard xarray. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: remove DISK_PITER_REVERSEChristoph Hellwig
There is good reason to iterate backwards when deleting all partitions in del_gendisk, just like we don't in blk_drop_partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: add a disk_uevent helperChristoph Hellwig
Add a helper to call kobject_uevent for the disk and all partitions, and unexport the disk_part_iter_* helpers that are now only used in the core block code. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: use ->bi_bdev for bio based I/O accountingChristoph Hellwig
Rework the I/O accounting for bio based drivers to use ->bi_bdev. This means all drivers can now simply use bio_start_io_acct to start accounting, and it will take partitions into account automatically. To end I/O account either bio_end_io_acct can be used if the driver never remaps I/O to a different device, or bio_end_io_acct_remapped if the driver did remap the I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: do not reassig ->bi_bdev when partition remappingChristoph Hellwig
There is no good reason to reassign ->bi_bdev when remapping the partition-relative block number to the device wide one, as all the information required by the drivers comes from the gendisk anyway. Keeping the original ->bi_bdev alive will allow to greatly simplify the partition-away I/O accounting. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: store a block_device pointer in struct bioChristoph Hellwig
Replace the gendisk pointer in struct bio with a pointer to the newly improved struct block device. From that the gendisk can be trivially accessed with an extra indirection, but it also allows to directly look up all information related to partition remapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: add a hard-readonly flag to struct gendiskChristoph Hellwig
Commit 20bd1d026aac ("scsi: sd: Keep disk read-only when re-reading partition") addressed a long-standing problem with user read-only policy being overridden as a result of a device-initiated revalidate. The commit has since been reverted due to a regression that left some USB devices read-only indefinitely. To fix the underlying problems with revalidate we need to keep track of hardware state and user policy separately. The gendisk has been updated to reflect the current hardware state set by the device driver. This is done to allow returning the device to the hardware state once the user clears the BLKROSET flag. The resulting semantics are as follows: - If BLKROSET sets a given partition read-only, that partition will remain read-only even if the underlying storage stack initiates a revalidate. However, the BLKRRPART ioctl will cause the partition table to be dropped and any user policy on partitions will be lost. - If BLKROSET has not been set, both the whole disk device and any partitions will reflect the current write-protect state of the underlying device. Based on a patch from Martin K. Petersen <martin.petersen@oracle.com>. Reported-by: Oleksii Kurochko <olkuroch@cisco.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201221 Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24Merge tag 'block-5.11-2021-01-24' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request from Christoph: - fix a status code in nvmet (Chaitanya Kulkarni) - avoid double completions in nvme-rdma/nvme-tcp (Chao Leng) - fix the CMB support to cope with NVMe 1.4 controllers (Klaus Jensen) - fix PRINFO handling in the passthrough ioctl (Revanth Rajashekar) - fix a double DMA unmap in nvme-pci - lightnvm error path leak fix (Pan) - MD pull request from Song: - Flush request fix (Xiao) * tag 'block-5.11-2021-01-24' of git://git.kernel.dk/linux-block: lightnvm: fix memory leak when submit fails nvme-pci: fix error unwind in nvme_map_data nvme-pci: refactor nvme_unmap_data md: Set prev_flush_start and flush_bio in an atomic way nvmet: set right status on error in id-ns handler nvme-pci: allow use of cmb on v1.4 controllers nvme-tcp: avoid request double completion for concurrent nvme_tcp_timeout nvme-rdma: avoid request double completion for concurrent nvme_rdma_timeout nvme: check the PRINFO bit before deciding the host buffer length
2021-01-24Merge tag 'driver-core-5.11-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg KH: "Here are some small driver core fixes for 5.11-rc5 that resolve some reported problems: - revert of a -rc1 patch that was causing problems with some machines - device link device name collision problem fix (busses only have to name devices unique to their bus, not unique to all busses) - kernfs splice bugfixes to resolve firmware loading problems for Qualcomm systems. - other tiny driver core fixes for minor issues reported. All of these have been in linux-next with no reported problems" * tag 'driver-core-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: driver core: Fix device link device name collision driver core: Extend device_is_dependent() kernfs: wire up ->splice_read and ->splice_write kernfs: implement ->write_iter kernfs: implement ->read_iter Revert "driver core: Reorder devices on successful probe" Driver core: platform: Add extra error check in devm_platform_get_irqs_affinity() drivers core: Free dma_range_map when driver probe failed
2021-01-24Merge tag 'sched_urgent_for_v5.11_rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Correct the marking of kthreads which are supposed to run on a specific, single CPU vs such which are affine to only one CPU, mark per-cpu workqueue threads as such and make sure that marking "survives" CPU hotplug. Fix CPU hotplug issues with such kthreads. - A fix to not push away tasks on CPUs coming online. - Have workqueue CPU hotplug code use cpu_possible_mask when breaking affinity on CPU offlining so that pending workers can finish on newly arrived onlined CPUs too. - Dump tasks which haven't vacated a CPU which is currently being unplugged. - Register a special scale invariance callback which gets called on resume from RAM to read out APERF/MPERF after resume and thus make the schedutil scaling governor more precise. * tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Relax the set_cpus_allowed_ptr() semantics sched: Fix CPU hotplug / tighten is_per_cpu_kthread() sched: Prepare to use balance_push in ttwu() workqueue: Restrict affinity change to rescuer workqueue: Tag bound workers with KTHREAD_IS_PER_CPU kthread: Extract KTHREAD_IS_PER_CPU sched: Don't run cpu-online with balance_push() enabled workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity sched/core: Print out straggler tasks in sched_cpu_dying() x86: PM: Register syscore_ops for scale invariance
2021-01-24Merge tag 'timers_urgent_for_v5.11_rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Borislav Petkov: - Fix an integer overflow in the NTP RTC synchronization which led to the latter happening every 2 seconds instead of the intended every 11 minutes. - Get rid of now unused get_seconds(). * tag 'timers_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: ntp: Fix RTC synchronization on 32-bit platforms timekeeping: Remove unused get_seconds()
2021-01-24fs: introduce MOUNT_ATTR_IDMAPChristian Brauner
Introduce a new mount bind mount property to allow idmapping mounts. The MOUNT_ATTR_IDMAP flag can be set via the new mount_setattr() syscall together with a file descriptor referring to a user namespace. The user namespace referenced by the namespace file descriptor will be attached to the bind mount. All interactions with the filesystem going through that mount will be mapped according to the mapping specified in the user namespace attached to it. Using user namespaces to mark mounts means we can reuse all the existing infrastructure in the kernel that already exists to handle idmappings and can also use this for permission checking to allow unprivileged user to create idmapped mounts in the future. Idmapping a mount is decoupled from the caller's user and mount namespace. This means idmapped mounts can be created in the initial user namespace which is an important use-case for systemd-homed, portable usb-sticks between systems, sharing data between the initial user namespace and unprivileged containers, and other use-cases that have been brought up. For example, assume a home directory where all files are owned by uid and gid 1000 and the home directory is brought to a new laptop where the user has id 12345. The system administrator can simply create a mount of this home directory with a mapping of 1000:12345:1 and other mappings to indicate the ids should be kept. (With this it is e.g. also possible to create idmapped mounts on the host with an identity mapping 1:1:100000 where the root user is not mapped. A user with root access that e.g. has been pivot rooted into such a mount on the host will be not be able to execute, read, write, or create files as root.) Given that mapping a mount is decoupled from the caller's user namespace a sufficiently privileged process such as a container manager can set up an idmapped mount for the container and the container can simply pivot root to it. There's no need for the container to do anything. The mount will appear correctly mapped independent of the user namespace the container uses. This means we don't need to mark a mount as idmappable. In order to create an idmapped mount the caller must currently be privileged in the user namespace of the superblock the mount belongs to. Once a mount has been idmapped we don't allow it to change its mapping. This keeps permission checking and life-cycle management simple. Users wanting to change the idmapped can always create a new detached mount with a different idmapping. Link: https://lore.kernel.org/r/20210121131959.646623-36-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Mauricio Vásquez Bernal <mauricio@kinvolk.io> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24fs: add mount_setattr()Christian Brauner
This implements the missing mount_setattr() syscall. While the new mount api allows to change the properties of a superblock there is currently no way to change the properties of a mount or a mount tree using file descriptors which the new mount api is based on. In addition the old mount api has the restriction that mount options cannot be applied recursively. This hasn't changed since changing mount options on a per-mount basis was implemented in [1] and has been a frequent request not just for convenience but also for security reasons. The legacy mount syscall is unable to accommodate this behavior without introducing a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND | MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost mount. Changing MS_REC to apply to the whole mount tree would mean introducing a significant uapi change and would likely cause significant regressions. The new mount_setattr() syscall allows to recursively clear and set mount options in one shot. Multiple calls to change mount options requesting the same changes are idempotent: int mount_setattr(int dfd, const char *path, unsigned flags, struct mount_attr *uattr, size_t usize); Flags to modify path resolution behavior are specified in the @flags argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW, and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to restrict path resolution as introduced with openat2() might be supported in the future. The mount_setattr() syscall can be expected to grow over time and is designed with extensibility in mind. It follows the extensible syscall pattern we have used with other syscalls such as openat2(), clone3(), sched_{set,get}attr(), and others. The set of mount options is passed in the uapi struct mount_attr which currently has the following layout: struct mount_attr { __u64 attr_set; __u64 attr_clr; __u64 propagation; __u64 userns_fd; }; The @attr_set and @attr_clr members are used to clear and set mount options. This way a user can e.g. request that a set of flags is to be raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in @attr_set while at the same time requesting that another set of flags is to be lowered such as removing noexec from a mount tree by specifying MOUNT_ATTR_NOEXEC in @attr_clr. Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0, not a bitmap, users wanting to transition to a different atime setting cannot simply specify the atime setting in @attr_set, but must also specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in @attr_clr. The @propagation field lets callers specify the propagation type of a mount tree. Propagation is a single property that has four different settings and as such is not really a flag argument but an enum. Specifically, it would be unclear what setting and clearing propagation settings in combination would amount to. The legacy mount() syscall thus forbids the combination of multiple propagation settings too. The goal is to keep the semantics of mount propagation somewhat simple as they are overly complex as it is. The @userns_fd field lets user specify a user namespace whose idmapping becomes the idmapping of the mount. This is implemented and explained in detail in the next patch. [1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount") Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-api@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24ima: handle idmapped mountsChristian Brauner
IMA does sometimes access the inode's i_uid and compares it against the rules' fowner. Enable IMA to handle idmapped mounts by passing down the mount's user namespace. We simply make use of the helpers we introduced before. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-27-christian.brauner@ubuntu.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24fs: make helpers idmap mount awareChristian Brauner
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24open: handle idmapped mounts in do_truncate()Christian Brauner
When truncating files the vfs will verify that the caller is privileged over the inode. Extend it to handle idmapped mounts. If the inode is accessed through an idmapped mount it is mapped according to the mount's user namespace. Afterwards the permissions checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-16-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: prepare for idmapped mountsChristian Brauner
The various vfs_*() helpers are called by filesystems or by the vfs itself to perform core operations such as create, link, mkdir, mknod, rename, rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace and pass it down. Afterwards the checks and operations are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: introduce struct renamedataChristian Brauner
In order to handle idmapped mounts we will extend the vfs rename helper to take two new arguments in follow up patches. Since this operations already takes a bunch of arguments add a simple struct renamedata and make the current helper use it before we extend it. Link: https://lore.kernel.org/r/20210121131959.646623-14-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: handle idmapped mounts in may_*() helpersChristian Brauner
The may_follow_link(), may_linkat(), may_lookup(), may_open(), may_o_create(), may_create_in_sticky(), may_delete(), and may_create() helpers determine whether the caller is privileged enough to perform the associated operations. Let them handle idmapped mounts by mapping the inode or fsids according to the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. The patch takes care to retrieve the mount's user namespace right before performing permission checks and passing it down into the fileystem so the user namespace can't change in between by someone idmapping a mount that is currently not idmapped. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>