summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2022-09-30block: add blk_rq_map_user_ioAnuj Gupta
Create a helper blk_rq_map_user_io for mapping of vectored as well as non-vectored requests. This will help in saving dupilcation of code at few places in scsi and nvme. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-4-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30io_uring: introduce fixed buffer support for io_uring_cmdAnuj Gupta
Add IORING_URING_CMD_FIXED flag that is to be used for sending io_uring command with previously registered buffers. User-space passes the buffer index in sqe->buf_index, same as done in read/write variants that uses fixed buffers. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220930062749.152261-3-anuj20.g@samsung.com [axboe: shuffle valid flags check before acting on it] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30io_uring: add io_uring_cmd_import_fixedAnuj Gupta
This is a new helper that callers can use to obtain a bvec iterator for the previously mapped buffer. This is preparatory work to enable fixed-buffer support for io_uring_cmd. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220930062749.152261-2-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30block: allow end_io based requests in the completion batch handlingJens Axboe
With end_io handlers now being able to potentially pass ownership of the request upon completion, we can allow requests with end_io handlers in the batch completion handling. Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Co-developed-by: Stefan Roesch <shr@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30block: change request end_io handler to pass back a return valueJens Axboe
Everything is just converted to returning RQ_END_IO_NONE, and there should be no functional changes with this patch. In preparation for allowing the end_io handler to pass ownership back to the block layer, rather than retain ownership of the request. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30Merge branch 'for-6.1/io_uring' into for-6.1/passthroughJens Axboe
* for-6.1/io_uring: (56 commits) io_uring/net: fix notif cqe reordering io_uring/net: don't update msg_name if not provided io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL io_uring/rw: defer fsnotify calls to task context io_uring/net: fix fast_iov assignment in io_setup_async_msg() io_uring/net: fix non-zc send with address io_uring/net: don't skip notifs for failed requests io_uring/rw: don't lose short results on io_setup_async_rw() io_uring/rw: fix unexpected link breakage io_uring/net: fix cleanup double free free_iov init io_uring: fix CQE reordering io_uring/net: fix UAF in io_sendrecv_fail() selftest/net: adjust io_uring sendzc notif handling io_uring: ensure local task_work marks task as running io_uring/net: zerocopy sendmsg io_uring/net: combine fail handlers io_uring/net: rename io_sendzc() io_uring/net: support non-zerocopy sendto io_uring/net: refactor io_setup_async_addr io_uring/net: don't lose partial send_zc on fail ...
2022-09-30Merge branch 'for-6.1/block' into for-6.1/passthroughJens Axboe
* for-6.1/block: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ... Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30counter: Introduce the COUNTER_COMP_ARRAY component typeWilliam Breathitt Gray
The COUNTER_COMP_ARRAY Counter component type is introduced to enable support for Counter array components. With Counter array components, exposure for buffers on counter devices can be defined via new Counter array component macros. This should simplify code for driver authors who would otherwise need to define individual Counter components for each array element. Eight Counter array component macros are introduced:: DEFINE_COUNTER_ARRAY_U64(_name, _length) DEFINE_COUNTER_ARRAY_CAPTURE(_name, _length) DEFINE_COUNTER_ARRAY_POLARITY(_name, _enums, _length) COUNTER_COMP_DEVICE_ARRAY_U64(_name, _read, _write, _array) COUNTER_COMP_COUNT_ARRAY_U64(_name, _read, _write, _array) COUNTER_COMP_SIGNAL_ARRAY_U64(_name, _read, _write, _array) COUNTER_COMP_ARRAY_CAPTURE(_read, _write, _array) COUNTER_COMP_ARRAY_POLARITY(_read, _write, _array) Eight Counter array callbacks are introduced as well:: int (*signal_array_u32_read)(struct counter_device *counter, struct counter_signal *signal, size_t idx, u32 *val); int (*signal_array_u32_write)(struct counter_device *counter, struct counter_signal *signal, size_t idx, u32 val); int (*device_array_u64_read)(struct counter_device *counter, size_t idx, u64 *val); int (*count_array_u64_read)(struct counter_device *counter, struct counter_count *count, size_t idx, u64 *val); int (*signal_array_u64_read)(struct counter_device *counter, struct counter_signal *signal, size_t idx, u64 *val); int (*device_array_u64_write)(struct counter_device *counter, size_t idx, u64 val); int (*count_array_u64_write)(struct counter_device *counter, struct counter_count *count, size_t idx, u64 val); int (*signal_array_u64_write)(struct counter_device *counter, struct counter_signal *signal, size_t idx, u64 val); Driver authors can handle reads/writes for an array component by receiving an element index via the `idx` parameter and processing the respective value via the `val` parameter. For example, suppose a driver wants to expose a Count's read-only capture buffer of four elements using a callback `foobar_capture_read()`:: DEFINE_COUNTER_ARRAY_CAPTURE(foobar_capture_array, 4); COUNTER_COMP_ARRAY_CAPTURE(foobar_capture_read, NULL, foobar_capture_array) Respective sysfs attributes for each array element would appear for the respective Count: * /sys/bus/counter/devices/counterX/countY/capture0 * /sys/bus/counter/devices/counterX/countY/capture1 * /sys/bus/counter/devices/counterX/countY/capture2 * /sys/bus/counter/devices/counterX/countY/capture3 If a user tries to read _capture2_ for example, `idx` will be `2` when passed to the `foobar_capture_read()` callback, and thus the driver knows which array element to handle. Counter arrays for polarity elements can be defined in a similar manner as u64 elements:: const enum counter_signal_polarity foobar_polarity_states[] = { COUNTER_SIGNAL_POLARITY_POSITIVE, COUNTER_SIGNAL_POLARITY_NEGATIVE, }; DEFINE_COUNTER_ARRAY_POLARITY(foobar_polarity_array, foobar_polarity_states, 4); COUNTER_COMP_ARRAY_POLARITY(foobar_polarity_read, foobar_polarity_write, foobar_polarity_array) Tested-by: Julien Panis <jpanis@baylibre.com> Link: https://lore.kernel.org/r/5310c22520aeae65b1b74952419f49ac4c8e1ec1.1664204990.git.william.gray@linaro.org/ Signed-off-by: William Breathitt Gray <william.gray@linaro.org> Link: https://lore.kernel.org/r/a51fd608704bdfc5a0efa503fc5481df34241e0a.1664318353.git.william.gray@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-30counter: Introduce the Count capture componentWilliam Breathitt Gray
Some devices provide a latch function to save historic Count values. This patch standardizes exposure of such functionality as Count capture components. A COUNTER_COMP_CAPTURE macro is provided for driver authors to define a capture component. A new event COUNTER_EVENT_CAPTURE is introduced to represent Count value capture events. Cc: Julien Panis <jpanis@baylibre.com> Link: https://lore.kernel.org/r/c239572ab4208d0d6728136e82a88ad464369a7a.1664204990.git.william.gray@linaro.org/ Signed-off-by: William Breathitt Gray <william.gray@linaro.org> Link: https://lore.kernel.org/r/3cebaa0b807a225eb277d771504fe6dba7269ffd.1664318353.git.william.gray@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-30counter: Introduce the Signal polarity componentWilliam Breathitt Gray
The Signal polarity component represents the active level of a respective Signal. There are two possible states: positive (rising edge) and negative (falling edge); enum counter_signal_polarity represents these states. A convenience macro COUNTER_COMP_POLARITY() is provided for driver authors to declare a Signal polarity component. Cc: Julien Panis <jpanis@baylibre.com> Link: https://lore.kernel.org/r/8f47d6e1db71a11bb1e2666f8e2a6e9d256d4131.1664204990.git.william.gray@linaro.org/ Signed-off-by: William Breathitt Gray <william.gray@linaro.org> Link: https://lore.kernel.org/r/b6e53438badcb6318997d13dd2fc052f97d808ac.1664318353.git.william.gray@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-30tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limitedNeal Cardwell
This commit fixes a bug in the tracking of max_packets_out and is_cwnd_limited. This bug can cause the connection to fail to remember that is_cwnd_limited is true, causing the connection to fail to grow cwnd when it should, causing throughput to be lower than it should be. The following event sequence is an example that triggers the bug: (a) The connection is cwnd_limited, but packets_out is not at its peak due to TSO deferral deciding not to send another skb yet. In such cases the connection can advance max_packets_seq and set tp->is_cwnd_limited to true and max_packets_out to a small number. (b) Then later in the round trip the connection is pacing-limited (not cwnd-limited), and packets_out is larger. In such cases the connection would raise max_packets_out to a bigger number but (unexpectedly) flip tp->is_cwnd_limited from true to false. This commit fixes that bug. One straightforward fix would be to separately track (a) the next window after max_packets_out reaches a maximum, and (b) the next window after tp->is_cwnd_limited is set to true. But this would require consuming an extra u32 sequence number. Instead, to save space we track only the most important information. Specifically, we track the strongest available signal of the degree to which the cwnd is fully utilized: (1) If the connection is cwnd-limited then we remember that fact for the current window. (2) If the connection not cwnd-limited then we track the maximum number of outstanding packets in the current window. In particular, note that the new logic cannot trigger the buggy (a)/(b) sequence above because with the new logic a condition where tp->packets_out > tp->max_packets_out can only trigger an update of tp->is_cwnd_limited if tp->is_cwnd_limited is false. This first showed up in a testing of a BBRv2 dev branch, but this buggy behavior highlighted a general issue with the tcp_cwnd_validate() logic that can cause cwnd to fail to increase at the proper rate for any TCP congestion control, including Reno or CUBIC. Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30net-next: skbuff: refactor pskb_pullRichard Gobert
pskb_may_pull already contains all of the checks performed by pskb_pull. Use pskb_may_pull for validation in pskb_pull, eliminating the duplication and making __pskb_pull obsolete. Replace __pskb_pull with pskb_pull where applicable. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-29fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODELukas Czerner
Currently the I_DIRTY_TIME will never get set if the inode already has I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's true, however ext4 will only update the on-disk inode in ->dirty_inode(), not on actual writeback. As a result if the inode already has I_DIRTY_INODE state by the time we get to __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled into on-disk inode and will not get updated until the next I_DIRTY_INODE update, which might never come if we crash or get a power failure. The problem can be reproduced on ext4 by running xfstest generic/622 with -o iversion mount option. Fix it by allowing I_DIRTY_TIME to be set even if the inode already has I_DIRTY_INODE. Also make sure that the case is properly handled in writeback_single_inode() as well. Additionally changes in xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag. Thanks Jan Kara for suggestions on how to make this work properly. Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: stable@kernel.org Signed-off-by: Lukas Czerner <lczerner@redhat.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220825100657.44217-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29fs/buffer: make submit_bh & submit_bh_wbc return type as voidRitesh Harjani (IBM)
submit_bh/submit_bh_wbc are non-blocking functions which just submit the bio and return. The caller of submit_bh/submit_bh_wbc needs to wait on buffer till I/O completion and then check buffer head's b_state field to know if there was any I/O error. Hence there is no need for these functions to have any return type. Even now they always returns 0. Hence drop the return value and make their return type as void to avoid any confusion. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/cb66ef823374cdd94d2d03083ce13de844fffd41.1660788334.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29net/sched: query offload capabilities through ndo_setup_tc()Vladimir Oltean
When adding optional new features to Qdisc offloads, existing drivers must reject the new configuration until they are coded up to act on it. Since modifying all drivers in lockstep with the changes in the Qdisc can create problems of its own, it would be nice if there existed an automatic opt-in mechanism for offloading optional features. Jakub proposes that we multiplex one more kind of call through ndo_setup_tc(): one where the driver populates a Qdisc-specific capability structure. First user will be taprio in further changes. Here we are introducing the definitions for the base functionality. Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220923163310.3192733-3-vladimir.oltean@nxp.com/ Suggested-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net: skb: introduce and use a single page frag cachePaolo Abeni
After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs") we are observing 10-20% regressions in performance tests with small packets. The perf trace points to high pressure on the slab allocator. This change tries to improve the allocation schema for small packets using an idea originally suggested by Eric: a new per CPU page frag is introduced and used in __napi_alloc_skb to cope with small allocation requests. To ensure that the above does not lead to excessive truesize underestimation, the frag size for small allocation is inflated to 1K and all the above is restricted to build with 4K page size. Note that we need to update accordingly the run-time check introduced with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper"). Alex suggested a smart page refcount schema to reduce the number of atomic operations and deal properly with pfmemalloc pages. Under small packet UDP flood, I measure a 15% peak tput increases. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Alexander H Duyck <alexanderduyck@fb.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29clk: fixed-rate: add devm_clk_hw_register_fixed_rateDmitry Baryshkov
Add devm_clk_hw_register_fixed_rate(), devres-managed helper to register fixed-rate clock. Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Link: https://lore.kernel.org/r/20220916061740.87167-3-dmitry.baryshkov@linaro.org Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2022-09-29clk: asm9260: use parent index to link the reference clockDmitry Baryshkov
Rewrite clk-asm9260 to use parent index to use the reference clock. During this rework two helpers are added: - clk_hw_register_mux_table_parent_data() to supplement clk_hw_register_mux_table() but using parent_data instead of parent_names - clk_hw_register_fixed_rate_parent_accuracy() to be used instead of directly calling __clk_hw_register_fixed_rate(). The later function is an internal API, which is better not to be called directly. Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Link: https://lore.kernel.org/r/20220916061740.87167-2-dmitry.baryshkov@linaro.org Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2022-09-29binfmt: remove taso from linux_binprm structLukas Bulwahn
With commit 987f20a9dcce ("a.out: Remove the a.out implementation"), the use of the special taso flag for alpha architectures in the linux_binprm struct is gone. Remove the definition of taso in the linux_binprm struct. No functional change. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220929203903.9475-1-lukas.bulwahn@gmail.com
2022-09-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29prandom: make use of smaller types in prandom_u32_maxJason A. Donenfeld
When possible at compile-time, make use of smaller types in prandom_u32_max(), so that we can use smaller batches from random.c, which in turn leads to a 2x or 4x performance boost. This makes a difference, for example, in kfence, which needs a fast stream of small numbers (booleans). At the same time, we use the occasion to update the old documentation on these functions. prandom_u32() and prandom_bytes() have direct replacements now in random.h, while prandom_u32_max() remains useful as a prandom.h function, since it's not cryptographically secure by virtue of not being evenly distributed. Cc: Dominik Brodowski <linux@dominikbrodowski.net> Acked-by: Marco Elver <elver@google.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-29random: add 8-bit and 16-bit batchesJason A. Donenfeld
There are numerous places in the kernel that would be sped up by having smaller batches. Currently those callsites do `get_random_u32() & 0xff` or similar. Since these are pretty spread out, and will require patches to multiple different trees, let's get ahead of the curve and lay the foundation for `get_random_u8()` and `get_random_u16()`, so that it's then possible to start submitting conversion patches leisurely. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-29random: split initialization into early step and later stepJason A. Donenfeld
The full RNG initialization relies on some timestamps, made possible with initialization functions like time_init() and timekeeping_init(). However, these are only available rather late in initialization. Meanwhile, other things, such as memory allocator functions, make use of the RNG much earlier. So split RNG initialization into two phases. We can provide arch randomness very early on, and then later, after timekeeping and such are available, initialize the rest. This ensures that, for example, slabs are properly randomized if RDRAND is available. Without this, CONFIG_SLAB_FREELIST_RANDOM=y loses a degree of its security, because its random seed is potentially deterministic, since it hasn't yet incorporated RDRAND. It also makes it possible to use a better seed in kfence, which currently relies on only the cycle counter. Another positive consequence is that on systems with RDRAND, running with CONFIG_WARN_ALL_UNSEEDED_RANDOM=y results in no warnings at all. One subtle side effect of this change is that on systems with no RDRAND, RDTSC is now only queried by random_init() once, committing the moment of the function call, instead of multiple times as before. This is intentional, as the multiple RDTSCs in a loop before weren't accomplishing very much, with jitter being better provided by try_to_generate_entropy(). Plus, filling blocks with RDTSC is still being done in extract_entropy(), which is necessarily called before random bytes are served anyway. Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-29bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itselfMartin KaFai Lau
When a bad bpf prog '.init' calls bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop: .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ... ... => .init => bpf_setsockopt(tcp_cc). It was prevented by the prog->active counter before but the prog->active detection cannot be used in struct_ops as explained in the earlier patch of the set. In this patch, the second bpf_setsockopt(tcp_cc) is not allowed in order to break the loop. This is done by using a bit of an existing 1 byte hole in tcp_sock to check if there is on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock. Note that this essentially limits only the first '.init' can call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer does not support ECN) and the second '.init' cannot fallback to another cc. This applies even the second bpf_setsockopt(TCP_CONGESTION) will not cause a loop. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220929070407.965581-5-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-29bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampolineMartin KaFai Lau
The struct_ops prog is to allow using bpf to implement the functions in a struct (eg. kernel module). The current usage is to implement the tcp_congestion. The kernel does not call the tcp-cc's ops (ie. the bpf prog) in a recursive way. The struct_ops is sharing the tracing-trampoline's enter/exit function which tracks prog->active to avoid recursion. It is needed for tracing prog. However, it turns out the struct_ops bpf prog will hit this prog->active and unnecessarily skipped running the struct_ops prog. eg. The '.ssthresh' may run in_task() and then interrupted by softirq that runs the same '.ssthresh'. Skip running the '.ssthresh' will end up returning random value to the caller. The patch adds __bpf_prog_{enter,exit}_struct_ops for the struct_ops trampoline. They do not track the prog->active to detect recursion. One exception is when the tcp_congestion's '.init' ops is doing bpf_setsockopt(TCP_CONGESTION) and then recurs to the same '.init' ops. This will be addressed in the following patches. Fixes: ca06f55b9002 ("bpf: Add per-program recursion prevention mechanism") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220929070407.965581-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-29Merge branch irq/misc-6.1 into irq/irqchip-nextMarc Zyngier
* irq/misc-6.1: : . : Misc irqchip updates for 6.1: : : - Allow generic irqchip support without selecting CONFIG_OF_IRQ : : - Fix a couple of bindings for TI interrupts controllers : : - Yet another binding update for a Renesas SoC : : - The obligatory fixes from the spelling police : . dt-bindings: irqchip: renesas,irqc: Add r8a779g0 support irqchip/gic-v3: Fix typo in comment dt-bindings: interrupt-controller: ti,sci-intr: Fix missing reg property in the binding dt-bindings: irqchip: ti,sci-inta: Fix warning for missing #interrupt-cells irqchip: Make irqchip_init() usable on pure ACPI systems Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-09-29regulator: gpio: Add input_supply support in gpio_regulator_configJerome Neanne
This is simillar as fixed-regulator. Used to extract regulator parent from the device tree. Without that property used, the parent regulator can be shut down (if not an always on). Thus leading to inappropriate behavior: On am62-SP-SK this fix is required to avoid tps65219 ldo1 (SDMMC rail) to be shut down after boot completion. Signed-off-by: Jerome Neanne <jneanne@baylibre.com> Link: https://lore.kernel.org/r/20220929132526.29427-2-jneanne@baylibre.com Signed-off-by: Mark Brown <broonie@kernel.org>
2022-09-29tracing/user_events: Use bits vs bytes for enabled status page dataBeau Belgrave
User processes may require many events and when they do the cache performance of a byte index status check is less ideal than a bit index. The previous event limit per-page was 4096, the new limit is 32,768. This change adds a bitwise index to the user_reg struct. Programs check that the bit at status_bit has a bit set within the status page(s). Link: https://lkml.kernel.org/r/20220728233309.1896-6-beaub@linux.microsoft.com Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/ Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-29block: adapt blk_mq_plug() to not plug for writes that require a zone lockPankaj Raghav
The current implementation of blk_mq_plug() disables plugging for all operations that involves a transfer to the device as we just check if the last bit in op_is_write() function. Modify blk_mq_plug() to disable plugging only for REQ_OP_WRITE and REQ_OP_WRITE_ZEROS as they might require a zone lock. Suggested-by: Christoph Hellwig <hch@lst.de> Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220929074745.103073-2-p.raghav@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29printk: Declare log_wait properlyThomas Gleixner
kernel/printk/printk.c:365:1: warning: symbol 'log_wait' was not declared. Should it be static? Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220924000454.3319186-3-john.ogness@linutronix.de
2022-09-29printk: Make pr_flush() staticThomas Gleixner
No user outside the printk code and no reason to export this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220924000454.3319186-2-john.ogness@linutronix.de
2022-09-29Merge tag 'ata-6.0-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata Pull ATA fixes from Damien Le Moal: "Three late patches to fix problems discovered recently: - Add a horkage to disable link power management by default for the Pioneer BDR-207M and BDR-205 DVD drives (from Niklas) - Two patches to fix setting the maximum queue depth of libsas owned ATA devices (from me)" * tag 'ata-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata: ata: libata-sata: Fix device queue depth control ata: libata-scsi: Fix initialization of device queue depth libata: add ATA_HORKAGE_NOLPM for Pioneer BDR-207M and BDR-205
2022-09-29Merge branch 'v6.0-rc7'Peter Zijlstra
Merge upstream to get RAPTORLAKE_S Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2022-09-29Merge branch 'slab/for-6.1/kmalloc_size_roundup' into slab/for-nextVlastimil Babka
The first two patches from a series by Kees Cook [1] that introduce kmalloc_size_roundup(). This will allow merging of per-subsystem patches using the new function and ultimately stop (ab)using ksize() in a way that causes ongoing trouble for debugging functionality and static checkers. [1] https://lore.kernel.org/all/20220923202822.2667581-1-keescook@chromium.org/ -- Resolved a conflict of modifying mm/slab.c __ksize() comment with a commit that unifies __ksize() implementation into mm/slab_common.c
2022-09-29Merge branch 'slab/for-6.1/slub_debug_waste' into slab/for-nextVlastimil Babka
A patch from Feng Tang that enhances the existing debugfs alloc_traces file for kmalloc caches with information about how much space is wasted by allocations that needs less space than the particular kmalloc cache provides.
2022-09-29slab: Introduce kmalloc_size_roundup()Kees Cook
In the effort to help the compiler reason about buffer sizes, the __alloc_size attribute was added to allocators. This improves the scope of the compiler's ability to apply CONFIG_UBSAN_BOUNDS and (in the near future) CONFIG_FORTIFY_SOURCE. For most allocations, this works well, as the vast majority of callers are not expecting to use more memory than what they asked for. There is, however, one common exception to this: anticipatory resizing of kmalloc allocations. These cases all use ksize() to determine the actual bucket size of a given allocation (e.g. 128 when 126 was asked for). This comes in two styles in the kernel: 1) An allocation has been determined to be too small, and needs to be resized. Instead of the caller choosing its own next best size, it wants to minimize the number of calls to krealloc(), so it just uses ksize() plus some additional bytes, forcing the realloc into the next bucket size, from which it can learn how large it is now. For example: data = krealloc(data, ksize(data) + 1, gfp); data_len = ksize(data); 2) The minimum size of an allocation is calculated, but since it may grow in the future, just use all the space available in the chosen bucket immediately, to avoid needing to reallocate later. A good example of this is skbuff's allocators: data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); ... /* kmalloc(size) might give us more room than requested. * Put skb_shared_info exactly at the end of allocated zone, * to allow max possible filling before reallocation. */ osize = ksize(data); size = SKB_WITH_OVERHEAD(osize); In both cases, the "how much was actually allocated?" question is answered _after_ the allocation, where the compiler hinting is not in an easy place to make the association any more. This mismatch between the compiler's view of the buffer length and the code's intention about how much it is going to actually use has already caused problems[1]. It is possible to fix this by reordering the use of the "actual size" information. We can serve the needs of users of ksize() and still have accurate buffer length hinting for the compiler by doing the bucket size calculation _before_ the allocation. Code can instead ask "how large an allocation would I get for a given size?". Introduce kmalloc_size_roundup(), to serve this function so we can start replacing the "anticipatory resizing" uses of ksize(). [1] https://github.com/ClangBuiltLinux/linux/issues/1599 https://github.com/KSPP/linux/issues/183 [ vbabka@suse.cz: add SLOB version ] Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-29slab: Remove __malloc attribute from realloc functionsKees Cook
The __malloc attribute should not be applied to "realloc" functions, as the returned pointer may alias the storage of the prior pointer. Instead of splitting __malloc from __alloc_size, which would be a huge amount of churn, just create __realloc_size for the few cases where it is needed. Thanks to Geert Uytterhoeven <geert@linux-m68k.org> for reporting build failures with gcc-8 in earlier version which tried to remove the #ifdef. While the "alloc_size" attribute is available on all GCC versions, I forgot that it gets disabled explicitly by the kernel in GCC < 9.1 due to misbehaviors. Add a note to the compiler_attributes.h entry for it. Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Marco Elver <elver@google.com> Cc: linux-mm@kvack.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-28net/mlx5: Add the log_min_mkey_entity_size capabilityMaxim Mikityanskiy
Add the capability that will allow the driver to determine the minimal MTT page size to be able to map the smallest possible pages in XSK. The older firmwares that don't have this capability default to 12 (i.e. 4096-byte pages). Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28Merge branch 'mlx5-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== updates from mlx5-next 2022-09-24 Updates form mlx5-next including[1]: 1) HW definitions and support for NPPS clock settings. 2) various cleanups 3) Enable hash mode by default for all NICs 4) page tracker and advanced virtualization HW definitions for vfio [1] https://lore.kernel.org/netdev/20220907233636.388475-1-saeed@kernel.org/ * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Remove from FPGA IFC file not-needed definitions net/mlx5: Remove unused structs net/mlx5: Remove unused functions net/mlx5: detect and enable bypass port select flow table net/mlx5: Lag, enable hash mode by default for all NICs net/mlx5: Lag, set active ports if support bypass port select flow table RDMA/mlx5: Don't set tx affinity when lag is in hash mode net/mlx5: add IFC bits for bypassing port select flow table net/mlx5: Add support for NPPS with real time mode net/mlx5: Expose NPPS related registers net/mlx5: Query ADV_VIRTUALIZATION capabilities net/mlx5: Introduce ifc bits for page tracker RDMA/mlx5: Move function mlx5_core_query_ib_ppcnt() to mlx5_ib ==================== Link: https://lore.kernel.org/all/20220927201906.234015-1-saeed@kernel.org/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28net: drop the weight argument from netif_napi_addJakub Kicinski
We tell driver developers to always pass NAPI_POLL_WEIGHT as the weight to netif_napi_add(). This may be confusing to newcomers, drop the weight argument, those who really need to tweak the weight can use netif_napi_add_weight(). Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN Link: https://lore.kernel.org/r/20220927132753.750069-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28net: shrink struct ubuf_infoPavel Begunkov
We can benefit from a smaller struct ubuf_info, so leave only mandatory fields and let users to decide how they want to extend it. Convert MSG_ZEROCOPY to struct ubuf_info_msgzc and remove duplicated fields. This reduces the size from 48 bytes to just 16. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28net: introduce struct ubuf_info_msgzcPavel Begunkov
We're going to split struct ubuf_info and leave there only mandatory fields. Users are free to extend it. Add struct ubuf_info_msgzc, which will be an extended version for MSG_ZEROCOPY and some other users. It duplicates of struct ubuf_info for now and will be removed in a couple of patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28tracing: Wake up ring buffer waiters on closing of the fileSteven Rostedt (Google)
When the file that represents the ring buffer is closed, there may be waiters waiting on more input from the ring buffer. Call ring_buffer_wake_waiters() to wake up any waiters when the file is closed. Link: https://lkml.kernel.org/r/20220927231825.182416969@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-28ring-buffer: Add ring_buffer_wake_waiters()Steven Rostedt (Google)
On closing of a file that represents a ring buffer or flushing the file, there may be waiters on the ring buffer that needs to be woken up and exit the ring_buffer_wait() function. Add ring_buffer_wake_waiters() to wake up the waiters on the ring buffer and allow them to exit the wait loop. Link: https://lkml.kernel.org/r/20220928133938.28dc2c27@gandalf.local.home Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 15693458c4bc0 ("tracing/ring-buffer: Move poll wake ups into ring buffer code") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-28bpf: Parameterize task iterators.Kui-Feng Lee
Allow creating an iterator that loops through resources of one thread/process. People could only create iterators to loop through all resources of files, vma, and tasks in the system, even though they were interested in only the resources of a specific task or process. Passing the additional parameters, people can now create an iterator to go through all resources or only the resources of a task. Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/bpf/20220926184957.208194-2-kuifeng@fb.com
2022-09-29linux/export: use inline assembler to populate symbol CRCsMasahiro Yamada
Since commit 7b4537199a4a ("kbuild: link symbol CRCs at final link, removing CONFIG_MODULE_REL_CRCS"), the module versioning on the (non-upstreamed-yet) kvx Linux port is broken due to unexpected padding for __crc_* symbols. The kvx GCC adds padding so u32 gets 8-byte alignment instead of 4. I do not know if this happens for upstream architectures in general, but any compiler has the freedom to insert padding for faster access. Use the inline assembler to directly specify the wanted data layout. This is how we previously did before the breakage. Link: https://lore.kernel.org/lkml/20220817161438.32039-1-ysionneau@kalray.eu/ Link: https://lore.kernel.org/linux-kbuild/31ce5305-a76b-13d7-ea55-afca82c46cf2@kalray.eu/ Fixes: 7b4537199a4a ("kbuild: link symbol CRCs at final link, removing CONFIG_MODULE_REL_CRCS") Reported-by: Yann Sionneau <ysionneau@kalray.eu> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Tested-by: Yann Sionneau <ysionneau@kalray.eu>
2022-09-28Merge tag 'nvme-6.1-2022-09-28' of git://git.infradead.org/nvme into ↵Jens Axboe
for-6.1/block Pull NVMe updates from Christoph: "nvme updates for Linux 6.1 - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)" * tag 'nvme-6.1-2022-09-28' of git://git.infradead.org/nvme: (31 commits) nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data nvme-tcp: remove the unused queue_size member in nvme_tcp_queue nvme: add common helpers to allocate and free tagsets nvme-auth: add a MAINTAINERS entry nvmet: add helpers to set the result field for connect commands nvme: improve the NVME_CONNECT_AUTHREQ* definitions nvmet-auth: don't try to cancel a non-initialized work_struct nvmet-tcp: remove nvmet_tcp_finish_cmd ...
2022-09-28remoteproc: Introduce rproc featuresPeng Fan
remote processor may support: - boot recovery with help from main processor - self recovery without help from main processor - iommu - etc Introduce rproc features could simplify code to avoid adding more bool flags Acked-by: Arnaud Pouliquen <arnaud.pouliquen@foss.st.com> Signed-off-by: Peng Fan <peng.fan@nxp.com> Link: https://lore.kernel.org/r/20220928064756.4059662-2-peng.fan@oss.nxp.com Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2022-09-28Revert "net: set proper memcg for net_init hooks allocations"Shakeel Butt
This reverts commit 1d0403d20f6c281cb3d14c5f1db5317caeec48e9. Anatoly Pugachev reported that the commit 1d0403d20f6c ("net: set proper memcg for net_init hooks allocations") is somehow causing the sparc64 VMs failed to boot and the VMs boot fine with that patch reverted. So, revert the patch for now and later we can debug the issue. Link: https://lore.kernel.org/all/20220918092849.GA10314@u164.east.ru/ Reported-by: Anatoly Pugachev <matorola@gmail.com> Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Vasily Averin <vvs@openvz.org> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: cgroups@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Tested-by: Anatoly Pugachev <matorola@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Fixes: 1d0403d20f6c ("net: set proper memcg for net_init hooks allocations") Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-09-28mfd/omap1: htc-i2cpld: Convert to a pure GPIO driverLinus Walleij
Instead of passing GPIO numbers pertaining to ourselves through platform data, just request GPIO descriptors from our own GPIO chips and use them, and cut down on the unnecessary complexity. Cc: Aaro Koskinen <aaro.koskinen@iki.fi> Cc: Janusz Krzysztofik <jmkrzyszt@gmail.com> Cc: Tony Lindgren <tony@atomide.com> Cc: Cory Maccarrone <darkstar6262@gmail.com> Cc: linux-omap@vger.kernel.org Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Lee Jones <lee@kernel.org> Link: https://lore.kernel.org/r/20220905115810.5987-1-linus.walleij@linaro.org