Age | Commit message (Collapse) | Author |
|
Currently, DWRR of tc will be initialized to a fixed value when this tc
is enabled, but it is not been reset to 0 when this tc is disabled. It
cause a problem that the DWRR of unused tc is not 0 after using tc tool
to add and delete multi-tc parameters.
For examples, after enabling 4 TCs and restoring to 1 TC by follow
tc commands:
$ tc qdisc add dev eth0 root mqprio num_tc 4 map 0 1 2 3 0 1 2 3 queues \
8@0 8@8 8@16 8@24 hw 1 mode channel
$ tc qdisc del dev eth0 root
Now there is just one TC is enabled for eth0, but the tc info querying by
debugfs is shown as follow:
$ cat /mnt/hns3/0000:7d:00.0/tm/tc_sch_info
enabled tc number: 1
weight_offset: 14
TC MODE WEIGHT
0 dwrr 100
1 dwrr 100
2 dwrr 100
3 dwrr 100
4 dwrr 0
5 dwrr 0
6 dwrr 0
7 dwrr 0
This patch fixes it by resetting DWRR of tc to 0 when tc is disabled.
Fixes: 848440544b41 ("net: hns3: Add support of TX Scheduler & Shaper to HNS3 driver")
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add configuration of interrupt type and fifo interrupt enable of TM QCN
error event if enabled, otherwise this event will not be reported when
there is error.
Fixes: d914971df022 ("net: hns3: remove redundant query in hclge_config_tm_hw_err_int()")
Signed-off-by: Jiaran Zhang <zhangjiaran@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
get:
BUG: scheduling while atomic: swapper/1/0/0x00000000
no locks held by swapper/1/0.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
Call Trace:
dump_stack_lvl+0xac/0x108
__schedule_bug+0xac/0xe0
__schedule+0xcf8/0x10d0
schedule_idle+0x3c/0x70
do_idle+0x2d8/0x4a0
cpu_startup_entry+0x38/0x40
start_secondary+0x2ec/0x3a0
start_secondary_prolog+0x10/0x14
This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
preempt count, for reasons explained in commit a7c2bb8279d2 ("powerpc:
Re-enable preemption before cpu_die()"), specifically "start_secondary()
expects a preempt_count() of 0."
However, since commit 2c669ef6979c ("powerpc/preempt: Don't touch the idle
task's preempt_count during hotplug") and commit f1a0a376ca0c ("sched/core:
Initialize the idle task with preemption disabled"), that justification no
longer holds.
The idle task isn't supposed to re-enable preemption, so remove the
vestigial preempt_enable() from the CPU offline path.
Tested with pseries and powernv in qemu, and pseries on PowerVM.
Fixes: 2c669ef6979c ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211015173902.2278118-1-nathanl@linux.ibm.com
|
|
In isa206_idle_insn_mayloss() we store various registers into the stack
red zone, which is allowed.
However inside the IDLE_STATE_ENTER_SEQ_NORET macro we save r2 again,
to 0(r1), which corrupts the stack back chain.
We used to do the same in isa206_idle_insn_mayloss() itself, but we
fixed that in 73287caa9210 ("powerpc64/idle: Fix SP offsets when saving
GPRs"), however we missed that the macro also corrupts the back chain.
Corrupting the back chain is bad for debuggability but doesn't
necessarily cause a bug.
However we recently changed the stack handling in some KVM code, and it
now relies on the stack back chain being valid when it returns. The
corruption causes that code to return with r1 pointing somewhere in
kernel data, at some point LR is restored from the stack and we branch
to NULL or somewhere else invalid.
Only affects Power8 hosts running KVM guests, with dynamic_mt_modes
enabled (which it is by default).
The fixes tag below points to the commit that changed the KVM stack
handling, exposing this bug. The actual corruption of the back chain has
always existed since 948cf67c4726 ("powerpc: Add NAP mode support on
Power7 in HV mode").
Fixes: 9b4416c5095c ("KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211020094826.3222052-1-mpe@ellerman.id.au
|
|
This reverts commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1.
When an interface is enslaved in a VRF, prerouting conntrack hook is
called twice: once in the context of the original input interface, and
once in the context of the VRF interface. If no special precausions are
taken, this leads to creation of two conntrack entries instead of one,
and breaks SNAT.
Commit above was intended to avoid creation of extra conntrack entries
when input interface is enslaved in a VRF. It did so by resetting
conntrack related data associated with the skb when it enters VRF context.
However it breaks netfilter operation. Imagine a use case when conntrack
zone must be assigned based on the original input interface, rather than
VRF interface (that would make original interfaces indistinguishable). One
could create netfilter rules similar to these:
chain rawprerouting {
type filter hook prerouting priority raw;
iif realiface1 ct zone set 1 return
iif realiface2 ct zone set 2 return
}
This works before the mentioned commit, but not after: zone assignment
is "forgotten", and any subsequent NAT or filtering that is dependent
on the conntrack zone does not work.
Here is a reproducer script that demonstrates the difference in behaviour.
==========
#!/bin/sh
# This script demonstrates unexpected change of nftables behaviour
# caused by commit 09e856d54bda5f28 ""vrf: Reset skb conntrack
# connection on VRF rcv"
# https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=09e856d54bda5f288ef8437a90ab2b9b3eab83d1
#
# Before the commit, it was possible to assign conntrack zone to a
# packet (or mark it for `notracking`) in the prerouting chanin, raw
# priority, based on the `iif` (interface from which the packet
# arrived).
# After the change, # if the interface is enslaved in a VRF, such
# assignment is lost. Instead, assignment based on the `iif` matching
# the VRF master interface is honored. Thus it is impossible to
# distinguish packets based on the original interface.
#
# This script demonstrates this change of behaviour: conntrack zone 1
# or 2 is assigned depending on the match with the original interface
# or the vrf master interface. It can be observed that conntrack entry
# appears in different zone in the kernel versions before and after
# the commit.
IPIN=172.30.30.1
IPOUT=172.30.30.2
PFXL=30
ip li sh vein >/dev/null 2>&1 && ip li del vein
ip li sh tvrf >/dev/null 2>&1 && ip li del tvrf
nft list table testct >/dev/null 2>&1 && nft delete table testct
ip li add vein type veth peer veout
ip li add tvrf type vrf table 9876
ip li set veout master tvrf
ip li set vein up
ip li set veout up
ip li set tvrf up
/sbin/sysctl -w net.ipv4.conf.veout.accept_local=1
/sbin/sysctl -w net.ipv4.conf.veout.rp_filter=0
ip addr add $IPIN/$PFXL dev vein
ip addr add $IPOUT/$PFXL dev veout
nft -f - <<__END__
table testct {
chain rawpre {
type filter hook prerouting priority raw;
iif { veout, tvrf } meta nftrace set 1
iif veout ct zone set 1 return
iif tvrf ct zone set 2 return
notrack
}
chain rawout {
type filter hook output priority raw;
notrack
}
}
__END__
uname -rv
conntrack -F
ping -W 1 -c 1 -I vein $IPOUT
conntrack -L
Signed-off-by: Eugene Crosser <crosser@average.org>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make sure the security buffer's length/offset are valid with regards to
the packet length.
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
To avoid dictionary attacks (repeated session setups rapidly sent) to
connect to server, ksmbd make a delay of a 5 seconds on session setup
failure to make it harder to send enough random connection requests
to break into a server if a user insert the wrong password 10 times
in a row.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests and
check the free size of response buffer for these requests.
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The spi-altera driver has two flavors: platform and dfl. I'm seeing
a case where I have both device types in the same machine, and they
are conflicting on the SPI ID:
... kernel: couldn't get idr
... kernel: WARNING: CPU: 28 PID: 912 at drivers/spi/spi.c:2920 spi_register_controller.cold+0x84/0xc0a
Both the platform and dfl drivers use the parent's driver ID as the SPI
ID. In the error case, the parent devices are dfl_dev.4 and
subdev_spi_altera.4.auto. When the second spi-master is created, the
failure occurs because the SPI ID of 4 has already been allocated.
Change the ID allocation to dynamic (by initializing bus_num to -1) to
avoid duplicate SPI IDs.
Signed-off-by: Russ Weight <russell.h.weight@intel.com>
Link: https://lore.kernel.org/r/20211019002401.24041-1-russell.h.weight@intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
blk_mq_quiesce_queue() has been used a bit wide now, so far we don't support
concurrent/nested quiesce. One biggest issue is that unquiesce can happen
unexpectedly in case that quiesce/unquiesce are run concurrently from
more than one context.
This patch introduces q->mq_quiesce_depth to deal concurrent quiesce,
and we only unquiesce queue when it is the last/outer-most one of all
contexts.
Several kernel panic issue has been reported[1][2][3] when running stress
quiesce test. And this patch has been verified in these reports.
[1] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@huawei.com/T/#m1fc52431fad7f33b1ffc3f12c4450e4238540787
[2] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@huawei.com/T/#m10ad90afeb9c8cc318334190a7c24c8b5c5e0722
[3] https://listman.redhat.com/archives/dm-devel/2021-September/msg00189.html
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The nvme-loop's admin queue may be freed and reallocated, and we have to
reset the flag of NVME_CTRL_ADMIN_Q_STOPPED so that the flag can match
with the quiesce state of the admin queue.
nvme-loop is the only driver to reallocate request queue, and not see
such usage in other nvme drivers.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The current blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() always
stops and starts the queue unconditionally. And there can be concurrent
quiesce/unquiesce coming from different unrelated code paths, so
unquiesce may come unexpectedly and start queue too early.
Prepare for supporting concurrent quiesce/unquiesce from multiple
contexts, so that we can address the above issue.
NVMe has very complicated quiesce/unquiesce use pattern, add one atomic
bit for makeiing sure that blk-mq quiece/unquiesce is always called in
pair.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add two helpers so that we can prepare for pairing quiescing and
unquiescing which will be done in next patch.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Apply the added two APIs to quiesce/unquiesce admin queue.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add two APIs for stopping and starting admin queue.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently force_nonblock stands for three meanings:
- nowait or not
- in an io-worker or not(hold uring_lock or not)
Let's split the logic to two flags, IO_URING_F_NONBLOCK and
IO_URING_F_UNLOCKED for convenience of the next patch.
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20211018133431.103298-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When irdma_ws_add fails, irdma_ws_remove is used to cleanup the leaf node.
This lead to holding the qos mutex twice in the QP resume path. Fix this
by avoiding the call to irdma_ws_remove and unwinding the error in
irdma_ws_add. This skips the call to irdma_tc_in_use function which is not
needed in the error unwind cases.
Fixes: 3ae331c75128 ("RDMA/irdma: Add QoS definitions")
Link: https://lore.kernel.org/r/20211019151654.1943-2-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Currently VLAN is reported in UD work completion when VLAN id is zero,
i.e. no VLAN case.
Report VLAN in UD work completion only when VLAN id is non-zero.
Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs")
Link: https://lore.kernel.org/r/20211019151654.1943-1-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Normally the zero fill would hide the missing initialization, but an
errant set to desc_size in reg_create() causes a crash:
BUG: unable to handle page fault for address: 0000000800000000
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 5 PID: 890 Comm: ib_write_bw Not tainted 5.15.0-rc4+ #47
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:mlx5_ib_dereg_mr+0x14/0x3b0 [mlx5_ib]
Code: 48 63 cd 4c 89 f7 48 89 0c 24 e8 37 30 03 e1 48 8b 0c 24 eb a0 90 0f 1f 44 00 00 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 30 <48> 8b 2f 65 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 8b 87 c8
RSP: 0018:ffff88811afa3a60 EFLAGS: 00010286
RAX: 000000000000001c RBX: 0000000800000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000800000000
RBP: 0000000800000000 R08: 0000000000000000 R09: c0000000fffff7ff
R10: ffff88811afa38f8 R11: ffff88811afa38f0 R12: ffffffffa02c7ac0
R13: 0000000000000000 R14: ffff88811afa3cd8 R15: ffff88810772fa00
FS: 00007f47b9080740(0000) GS:ffff88852cd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000800000000 CR3: 000000010761e003 CR4: 0000000000370ea0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
mlx5_ib_free_odp_mr+0x95/0xc0 [mlx5_ib]
mlx5_ib_dereg_mr+0x128/0x3b0 [mlx5_ib]
ib_dereg_mr_user+0x45/0xb0 [ib_core]
? xas_load+0x8/0x80
destroy_hw_idr_uobject+0x1a/0x50 [ib_uverbs]
uverbs_destroy_uobject+0x2f/0x150 [ib_uverbs]
uobj_destroy+0x3c/0x70 [ib_uverbs]
ib_uverbs_cmd_verbs+0x467/0xb00 [ib_uverbs]
? uverbs_finalize_object+0x60/0x60 [ib_uverbs]
? ttwu_queue_wakelist+0xa9/0xe0
? pty_write+0x85/0x90
? file_tty_write.isra.33+0x214/0x330
? process_echoes+0x60/0x60
ib_uverbs_ioctl+0xa7/0x110 [ib_uverbs]
__x64_sys_ioctl+0x10d/0x8e0
? vfs_write+0x17f/0x260
do_syscall_64+0x3c/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Add the missing xarray initialization and remove the desc_size set.
Fixes: a639e66703ee ("RDMA/mlx5: Zero out ODP related items in the mlx5_ib_mr")
Link: https://lore.kernel.org/r/a4846a11c9de834663e521770da895007f9f0d30.1634642730.git.leonro@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Device's QP memory should only be allocated and released by IB layer.
This patch removes the redundant release of the device's qp memory and
uses completion APIs to make sure that .destroy_qp() only return, when qp
reference becomes 0.
Fixes: 514aee660df4 ("RDMA: Globally allocate and release QP memory")
Link: https://lore.kernel.org/r/20211019082212.7052-1-pkushwaha@marvell.com
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: Alok Prasad <palok@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
First, fix nr_workers checks against max_workers, with max_worker
registration, it may pretty easily happen that nr_workers > max_workers.
Also, synchronise writing to acct->max_worker with wqe->lock. It's not
an actual problem, but as we don't care about io_wqe_create_worker(),
it's better than WRITE_ONCE()/READ_ONCE().
Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/11f90e6b49410b7d1a88f5d04fb8d95bb86b8cf3.1634671835.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If we return before the end of the 'for_each_child_of_node()' iterator, the
reference taken on 'port' must be released.
Add the missing 'of_node_put()' calls.
Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/15d5310d1d55ad51c1af80775865306d92432e03.1634587046.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In bfq_pd_alloc(), the function bfqg_stats_init() init bfqg. If
blkg_rwstat_init() init bfqg_stats->bytes successful and init
bfqg_stats->ios failed, bfqg_stats_init() return failed, bfqg will
be freed. But blkg_rwstat->cpu_cnt is not deleted from the list of
percpu_counters. If we traverse the list of percpu_counters, It will
have UAF problem.
we should use blkg_rwstat_exit() to cleanup bfqg_stats bytes in the
above scenario.
Fixes: commit fd41e60331b ("bfq-iosched: stop using blkg->stat_bytes and ->stat_ios")
Signed-off-by: Zheng Liang <zhengliang6@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20211018024225.1493938-1-zhengliang6@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If we don't use an IO scheduler or have shared tags, then we don't need
to call into this external function at all. This saves ~2% for such
a setup.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This memset in the fast path costs a lot of cycles on my setup. Here's a
top-of-profile of doing ~6.7M IOPS:
+ 5.90% io_uring [nvme] [k] nvme_queue_rq
+ 5.32% io_uring [nvme_core] [k] nvme_setup_cmd
+ 5.17% io_uring [kernel.vmlinux] [k] io_submit_sqes
+ 4.97% io_uring [kernel.vmlinux] [k] blkdev_direct_IO
and a perf diff with this patch:
0.92% +4.40% [nvme_core] [k] nvme_setup_cmd
reducing it from 5.3% to only 0.9%. This takes it from the 2nd most
cycle consumer to something that's mostly irrelevant.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We don't have to worry about doing extra memsets by moving it outside
the protection of RQF_DONTPREP, as nvme doesn't do partial completions.
This is in preparation for making the read/write fast path not do a full
memset of the command.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is only of historical interest, and anyone interested in the
history can dig out an old version of locks.c from from git.
Triggered by the observation that it references the now-removed
Documentation/filesystems/mandatory-locking.rst.
Reported-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
|
|
Commit 6381195ad7d0 ("ACPI: power: Rework turning off unused power
resources") caused power resources in unknown state with reference
counters equal to zero to be turned off too, but that caused issues
to appear in the field, so modify the code to only turn off power
resources that are known to be "on".
Link: https://lore.kernel.org/linux-acpi/6faf4b92-78d5-47a4-63df-cc2bab7769d0@molgen.mpg.de/
Fixes: 6381195ad7d0 ("ACPI: power: Rework turning off unused power resources")
Reported-by: Andreas K. Huettel <andreas.huettel@ur.de>
Tested-by: Andreas K. Huettel <andreas.huettel@ur.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: 5.14+ <stable@vger.kernel.org> # 5.14+
|
|
Return to the normal blk_mq_submit_bio flow if the bio did not end up
actually being a flush because the device didn't support it. Note that
this is basically impossible to hit without special instrumentation given
that submit_bio_checks already clears these flags usually, so we'd need a
tight race to actually hit this code path.
With this the call to blk_mq_run_hw_queue for the flush requests can be
removed given that the actual flush requests are always issued via the
requeue workqueue which runs the queue unconditionally.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019122553.2467817-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Instead of leaking the ucounts in new if alloc_ucounts fails, store
the result of alloc_ucounts into a temporary variable, which is later
assigned to new->ucounts.
Cc: stable@vger.kernel.org
Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred")
Link: https://lkml.kernel.org/r/87pms2s0v8.fsf_-_@disp2133
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
The purpose of inc_rlimit_ucounts and dec_rlimit_ucounts in commit_creds
is to change which rlimit counter is used to track a process when the
credentials changes.
Use the same test for both to guarantee the tracking is correct.
Cc: stable@vger.kernel.org
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Link: https://lkml.kernel.org/r/87v91us0w4.fsf_-_@disp2133
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
When enabling -Wunused warnings by building with W=1, I get an
instance of the -Wunused-but-set-parameter warning in the io_uring code:
fs/io_uring.c: In function 'io_queue_async_work':
fs/io_uring.c:1445:61: error: parameter 'locked' set but not used [-Werror=unused-but-set-parameter]
1445 | static void io_queue_async_work(struct io_kiocb *req, bool *locked)
| ~~~~~~^~~~~~
There are very few warnings of this type, so it would be nice to enable
this by default and fix all the existing instances. As the assignment
serves no purpose by itself other than to prevent developers from using
the variable, an easy workaround is to remove the assignment and just
rename the argument to "dont_use".
Fixes: f237c30a5610 ("io_uring: batch task work locking")
Link: https://lore.kernel.org/lkml/20210920121352.93063-1-arnd@kernel.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20211019153507.348480-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit f1a0a376ca0c ("sched/core: Initialize the idle task with
preemption disabled") removed the init_idle() call from
idle_thread_get(). This was the sole call-path on hotplug that resets
the Shadow Call Stack (scs) Stack Pointer (sp).
Not resetting the scs-sp leads to scs overflow after enough hotplug
cycles. Therefore add an explicit scs_task_reset() to the hotplug code
to make sure the scs-sp does get reset on hotplug.
Fixes: f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled")
Signed-off-by: Woody Lin <woodylin@google.com>
[peterz: Changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20211012083521.973587-1-woodylin@google.com
|
|
Add MicroLZMA support in order to maximize compression ratios for
specific scenarios. For example, it's useful for low-end embedded
boards and as a secondary algorithm in a file for specific access
patterns.
MicroLZMA is a new container format for raw LZMA1, which was created
by Lasse Collin aiming to minimize old LZMA headers and get rid of
unnecessary EOPM (end of payload marker) as well as to enable
fixed-sized output compression, especially for 4KiB pclusters.
Similar to LZ4, inplace I/O approach is used to minimize runtime
memory footprint when dealing with I/O. Overlapped decompression is
handled with 1) bounced buffer for data under processing or 2) extra
short-lived pages from the on-stack pagepool which will be shared in
the same read request (128KiB for example).
Link: https://lore.kernel.org/r/20211010213145.17462-8-xiang@kernel.org
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Previously, some LZ4 methods were named with `generic'. However, while
evaluating the effective LZMA approach, it seems they aren't quite
generic at all (e.g. no need preparing dstpages for most LZMA cases.)
Avoid such naming instead.
Link: https://lore.kernel.org/r/20211010213145.17462-7-xiang@kernel.org
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
uncompressible -> incompressible
non-splitted -> non-split
Link: https://lore.kernel.org/r/20211010213145.17462-6-xiang@kernel.org
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
MicroLZMA is a yet another header format variant where the first
byte of a raw LZMA stream (without the end of stream marker) has
been replaced with a bitwise-negation of the lc/lp/pb properties
byte. MicroLZMA was created to be used in EROFS but can be used
by other things too where wasting minimal amount of space for
headers is important.
This is implemented using most of the LZMA2 code as is so the
amount of new code is small. The API has a few extra features
compared to the XZ decoder. On the other hand, the API lacks
XZ_BUF_ERROR support which is important to take into account
when using this API.
MicroLZMA doesn't support BCJ filters. In theory they could be
added later as there are many unused/reserved values for the
first byte of the compressed stream but in practice it is
somewhat unlikely to happen due to a few implementation reasons.
Link: https://lore.kernel.org/r/20211010213145.17462-5-xiang@kernel.org
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
It's a more logical place even if the resetting needs to be done
only once per LZMA2 stream (if lzma_reset() called in the middle
of an LZMA2 stream, .len will already be 0).
Link: https://lore.kernel.org/r/20211010213145.17462-4-xiang@kernel.org
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
This might matter, for example, if the underlying type of enum xz_check
was a signed char. In such a case the validation wouldn't have caught an
unsupported header. I don't know if this problem can occur in the kernel
on any arch but it's still good to fix it because some people might copy
the XZ code to their own projects from Linux instead of the upstream
XZ Embedded repository.
This change may increase the code size by a few bytes. An alternative
would have been to use an unsigned int instead of enum xz_check but
using an enumeration looks cleaner.
Link: https://lore.kernel.org/r/20211010213145.17462-3-xiang@kernel.org
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
decompression
With valid files, the safety margin described in lib/decompress_unxz.c
ensures that these buffers cannot overlap. But if the uncompressed size
of the input is larger than the caller thought, which is possible when
the input file is invalid/corrupt, the buffers can overlap. Obviously
the result will then be garbage (and usually the decoder will return
an error too) but no other harm will happen when such an over-run occurs.
This change only affects uncompressed LZMA2 chunks and so this
should have no effect on performance.
Link: https://lore.kernel.org/r/20211010213145.17462-2-xiang@kernel.org
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Previously, the readahead window was strictly followed by EROFS
decompression strategy in order to minimize extra memory footprint.
However, it could become inefficient if just reading the partial
requested data for much big LZ4 pclusters and the upcoming LZMA
implementation.
Let's try to request the leading data in a pcluster without
triggering memory reclaiming instead for the LZ4 approach first
to boost up 100% randread of large big pclusters, and it has no real
impact on low memory scenarios.
It also introduces a way to expand read lengths in order to decompress
the whole pcluster, which is useful for LZMA since the algorithm
itself is relatively slow and causes CPU bound, but LZ4 is not.
Link: https://lore.kernel.org/r/20211008200839.24541-4-xiang@kernel.org
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Previously, for each HEAD lcluster, it can be either HEAD or PLAIN
lcluster to indicate whether the whole pcluster is compressed or not.
In this patch, a new HEAD2 head type is introduced to specify another
compression algorithm other than the primary algorithm for each
compressed file, which can be used for upcoming LZMA compression and
LZ4 range dictionary compression for various data patterns.
It has been stayed in the EROFS roadmap for years. Complete it now!
Link: https://lore.kernel.org/r/20211017165721.2442-1-xiang@kernel.org
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Merge misc fixes from Andrew Morton:
"19 patches.
Subsystems affected by this patch series: mm (userfaultfd, migration,
memblock, mempolicy, slub, secretmem, and thp), ocfs2, binfmt, vfs,
and misc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mailmap: add Andrej Shadura
mm/thp: decrease nr_thps in file's mapping on THP split
mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem()
vfs: check fd has read access in kernel_read_file_from_fd()
elfcore: correct reference to CONFIG_UML
mm, slub: fix incorrect memcg slab count for bulk free
mm, slub: fix potential use-after-free in slab_debugfs_fops
mm, slub: fix potential memoryleak in kmem_cache_open()
mm, slub: fix mismatch between reconstructed freelist depth and cnt
mm, slub: fix two bugs in slab_debug_trace_open()
mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind()
memblock: check memory total_size
ocfs2: mount fails with buffer overflow in strlen
ocfs2: fix data corruption after conversion from inline format
mm/migrate: fix CPUHP state to update node demotion order
mm/migrate: add CPU hotplug to demotion #ifdef
mm/migrate: optimize hotplug-time demotion order updates
userfaultfd: fix a race between writeprotect and exit_mmap()
mm/userfaultfd: selftests: fix memory corruption with thp enabled
|
|
After the commit 617f3ef95177 ("locking/rwsem: Remove reader
optimistic spinning"), reader doesn't support optimistic spinning
anymore, there is no need meet the condition which OSQ is empty.
BTW, add an unlikely() for the max reader wakeup check in the loop.
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211013134154.1085649-4-yanfei.xu@windriver.com
|
|
preempt_disable/enable() is equal to RCU read-side crital section, and
the spinning codes in mutex and rwsem could ensure that the preemption
is disabled. So let's remove the unnecessary rcu_read_lock/unlock for
saving some cycles in hot codes.
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211013134154.1085649-2-yanfei.xu@windriver.com
|
|
The spinning region rwsem_spin_on_owner() should not be preempted,
however the rwsem_down_write_slowpath() invokes it and don't disable
preemption. Fix it by adding a pair of preempt_disable/enable().
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
[peterz: Fix CONFIG_RWSEM_SPIN_ON_OWNER=n build]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211013134154.1085649-3-yanfei.xu@windriver.com
|
|
Since the futex code was restructured, there's no futex.c file anymore
and the implementation is split in various files. Point kernel-doc
references to the new files.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211012135549.14451-1-andrealmeid@collabora.com
|
|
Mike reported that rcuwait went walk-about and is causing failures on
the PREEMPT_RT builds, restore it.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Fix the data source encodings to represent L2.1/L3.1(another core's
L2/L3 on the same node) accesses properly for power10 and older
plaforms.
Add new macros(LEVEL/REM) which can be used to add mem_lvl_num and remote
field data inside perf_mem_data_src structure.
Result in power9 system with patch changes:
localhost:~/linux/tools/perf # ./perf mem report | grep Remote
0.01% 1 252 Remote core, same node L3 or L3 hit [.] 0x0000000000002dd0 producer_consumer [.] 0x00007fff7f25eb90
anon HitM N/A No N/A 0 0
0.01% 1 220 Remote core, same node L3 or L3 hit [.] 0x0000000000002dd0 producer_consumer [.] 0x00007fff77776d90
anon HitM N/A No N/A 0 0
0.01% 1 220 Remote core, same node L3 or L3 hit [.] 0x0000000000002dd0 producer_consumer [.] 0x00007fff817d9410
anon HitM N/A No N/A 0 0
Fixes: 79e96f8f930d ("powerpc/perf: Export memory hierarchy info to user space")
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20211006140654.298352-5-kjain@linux.ibm.com
|
|
Going forward, future generation systems can have more hierarchy
within the node/package level but currently we don't have any data source
encoding field in perf, which can be used to represent this level of data.
Add a new field called 'mem_hops' in the perf_mem_data_src structure
which can be used to represent intra-node/package or inter-node/off-package
details. This field is of size 3 bits where PERF_MEM_HOPS_{NA, 0..6} value
can be used to present different hop levels data.
Also add corresponding macros to define mem_hop field values
and shift value.
Currently we define macro for HOPS_0 which corresponds
to data coming from another core but same node.
Add functionality to represent mem_hop field data in
perf_mem__lvl_scnprintf function with the help of added string
array called mem_hops.
For ex: Encodings for mem_hops fields with L2 cache:
L2 - local L2
L2 | REMOTE | HOPS_0 - remote core, same node L2
Since with the addition of HOPS field, now remote can be used to
denote cache access from the same node but different core, a check
is added in the c2c_decode_stats function to set mrem only when HOPS
is zero along with set remote field.
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20211006140654.298352-4-kjain@linux.ibm.com
|