Age | Commit message (Collapse) | Author |
|
Our phy does not support 1000M/half, this patch removes 1000M/half from
PHY_SUPPORTED_FEATURES.
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
coalesce param updates every 100 napi times, it may update a little
late if ping test after a high rate flow, may over napi poll is called
100 times as ping test sends packets every second.
This patch updates coalesce param every second, instead with every
100 napi times. It can not update the param 100% in time, but the
lag time is very short.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
hns3_nic_uninit_vector_data()
In the hns3_nic_uninit_vector_data(), the procedure of uninitializing
the tqp_vector's IRQ has not set affinity_notify to NULL and changes
its init flag. This patch fixes it. And for simplificaton, local
variable tqp_vector is used instead of priv->tqp_vector[i].
Fixes: 424eb834a9be ("net: hns3: Unified HNS3 {VF|PF} Ethernet Driver for hip08 SoC")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When doing reset, it is unnecessary to get the hardware's default
configuration again, otherwise, the user's configuration will be
overwritten.
Fixes: 4ed340ab8f49 ("net: hns3: Add reset process in hclge_main")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When hclge_reset() completes successfully, it should update the
last_reset_time, set reset_fail_cnt to 0, and set reset_type of
hnae3_ae_dev to HNAE3_NONE_RESET.
Also when hclgevf_reset() completes successfully, it should update
the last_reset_time, and set reset_type of hnae3_ae_dev to
HNAE3_NONE_RESET.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While doing DOWN, the calling of napi_disable() may not return, since the
napi_complete() in the hns3_nic_common_poll() will never be called when
HNS3_NIC_STATE_DOWN is set. So we need to call napi_complete() before
checking HNS3_NIC_STETE_DOWN.
Fixes: ff0699e04b97 ("net: hns3: stop napi polling when HNS3_NIC_STATE_DOWN is set")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the hclgevf_pci_reset(), it only uninitialize and initialize
the msi, so if the initialization fails, hclgevf_uninit_hdev()
does not need to uninitialize the msi, but needs to uninitialize
the pci, otherwise it will cause pci resource not free.
Fixes: 862d969a3a4d ("net: hns3: do VF's pci re-initialization while PF doing FLR")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When hns3_get_vector_ring_chain() failed in the
hns3_nic_init_vector_data(), it should do the error handling instead
of return directly.
Also, cur_chain should be freed instead of chain and head->next should
be set to NULL in error handling of hns3_get_vector_ring_chain.
This patch fixes them.
Fixes: 73b907a083b8 ("net: hns3: bugfix for buffer not free problem during resetting")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Update DM to set the bdi's io_pages. This fixes reads to be capped at
the device's max request size (even if user's read IO exceeds the
established readahead setting).
Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
Cc: stable@vger.kernel.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Now that the handlers do not process their own udata we can make a
sensible ioctl that wrappers them. The ioctl follows the same format as
the write_ex() and has the user explicitly specify the core and driver
in/out opaque structures and a command number.
This works for all forms of write commands.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Core changes:
- Parse the 4BAIT SFDP section
- Add a bunch of SPI NOR entries to the flash_info table
- Add the concept of SFDP fixups and use it to fix a bug on MX25L25635F
- A bunch of minor cleanups/comestic changes
|
|
NAND core changes:
- kernel-doc miscellaneous fixes.
- Third batch of fixes/cleanup to the raw NAND core impacting various
controller drivers (ams-delta, marvell, fsmc, denali, tegra, vf610):
* Stopping to pass mtd_info objects to internal functions
* Reorganizing code to avoid forward declarations
* Dropping useless test in nand_legacy_set_defaults()
* Moving nand_exec_op() to internal.h
* Adding nand_[de]select_target() helpers
* Passing the CS line to be selected in struct nand_operation
* Making ->select_chip() optional when ->exec_op() is implemented
* Deprecating the ->select_chip() hook
* Moving the ->exec_op() method to nand_controller_ops
* Moving ->setup_data_interface() to nand_controller_ops
* Deprecating the dummy_controller field
* Fixing JEDEC detection
* Providing a helper for polling GPIO R/B pin
Raw NAND chip drivers changes:
- Macronix:
* Flagging 1.8V AC chips with a broken GET_FEATURES(TIMINGS)
Raw NAND controllers drivers changes:
- Ams-delta:
* Fixing the error path
* SPDX tag added
* May be compiled with COMPILE_TEST=y
* Conversion to ->exec_op() interface
* Dropping .IOADDR_R/W use
* Use GPIO API for data I/O
- Denali:
* Removing denali_reset_banks()
* Removing ->dev_ready() hook
* Including <linux/bits.h> instead of <linux/bitops.h>
* Changes to comply with the above fixes/cleanup done in the core.
- FSMC:
* Adding an SPDX tag to replace the license text
* Making conversion from chip to fsmc consistent
* Fixing unchecked return value in fsmc_read_page_hwecc
* Changes to comply with the above fixes/cleanup done in the core.
- Marvell:
* Preventing timeouts on a loaded machine (fix)
* Changes to comply with the above fixes/cleanup done in the core.
- OMAP2:
* Pass the parent of pdev to dma_request_chan() (fix)
- R852:
* Use generic DMA API
- sh_flctl:
* Converting to SPDX identifiers
- Sunxi:
* Write pageprog related opcodes to the right register: WCMD_SET (fix)
- Tegra:
* Stop implementing ->select_chip()
- VF610:
* Adding an SPDX tag to replace the license text
* Changes to comply with the above fixes/cleanup done in the core.
- Various trivial/spelling/coding style fixes.
SPI-NAND drivers changes:
- Removing the depreacated mt29f_spinand driver from staging.
- Adding support for:
* Toshiba TC58CVG2S0H
* GigaDevice GD5FxGQ4xA
* Winbond W25N01GV
|
|
Sending a check/repair message infrequently leads to -EBUSY instead of
properly identifying an active resync. This occurs because
raid_message() is testing recovery bits in a racy way.
Fix by calling decipher_sync_action() from raid_message() to properly
identify the idle state of the RAID device.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Three fixes: The t10-pi one is a regression from the 4.19 release, the
qla2xxx one is a 4.20 merge window regression and the bnx2fc is a very
old bug"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: t10-pi: Return correct ref tag when queue has no integrity profile
scsi: bnx2fc: Fix NULL dereference in error handling
Revert "scsi: qla2xxx: Fix NVMe Target discovery"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip updates from Marc Zyngier:
- A bunch of new irqchip drivers (RDA8810PL, Madera, imx-irqsteer)
- Updates for new (and old) platforms (i.MX8MQ, F1C100s)
- A number of SPDX cleanups
- A workaround for a very broken GICv3 implementation
- A platform-msi fix
- Various cleanups
|
|
The pointer was NULLed before freeing the memory, resulting in a memory
leak. Trace from kmemleak:
unreferenced object 0xffff88820ae36528 (size 512):
comm "devlink", pid 5374, jiffies 4295354033 (age 10829.296s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000a43f5195>] kmem_cache_alloc_trace+0x1be/0x330
[<00000000312f8140>] mlxsw_sp_nve_init+0xcb/0x1ae0
[<0000000009201d22>] mlxsw_sp_init+0x1382/0x2690
[<000000007227d877>] mlxsw_sp1_init+0x1b5/0x260
[<000000004a16feec>] __mlxsw_core_bus_device_register+0x776/0x1360
[<0000000070ab954c>] mlxsw_devlink_core_bus_device_reload+0x129/0x220
[<00000000432313d5>] devlink_nl_cmd_reload+0x119/0x1e0
[<000000003821a06b>] genl_family_rcv_msg+0x813/0x1150
[<00000000d54d04c0>] genl_rcv_msg+0xd1/0x180
[<0000000040543d12>] netlink_rcv_skb+0x152/0x3c0
[<00000000efc4eae8>] genl_rcv+0x2d/0x40
[<00000000ea645603>] netlink_unicast+0x52f/0x740
[<00000000641fca1a>] netlink_sendmsg+0x9c7/0xf50
[<00000000fed4a4b8>] sock_sendmsg+0xbe/0x120
[<00000000d85795a9>] __sys_sendto+0x397/0x620
[<00000000c5f84622>] __x64_sys_sendto+0xe6/0x1a0
Fixes: 6e6030bd5412 ("mlxsw: spectrum_nve: Implement common NVE core")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After a packet was decapsulated it is classified to the relevant FID
based on its VNI and undergoes L2 forwarding.
Unlike regular (non-encapsulated) ARP packets, Spectrum does not trap
decapsulated ARP packets during L2 forwarding and instead can only trap
such packets in the underlay router during decapsulation.
Add this missing packet trap, which is required for VXLAN routing when
the MAC of the target host is not known.
Fixes: b02597d513a9 ("mlxsw: spectrum: Add NVE packet traps")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
During the firmware flash process, some of the EMADs get timed out, which
causes the driver to send them again with a limit of 5 retries. There are
some situations in which 5 retries is not enough and the EMAD access fails.
If the failed EMAD was related to the flashing process, the driver fails
the flashing.
The reason for these timeouts during firmware flashing is cache misses in
the CPU running the firmware. In case the CPU needs to fetch instructions
from the flash when a firmware is flashed, it needs to wait for the
flashing to complete. Since flashing takes time, it is possible for pending
EMADs to timeout.
Fix by increasing EMADs' timeout while flashing firmware.
Fixes: ce6ef68f433f ("mlxsw: spectrum: Implement the ethtool flash_device callback")
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use page directory based shared buffer implementation
now available as common code for Xen frontend drivers.
Remove flushing of shared buffer on page flip as this
workaround needs a proper fix.
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Noralf Trønnes <noralf@tronnes.org>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
based frontends. Currently the frontends which implement
similar code for sharing big buffers between frontend and
backend are para-virtualized DRM and sound drivers.
Both define the same way to share grant references of a
data buffer with the corresponding backend with little
differences.
Move shared code into a helper module, so there is a single
implementation of the same functionality for all.
This patch introduces code which is used by sound and display
frontend drivers without functional changes with the intention
to remove shared code from the corresponding drivers.
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
When passed with nr_poll_queues setup additional queues with cq polling
context IB_POLL_DIRECT (no interrupts) and make sure to set
QUEUE_FLAG_POLL on the connect_q. In addition add the third queue
mapping for polling queues.
nvmf connect on this queue is polled for like all other requests so make
nvmf_connect_io_queue poll for polling queues.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
This argument will specify how many polling I/O queues to connect when
creating the controller. These I/O queues will host I/O that is set with
REQ_HIPRI.
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Preparation for polling support for fabrics. Polling support
means that our completion queues are not generating any interrupts
which means we need to poll for the nvmf io queue connect as well.
Reviewed by Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Pass poll bool to indicate that we need it to poll. This prepares us for
polling support in nvmf since connect is an I/O that will be queued
and has to be polled in order to complete. If poll is passed,
we call nvme_execute_rq_polled which sends the requests and polls
for its completion.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
There is a spelling mistake in a dev_info message, fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
|
|
By duplicating the nvme_process_cq in both branches we keep the
sparse lock context checking happy, so do it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
|
|
The block layer now enables polling support on a queue if nr_maps
includes the poll map, so we should only set that if we actually
support poll queues.
Fixes: 6544d229bf ("block: enable polling by default if a poll map is initalized")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
|
|
This patch defines a new macro NVMET_NO_ERROR_LOC to represent the
default error location value in the nvme-error-log-page.
This is a pure cleanup patch and it does not change any functionality.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Currently the u16 req->error_loc is being compared to -1 which
will always be false. Fix this by casting -1 to u16 to fix this.
Detected by clang:
warning: result of comparison of constant -1 with expression of
type 'u16' (aka 'unsigned short') is always false
[-Wtautological-constant-out-of-range-compare]
Fixes: 76574f37bf4c ("nvmet: add interface to update error-log page")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
If a hwspinlock is defined in device tree use it to protect
configuration registers.
Do not request for hwspinlock during the exti driver init since the
hwspinlock driver is not probed yet at that stage and the exti driver
does not support deferred probe.
Instead of this, postpone the hwspinlock request at the first time the
hwspinlock is actually needed.
Use the hwspin_trylock_raw() API which is the most appropriated here
Indeed:
- hwspin_lock_() calls are under spin_lock protection (chip_data->rlock
or gc->lock).
- the _timeout() API relies on jiffies count which won't work if IRQs
are disabled which is the case here (a large part of the IRQ setup is
done atomically (see irq/manage.c))
As a consequence implement the retry/timeout lock from here. And since
all of this is done atomically, reduce the timeout delay to 1 ms.
Signed-off-by: Benjamin Gaignard <benjamin.gaignard@st.com>
Signed-off-by: Fabien Dessenne <fabien.dessenne@st.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
The irqsteer block is a interrupt multiplexer/remapper found on the
i.MX8 line of SoCs.
Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
Since the runtime PM support was added in musb, dsps relies on the timer
calling otg_timer() to activate the usb subsystem. However the driver
doesn't enable the timer for peripheral port, then the peripheral port is
unable to be enumerated by a host if the other usb port is disabled or in
peripheral mode too.
So let's start the timer for peripheral port too.
Fixes: ea2f35c01d5e ("usb: musb: Fix sleeping function called from invalid context for hdrc glue")
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Bin Liu <b-liu@ti.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Due to lack of ID pin interrupt event on AM335x devices, the musb dsps
driver uses polling to detect usb device attach for dual-role port.
But in the case if a micro-A cable adapter is attached without a USB device
attached to the cable, the musb state machine gets stuck in a_wait_vrise
state waiting for the MUSB_CONNECT interrupt which won't happen due to the
usb device is not attached. The state is stuck in a_wait_vrise even after
the micro-A cable is detached, which could cause VBUS retention if then the
dual-role port is attached to a host port.
To fix the problem, make a_wait_vrise as a transient state, then move the
state to either a_wait_bcon for host port or a_idle state for dual-role
port, if no usb device is attached to the port.
Signed-off-by: Bin Liu <b-liu@ti.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The Cirrus Logic Madera codecs (Cirrus Logic CS47L35/85/90/91 and WM1840)
are highly complex devices containing up to 7 programmable DSPs and many
other internal sources of interrupts plus a number of GPIOs that can be
used as interrupt inputs. The large number (>150) of internal interrupt
sources are managed by an on-board interrupt controller.
This driver provides the handling for the interrupt controller. As the
codec is accessed via regmap, we can make use of the generic IRQ
functionality from regmap to do most of the work. Only around half of
the possible interrupt source are currently of interest from the driver
so only this subset is defined. Others can be added in future if needed.
The KConfig options are not user-configurable because this driver is
mandatory so is automatically included when the parent MFD driver is
selected.
Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com>
Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
When commit 6a23e05c2fe3c6 ("dm: remove legacy request-based IO path")
removed some q->mq_ops branching from map_request() it left in place a
goto that was only needed if that branching (and conditional 'r'
assignment) existed. Now that the branching is gone map_request()'s
goto can be removed too.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Log the hash algorithm's driver name when a dm-verity target is created.
This will help people determine whether the expected implementation is
being used. It can make an enormous difference; e.g., SHA-256 on ARM
can be 8x faster with the crypto extensions than without. It can also
be useful to know if an implementation using an external crypto
accelerator is being used instead of a software implementation.
Example message:
[ 35.281945] device-mapper: verity: sha256 using implementation "sha256-ce"
We've already found the similar message in fs/crypto/keyinfo.c to be
very useful.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Log the encryption algorithm's driver name when a dm-crypt target is
created. This will help people determine whether the expected
implementation is being used. In some cases we've seen people do
benchmarks and reject using encryption for performance reasons, when in
fact they used a much slower implementation than was possible on the
hardware. It can make an enormous difference; e.g., AES-XTS on ARM can
be over 10x faster with the crypto extensions than without. It can also
be useful to know if an implementation using an external crypto
accelerator is being used instead of a software implementation.
Example message:
[ 29.307629] device-mapper: crypt: xts(aes) using implementation "xts-aes-ce"
We've already found the similar message in fs/crypto/keyinfo.c to be
very useful.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Rename the workqueue from dm-intergrity-recalc to dm-integrity-recalc.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The flakey target is documented to be able to corrupt the Nth byte in
a bio, but does not corrupt byte indices after the first biovec in the
bio. Change the corrupting function to actually corrupt the Nth byte
no matter in which biovec that index falls.
A test device generating two-page bios, atop a flakey device configured
to corrupt a byte index on the second page, verified both the failure
to corrupt before this patch and the expected corruption after this
change.
Signed-off-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Reference to a device in device-mapper table contains offset in sectors.
If the sector_t is 32bit integer (CONFIG_LBDAF is not set), then
several device-mapper targets can overflow this offset and validity
check is then performed on a wrong offset and a wrong table is activated.
See for example (on 32bit without CONFIG_LBDAF) this overflow:
# dmsetup create test --table "0 2048 linear /dev/sdg 4294967297"
# dmsetup table test
0 2048 linear 8:96 1
This patch adds explicit check for overflow if the offset is sector_t type.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The iv_offset in the mapping table of crypt target is a 64bit number when
IV algorithm is plain64, plain64be, essiv or benbi. It will be assigned to
iv_offset of struct crypt_config, cc_sector of struct convert_context and
iv_sector of struct dm_crypt_request. These structures members are defined
as a sector_t. But sector_t is 32bit when CONFIG_LBDAF is not set in 32bit
kernel. In this situation sector_t is not big enough to store the 64bit
iv_offset.
Here is a reproducer.
Prepare test image and device (loop is automatically allocated by cryptsetup):
# dd if=/dev/zero of=tst.img bs=1M count=1
# echo "tst"|cryptsetup open --type plain -c aes-xts-plain64 \
--skip 500000000000000000 tst.img test
On 32bit system (use IV offset value that overflows to 64bit; CONFIG_LBDAF if off)
and device checksum is wrong:
# dmsetup table test --showkeys
0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 3551657984 7:0 0
# sha256sum /dev/mapper/test
533e25c09176632b3794f35303488c4a8f3f965dffffa6ec2df347c168cb6c19 /dev/mapper/test
On 64bit system (and on 32bit system with the patch), table and checksum is now correct:
# dmsetup table test --showkeys
0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 500000000000000000 7:0 0
# sha256sum /dev/mapper/test
5d16160f9d5f8c33d8051e65fdb4f003cc31cd652b5abb08f03aa6fce0df75fc /dev/mapper/test
Signed-off-by: AliOS system security <alios_sys_security@linux.alibaba.com>
Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
When using kcopyd to run callbacks through dm_kcopyd_do_callback() or
submitting copy jobs with a source size of 0, the jobs are pushed
directly to the complete_jobs list, which could be under processing by
the kcopyd thread. As a result, the kcopyd thread can continue running
completed jobs indefinitely, without releasing the CPU, as long as
someone keeps submitting new completed jobs through the aforementioned
paths. Processing of work items, queued for execution on the same CPU as
the currently running kcopyd thread, is thus stalled for excessive
amounts of time, hurting performance.
Running the following test, from the device mapper test suite [1],
dmtest run --suite snapshot -n parallel_io_to_many_snaps_N
, with 8 active snapshots, we get, in dmesg, messages like the
following:
[68899.948523] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 95s!
[68899.949282] Showing busy workqueues and worker pools:
[68899.949288] workqueue events: flags=0x0
[68899.949295] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[68899.949306] pending: vmstat_shepherd, cache_reap
[68899.949331] workqueue mm_percpu_wq: flags=0x8
[68899.949337] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949345] pending: vmstat_update
[68899.949387] workqueue dm_bufio_cache: flags=0x8
[68899.949392] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[68899.949400] pending: work_fn [dm_bufio]
[68899.949423] workqueue kcopyd: flags=0x8
[68899.949429] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949437] pending: do_work [dm_mod]
[68899.949452] workqueue kcopyd: flags=0x8
[68899.949458] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[68899.949466] in-flight: 13:do_work [dm_mod]
[68899.949474] pending: do_work [dm_mod]
[68899.949487] workqueue kcopyd: flags=0x8
[68899.949493] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949501] pending: do_work [dm_mod]
[68899.949515] workqueue kcopyd: flags=0x8
[68899.949521] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949529] pending: do_work [dm_mod]
[68899.949541] workqueue kcopyd: flags=0x8
[68899.949547] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949555] pending: do_work [dm_mod]
[68899.949568] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=95s workers=4 idle: 27130 27223 1084
Fix this by splitting the complete_jobs list into two parts: A user
facing part, named callback_jobs, and one used internally by kcopyd,
retaining the name complete_jobs. dm_kcopyd_do_callback() and
dispatch_job() now push their jobs to the callback_jobs list, which is
spliced to the complete_jobs list once, every time the kcopyd thread
wakes up. This prevents kcopyd from hogging the CPU indefinitely and
causing workqueue stalls.
Re-running the aforementioned test:
* Workqueue stalls are eliminated
* The maximum writing time among all targets is reduced from 09m37.10s
to 06m04.85s and the total run time of the test is reduced from
10m43.591s to 7m19.199s
[1] https://github.com/jthornber/device-mapper-test-suite
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
kcopyd has no upper limit to the number of jobs one can allocate and
issue. Under certain workloads this can lead to excessive memory usage
and workqueue stalls. For example, when creating multiple dm-snapshot
targets with a 4K chunk size and then writing to the origin through the
page cache. Syncing the page cache causes a large number of BIOs to be
issued to the dm-snapshot origin target, which itself issues an even
larger (because of the BIO splitting taking place) number of kcopyd
jobs.
Running the following test, from the device mapper test suite [1],
dmtest run --suite snapshot -n many_snapshots_of_same_volume_N
, with 8 active snapshots, results in the kcopyd job slab cache growing
to 10G. Depending on the available system RAM this can lead to the OOM
killer killing user processes:
[463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
nodemask=(null), order=1, oom_score_adj=0
[463.492894] kthreadd cpuset=/ mems_allowed=0
[463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
[463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[463.492952] Call Trace:
[463.492964] dump_stack+0x7d/0xbb
[463.492973] dump_header+0x6b/0x2fc
[463.492987] ? lockdep_hardirqs_on+0xee/0x190
[463.493012] oom_kill_process+0x302/0x370
[463.493021] out_of_memory+0x113/0x560
[463.493030] __alloc_pages_slowpath+0xf40/0x1020
[463.493055] __alloc_pages_nodemask+0x348/0x3c0
[463.493067] cache_grow_begin+0x81/0x8b0
[463.493072] ? cache_grow_begin+0x874/0x8b0
[463.493078] fallback_alloc+0x1e4/0x280
[463.493092] kmem_cache_alloc_node+0xd6/0x370
[463.493098] ? copy_process.part.31+0x1c5/0x20d0
[463.493105] copy_process.part.31+0x1c5/0x20d0
[463.493115] ? __lock_acquire+0x3cc/0x1550
[463.493121] ? __switch_to_asm+0x34/0x70
[463.493129] ? kthread_create_worker_on_cpu+0x70/0x70
[463.493135] ? finish_task_switch+0x90/0x280
[463.493165] _do_fork+0xe0/0x6d0
[463.493191] ? kthreadd+0x19f/0x220
[463.493233] kernel_thread+0x25/0x30
[463.493235] kthreadd+0x1bf/0x220
[463.493242] ? kthread_create_on_cpu+0x90/0x90
[463.493248] ret_from_fork+0x3a/0x50
[463.493279] Mem-Info:
[463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
[463.493285] active_file:80216 inactive_file:80107 isolated_file:435
[463.493285] unevictable:0 dirty:51266 writeback:109372 unstable:0
[463.493285] slab_reclaimable:31191 slab_unreclaimable:3483521
[463.493285] mapped:526 shmem:4903 pagetables:1759 bounce:0
[463.493285] free:33623 free_pcp:2392 free_cma:0
...
[463.493489] Unreclaimable slab info:
[463.493513] Name Used Total
[463.493522] bio-6 1028KB 1028KB
[463.493525] bio-5 1028KB 1028KB
[463.493528] dm_snap_pending_exception 236783KB 243789KB
[463.493531] dm_exception 41KB 42KB
[463.493534] bio-4 1216KB 1216KB
[463.493537] bio-3 439396KB 439396KB
[463.493539] kcopyd_job 6973427KB 6973427KB
...
[463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
[463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
[463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Moreover, issuing a large number of kcopyd jobs results in kcopyd
hogging the CPU, while processing them. As a result, processing of work
items, queued for execution on the same CPU as the currently running
kcopyd thread, is stalled for long periods of time, hurting performance.
Running the aforementioned test we get, in dmesg, messages like the
following:
[67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
[67501.195586] Showing busy workqueues and worker pools:
[67501.195591] workqueue events: flags=0x0
[67501.195597] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195611] pending: cache_reap
[67501.195641] workqueue mm_percpu_wq: flags=0x8
[67501.195645] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195656] pending: vmstat_update
[67501.195682] workqueue kblockd: flags=0x18
[67501.195687] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[67501.195698] pending: blk_timeout_work
[67501.195753] workqueue kcopyd: flags=0x8
[67501.195757] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195768] pending: do_work [dm_mod]
[67501.195802] workqueue kcopyd: flags=0x8
[67501.195806] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195817] pending: do_work [dm_mod]
[67501.195834] workqueue kcopyd: flags=0x8
[67501.195838] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195848] pending: do_work [dm_mod]
[67501.195881] workqueue kcopyd: flags=0x8
[67501.195885] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195896] pending: do_work [dm_mod]
[67501.195920] workqueue kcopyd: flags=0x8
[67501.195924] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
[67501.195935] in-flight: 67:do_work [dm_mod]
[67501.195945] pending: do_work [dm_mod]
[67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765
The root cause for these issues is the way dm-snapshot uses kcopyd. In
particular, the lack of an explicit or implicit limit to the maximum
number of in-flight COW jobs. The merging path is not affected because
it implicitly limits the in-flight kcopyd jobs to one.
Fix these issues by using a semaphore to limit the maximum number of
in-flight kcopyd jobs. We grab the semaphore before allocating a new
kcopyd job in start_copy() and start_full_bio() and release it after the
job finishes in copy_callback().
The initial semaphore value is configurable through a module parameter,
to allow fine tuning the maximum number of in-flight COW jobs. Setting
this parameter to zero initializes the semaphore to INT_MAX.
A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
value was decided experimentally as a trade-off between memory
consumption, stalling the kernel's workqueues and maintaining a high
enough throughput.
Re-running the aforementioned test:
* Workqueue stalls are eliminated
* kcopyd's job slab cache uses a maximum of 130MB
* The time taken by the test to write to the snapshot-origin target is
reduced from 05m20.48s to 03m26.38s
[1] https://github.com/jthornber/device-mapper-test-suite
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
* Hashtable has been replaced by rbtree to manage buffers.
Update the comment.
* Fix typo in the comment for dm_bufio_issue_flush
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The error msg should be "flush thread" instead of "endio thread"
for writecache_flush_thread.
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
No need to be so fancy.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The workqueues are shared by many multipath devices, only flush whole
workqueue when necessary. Otherwise, we just flush works as needed.
Signed-off-by: wuzhouhui <wuzhouhui14@mails.ucas.ac.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|