summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-06-17Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs fixes from Al Viro: "MS_MOVE regression fix + breakage in fsmount(2) (also introduced in this cycle, along with fsmount(2) itself). I'm still digging through the piles of mail, so there might be more fixes to follow, but these two are obvious and self-contained, so there's no point delaying those..." * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs/namespace: fix unprivileged mount propagation vfs: fsmount: add missing mntget()
2019-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Lots of bug fixes here: 1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer. 2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John Crispin. 3) Use after free in psock backlog workqueue, from John Fastabend. 4) Fix source port matching in fdb peer flow rule of mlx5, from Raed Salem. 5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet. 6) Network header needs to be set for packet redirect in nfp, from John Hurley. 7) Fix udp zerocopy refcnt, from Willem de Bruijn. 8) Don't assume linear buffers in vxlan and geneve error handlers, from Stefano Brivio. 9) Fix TOS matching in mlxsw, from Jiri Pirko. 10) More SCTP cookie memory leak fixes, from Neil Horman. 11) Fix VLAN filtering in rtl8366, from Linus Walluij. 12) Various TCP SACK payload size and fragmentation memory limit fixes from Eric Dumazet. 13) Use after free in pneigh_get_next(), also from Eric Dumazet. 14) LAPB control block leak fix from Jeremy Sowden" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits) lapb: fixed leak of control-blocks. tipc: purge deferredq list for each grp member in tipc_group_delete ax25: fix inconsistent lock state in ax25_destroy_timer neigh: fix use-after-free read in pneigh_get_next tcp: fix compile error if !CONFIG_SYSCTL hv_sock: Suppress bogus "may be used uninitialized" warnings be2net: Fix number of Rx queues used for flow hashing net: handle 802.1P vlan 0 packets properly tcp: enforce tcp_min_snd_mss in tcp_mtu_probing() tcp: add tcp_min_snd_mss sysctl tcp: tcp_fragment() should apply sane memory limits tcp: limit payload size of sacked skbs Revert "net: phylink: set the autoneg state in phylink_phy_change" bpf: fix nested bpf tracepoints with per-cpu data bpf: Fix out of bounds memory access in bpf_sk_storage vsock/virtio: set SOCK_DONE on peer shutdown net: dsa: rtl8366: Fix up VLAN filtering net: phylink: set the autoneg state in phylink_phy_change net: add high_order_alloc_disable sysctl/static key tcp: add tcp_tx_skb_cache sysctl ...
2019-06-17fs/namespace: fix unprivileged mount propagationChristian Brauner
When propagating mounts across mount namespaces owned by different user namespaces it is not possible anymore to move or umount the mount in the less privileged mount namespace. Here is a reproducer: sudo mount -t tmpfs tmpfs /mnt sudo --make-rshared /mnt # create unprivileged user + mount namespace and preserve propagation unshare -U -m --map-root --propagation=unchanged # now change back to the original mount namespace in another terminal: sudo mkdir /mnt/aaa sudo mount -t tmpfs tmpfs /mnt/aaa # now in the unprivileged user + mount namespace mount --move /mnt/aaa /opt Unfortunately, this is a pretty big deal for userspace since this is e.g. used to inject mounts into running unprivileged containers. So this regression really needs to go away rather quickly. The problem is that a recent change falsely locked the root of the newly added mounts by setting MNT_LOCKED. Fix this by only locking the mounts on copy_mnt_ns() and not when adding a new mount. Fixes: 3bd045cc9c4b ("separate copying and locking mount tree on cross-userns copies") Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Tested-by: Christian Brauner <christian@brauner.io> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Christian Brauner <christian@brauner.io> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-06-17vfs: fsmount: add missing mntget()Eric Biggers
sys_fsmount() needs to take a reference to the new mount when adding it to the anonymous mount namespace. Otherwise the filesystem can be unmounted while it's still in use, as found by syzkaller. Reported-by: Mark Rutland <mark.rutland@arm.com> Reported-by: syzbot+99de05d099a170867f22@syzkaller.appspotmail.com Reported-by: syzbot+7008b8b8ba7df475fdc8@syzkaller.appspotmail.com Fixes: 93766fbd2696 ("vfs: syscall: Add fsmount() to create a mount for a superblock") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-06-17cifs: fix GlobalMid_Lock bug in cifs_reconnectRonnie Sahlberg
We can not hold the GlobalMid_Lock spinlock during the dfs processing in cifs_reconnect since it invokes things that may sleep and thus trigger : BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:23 Thus we need to drop the spinlock during this code block. RHBZ: 1716743 Cc: stable@vger.kernel.org Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-06-17SMB3: retry on STATUS_INSUFFICIENT_RESOURCES instead of failing writeSteve French
Some servers such as Windows 10 will return STATUS_INSUFFICIENT_RESOURCES as the number of simultaneous SMB3 requests grows (even though the client has sufficient credits). Return EAGAIN on STATUS_INSUFFICIENT_RESOURCES so that we can retry writes which fail with this status code. This (for example) fixes large file copies to Windows 10 on fast networks. Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-06-17Merge branch 'erofs_fix' into staging-linusGreg Kroah-Hartman
A branch is needed here to get the fix into staging-next as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17staging: erofs: add requirements field in superblockGao Xiang
There are some backward incompatible features pending for months, mainly due to on-disk format expensions. However, we should ensure that it cannot be mounted with old kernels. Otherwise, it will causes unexpected behaviors. Fixes: ba2b77a82022 ("staging: erofs: add super block operations") Cc: <stable@vger.kernel.org> # 4.19+ Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17Merge tag 'iio-fixes-for-5.2b' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus Jonathan writes: Second set of IIO fixes for the 5.2 cycle. * ad7150 - sense of bit for controlling adaptive vs fixed threshold was flipped. * adt7316 - Fix a build issue due to wrong headers for gpio usage. * lsm6dsx - correctly suspend / resume i2c slaves when the host goes to sleep. * mlx90632 - relax a compatability check to allow for newer devices. Also one counters fix * counter/ftm-quaddec - missing dependencies in Kconfig. * tag 'iio-fixes-for-5.2b' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio: counter/ftm-quaddec: Add missing dependencies in Kconfig staging: iio: adt7316: Fix build errors when GPIOLIB is not set iio: temperature: mlx90632 Relax the compatibility check iio: imu: st_lsm6dsx: fix PM support for st_lsm6dsx i2c controller staging:iio:ad7150: fix threshold mode config bit
2019-06-17integrity: Fix __integrity_init_keyring() section mismatchGeert Uytterhoeven
With gcc-4.6.3: WARNING: vmlinux.o(.text.unlikely+0x24c64): Section mismatch in reference from the function __integrity_init_keyring() to the function .init.text:set_platform_trusted_keys() The function __integrity_init_keyring() references the function __init set_platform_trusted_keys(). This is often because __integrity_init_keyring lacks a __init annotation or the annotation of set_platform_trusted_keys is wrong. Indeed, if the compiler decides not to inline __integrity_init_keyring(), a warning is issued. Fix this by adding the missing __init annotation. Fixes: 9dc92c45177ab70e ("integrity: Define a trusted platform keyring") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Nayna Jain <nayna@linux.ibm.com> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-06-17Merge branch 'tcp-fixes'David S. Miller
Eric Dumazet says: ==================== tcp: make sack processing more robust Jonathan Looney brought to our attention multiple problems in TCP stack at the sender side. SACK processing can be abused by malicious peers to either cause overflows, or increase of memory usage. First two patches fix the immediate problems. Since the malicious peers abuse senders by advertizing a very small MSS in their SYN or SYNACK packet, the last two patches add a new sysctl so that admins can chose a higher limit for MSS clamping. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17arm64: ssbd: explicitly depend on <linux/prctl.h>Anisse Astier
Fix ssbd.c which depends implicitly on asm/ptrace.h including linux/prctl.h (through for example linux/compat.h, then linux/time.h, linux/seqlock.h, linux/spinlock.h and linux/irqflags.h), and uses PR_SPEC* defines. This is an issue since we'll soon be removing the include from asm/ptrace.h. Fixes: 9cdc0108baa8 ("arm64: ssbd: Add prctl interface for per-thread mitigation") Cc: stable@vger.kernel.org Signed-off-by: Anisse Astier <aastier@freebox.fr> Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-06-17Merge tag 'riscv-for-v5.2/fixes-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Paul Walmsley: "This contains fixes, defconfig, and DT data changes for the v5.2-rc series. The fixes are relatively straightforward: - Addition of a TLB fence in the vmalloc_fault path, so the CPU doesn't enter an infinite page fault loop - Readdition of the pm_power_off export, so device drivers that reassign it can now be built as modules - A udelay() fix for RV32, fixing a miscomputation of the delay time - Removal of deprecated smp_mb__*() barriers This also adds initial DT data infrastructure for arch/riscv, along with initial data for the SiFive FU540-C000 SoC and the corresponding HiFive Unleashed board. We also update the RV64 defconfig to include some core drivers for the FU540 in the build" * tag 'riscv-for-v5.2/fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: remove unused barrier defines riscv: mm: synchronize MMU after pte change riscv: dts: add initial board data for the SiFive HiFive Unleashed riscv: dts: add initial support for the SiFive FU540-C000 SoC dt-bindings: riscv: convert cpu binding to json-schema dt-bindings: riscv: sifive: add YAML documentation for the SiFive FU540 arch: riscv: add support for building DTB files from DT source data riscv: Fix udelay in RV32. riscv: export pm_power_off again RISC-V: defconfig: enable clocks, serial console
2019-06-17block: fix page leak when merging to same pageChristoph Hellwig
When multiple iovecs reference the same page, each get_user_page call will add a reference to the page. But once we've created the bio that information gets lost and only a single reference will be dropped after I/O completion. Use the same_page information returned from __bio_try_merge_page to drop additional references to pages that were already present in the bio. Based on a patch from Ming Lei. Link: https://lkml.org/lkml/2019/4/23/64 Fixes: 576ed913 ("block: use bio_add_page in bio_iov_iter_get_pages") Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-17block: return from __bio_try_merge_page if merging occured in the same pageChristoph Hellwig
We currently have an input same_page parameter to __bio_try_merge_page to prohibit merging in the same page. The rationale for that is that some callers need to account for every page added to a bio. Instead of letting these callers call twice into the merge code to account for the new vs existing page cases, just turn the paramter into an output one that returns if a merge in the same page occured and let them act accordingly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-17Btrfs: fix failure to persist compression property xattr deletion on fsyncFilipe Manana
After the recent series of cleanups in the properties and xattrs modules that landed in the 5.2 merge window, we ended up with a regression where after deleting the compression xattr property through the setflags ioctl, we don't set the BTRFS_INODE_COPY_EVERYTHING flag in the inode anymore. As a consequence, if the inode was fsync'ed when it had the compression property set, after deleting the compression property through the setflags ioctl and fsync'ing again the inode, the log will still contain the compression xattr, because the inode did not had that bit set, which made the fsync not delete all xattrs from the log and copy all xattrs from the subvolume tree to the log tree. This regression happens due to the fact that that series of cleanups made btrfs_set_prop() call the old function do_setxattr() (which is now named btrfs_setxattr()), and not the old version of btrfs_setxattr(), which is now called btrfs_setxattr_trans(). Fix this by setting the BTRFS_INODE_COPY_EVERYTHING bit in the current btrfs_setxattr() function and remove it from everywhere else, including its setup at btrfs_ioctl_setflags(). This is cleaner, avoids similar regressions in the future, and centralizes the setup of the bit. After all, the need to setup this bit should only be in the xattrs module, since it is an implementation of xattrs. Fixes: 04e6863b19c722 ("btrfs: split btrfs_setxattr calls regarding transaction") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-06-17clk: Do a DT parent lookup even when index < 0Stephen Boyd
We want to allow the parent lookup to happen even if the index is some value less than 0. This may be the case if a clk provider only specifies the .name member to match a string in the "clock-names" DT property. We shouldn't require that the index be >= 0 to make this use case work. Fixes: 601b6e93304a ("clk: Allow parents to be specified via clkspec index") Reported-by: Alexandre Mergnat <amergnat@baylibre.com> Cc: Jerome Brunet <jbrunet@baylibre.com> Cc: Chen-Yu Tsai <wens@csie.org> Reviewed-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2019-06-17riscv: remove unused barrier definesRolf Eike Beer
They were introduced in commit fab957c11efe ("RISC-V: Atomic and Locking Code") long after commit 2e39465abc4b ("locking: Remove deprecated smp_mb__() barriers") removed the remnants of all previous instances from the tree. Signed-off-by: Rolf Eike Beer <eb@emlix.com> [paul.walmsley@sifive.com: stripped spurious mbox header from patch description; fixed commit references in patch header] Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2019-06-17usb: chipidea: udc: workaround for endpoint conflict issuePeter Chen
An endpoint conflict occurs when the USB is working in device mode during an isochronous communication. When the endpointA IN direction is an isochronous IN endpoint, and the host sends an IN token to endpointA on another device, then the OUT transaction may be missed regardless the OUT endpoint number. Generally, this occurs when the device is connected to the host through a hub and other devices are connected to the same hub. The affected OUT endpoint can be either control, bulk, isochronous, or an interrupt endpoint. After the OUT endpoint is primed, if an IN token to the same endpoint number on another device is received, then the OUT endpoint may be unprimed (cannot be detected by software), which causes this endpoint to no longer respond to the host OUT token, and thus, no corresponding interrupt occurs. There is no good workaround for this issue, the only thing the software could do is numbering isochronous IN from the highest endpoint since we have observed most of device number endpoint from the lowest. Cc: <stable@vger.kernel.org> #v3.14+ Cc: Fabio Estevam <festevam@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Cc: Jun Li <jun.li@nxp.com> Signed-off-by: Peter Chen <peter.chen@nxp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17MAINTAINERS: Change QCOM repo locationAndy Gross
This patch updates the Qualcomm SoC repo to a new location. Signed-off-by: Andy Gross <agross@kernel.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Olof Johansson <olof@lixom.net>
2019-06-17s390/cio: Combine direct and indirect CCW pathsEric Farman
With both the direct-addressed and indirect-addressed CCW paths simplified to this point, the amount of shared code between them is (hopefully) more easily visible. Move the processing of IDA-specific bits into the direct-addressed path, and add some useful commentary of what the individual pieces are doing. This allows us to remove the entire ccwchain_fetch_idal() routine and maintain a single function for any non-TIC CCW. Signed-off-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20190606202831.44135-10-farman@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2019-06-17vfio-ccw: Rearrange IDAL allocation in direct CCWEric Farman
This is purely deck furniture, to help understand the merge of the direct and indirect handlers. Signed-off-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20190606202831.44135-9-farman@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2019-06-17vfio-ccw: Remove pfn_array_tableEric Farman
Now that both CCW codepaths build this nested array: ccwchain->pfn_array_table[1]->pfn_array[#idaws/#pages] We can collapse this into simply: ccwchain->pfn_array[#idaws/#pages] Let's do that, so that we don't have to continually navigate two nested arrays when the first array always has a count of one. Signed-off-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20190606202831.44135-8-farman@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2019-06-17vfio-ccw: Adjust the first IDAW outside of the nested loopsEric Farman
Now that pfn_array_table[] is always an array of 1, it seems silly to check for the very first entry in an array in the middle of two nested loops, since we know it'll only ever happen once. Let's move this outside the loops to simplify things, even though the "k" variable is still necessary. Signed-off-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20190606202831.44135-7-farman@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2019-06-17vfio-ccw: Rearrange pfn_array and pfn_array_table arraysEric Farman
While processing a channel program, we currently have two nested arrays that carry a slightly different structure. The direct CCW path creates this: ccwchain->pfn_array_table[1]->pfn_array[#pages] while an IDA CCW creates: ccwchain->pfn_array_table[#idaws]->pfn_array[1] The distinction appears to state that each pfn_array_table entry points to an array of contiguous pages, represented by a pfn_array, um, array. Since the direct-addressed scenario can ONLY represent contiguous pages, it makes the intermediate array necessary but difficult to recognize. Meanwhile, since an IDAL can contain non-contiguous pages and there is no logic in vfio-ccw to detect adjacent IDAWs, it is the second array that is necessary but appearing to be superfluous. I am not aware of any documentation that states the pfn_array[] needs to be of contiguous pages; it is just what the code does today. I don't see any reason for this either, let's just flip the IDA codepath around so that it generates: ch_pat->pfn_array_table[1]->pfn_array[#idaws] This will bring it in line with the direct-addressed codepath, so that we can understand the behavior of this memory regardless of what type of CCW is being processed. And it means the casual observer does not need to know/care whether the pfn_array[] represents contiguous pages or not. NB: The existing vfio-ccw code only supports 4K-block Format-2 IDAs, so that "#pages" == "#idaws" in this area. This means that we will have difficulty with this overlap in terminology if support for Format-1 or 2K-block Format-2 IDAs is ever added. I don't think that this patch changes our ability to make that distinction. Signed-off-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20190606202831.44135-6-farman@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2019-06-17s390/cio: Use generalized CCW handler in cp_init()Eric Farman
It is now pretty apparent that ccwchain_handle_ccw() (nee ccwchain_handle_tic()) does everything that cp_init() wants to do. Let's remove that duplicated code from cp_init() and let ccwchain_handle_ccw() handle it itself. Signed-off-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20190606202831.44135-5-farman@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2019-06-17s390/cio: Generalize the TIC handlerEric Farman
Refactor ccwchain_handle_tic() into a routine that handles a channel program address (which itself is a CCW pointer), rather than a CCW pointer that is only a TIC CCW. This will make it easier to reuse this code for other CCW commands. Signed-off-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20190606202831.44135-4-farman@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2019-06-17s390/cio: Refactor the routine that handles TIC CCWsEric Farman
Extract the "does the target of this TIC already exist?" check from ccwchain_handle_tic(), so that it's easier to refactor that function into one that cp_init() is able to use. Signed-off-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20190606202831.44135-3-farman@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2019-06-17s390/cio: Squash cp_free() and cp_unpin_free()Eric Farman
The routine cp_free() does nothing but call cp_unpin_free(), and while most places call cp_free() there is one caller of cp_unpin_free() used when the cp is guaranteed to have not been marked initialized. This seems like a dubious way to make a distinction, so let's combine these routines and make cp_free() do all the work. Signed-off-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20190606202831.44135-2-farman@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2019-06-17mmc: mediatek: fix SDIO IRQ detection issuejjian zhou
If cmd19 timeout or response crcerr occurs during execute_tuning(), it need invoke msdc_reset_hw(). Otherwise SDIO IRQ can't be detected. Signed-off-by: jjian zhou <jjian.zhou@mediatek.com> Signed-off-by: Chaotian Jing <chaotian.jing@mediatek.com> Signed-off-by: Yong Mao <yong.mao@mediatek.com> Fixes: 5215b2e952f3 ("mmc: mediatek: Add MMC_CAP_SDIO_IRQ support") Cc: stable@vger.kernel.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2019-06-17mmc: mediatek: fix SDIO IRQ interrupt handle flowjjian zhou
SDIO IRQ is triggered by low level. It need disable SDIO IRQ detected function. Otherwise the interrupt register can't be cleared. It will process the interrupt more. Signed-off-by: Jjian Zhou <jjian.zhou@mediatek.com> Signed-off-by: Chaotian Jing <chaotian.jing@mediatek.com> Signed-off-by: Yong Mao <yong.mao@mediatek.com> Fixes: 5215b2e952f3 ("mmc: mediatek: Add MMC_CAP_SDIO_IRQ support") Cc: stable@vger.kernel.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2019-06-17mmc: core: complete HS400 before checking statusWolfram Sang
We don't have a reproducible error case, yet our BSP team suggested that the mmc_switch_status() command in mmc_select_hs400() should come after the callback into the driver completing HS400 setup. It makes sense to me because we want the status of a fully setup HS400, so it will increase the reliability of the mmc_switch_status() command. Reported-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Fixes: ba6c7ac3a2f4 ("mmc: core: more fine-grained hooks for HS400 tuning") Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2019-06-17arm64/mm: Correct the cache line size warning with non coherent deviceMasayoshi Mizuma
If the cache line size is greater than ARCH_DMA_MINALIGN (128), the warning shows and it's tainted as TAINT_CPU_OUT_OF_SPEC. However, it's not good because as discussed in the thread [1], the cpu cache line size will be problem only on non-coherent devices. Since the coherent flag is already introduced to struct device, show the warning only if the device is non-coherent device and ARCH_DMA_MINALIGN is smaller than the cpu cache size. [1] https://lore.kernel.org/linux-arm-kernel/20180514145703.celnlobzn3uh5tc2@localhost/ Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Tested-by: Zhang Lei <zhang.lei@jp.fujitsu.com> [catalin.marinas@arm.com: removed 'if' block for WARN_TAINT] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2019-06-17riscv: mm: synchronize MMU after pte changeShihPo Hung
Because RISC-V compliant implementations can cache invalid entries in TLB, an SFENCE.VMA is necessary after changes to the page table. This patch adds an SFENCE.vma for the vmalloc_fault path. Signed-off-by: ShihPo Hung <shihpo.hung@sifive.com> [paul.walmsley@sifive.com: reversed tab->whitespace conversion, wrapped comment lines] Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: linux-riscv@lists.infradead.org Cc: stable@vger.kernel.org
2019-06-17x86/percpu: Optimize raw_cpu_xchg()Peter Zijlstra
Since raw_cpu_xchg() doesn't need to be IRQ-safe, like this_cpu_xchg(), we can use a simple load-store instead of the cmpxchg loop. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17x86/percpu, sched/fair: Avoid local_clock()Peter Zijlstra
Nadav reported that code-gen changed because of the this_cpu_*() constraints, avoid this for select_idle_cpu() because that runs with preemption (and IRQs) disabled anyway. Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17x86/percpu, x86/irq: Relax {set,get}_irq_regs()Peter Zijlstra
Nadav reported that since the this_cpu_*() ops got asm-volatile constraints on, code generation suffered for do_IRQ(), but since this is all with IRQs disabled we can use __this_cpu_*(). smp_x86_platform_ipi 234 222 -12,+0 smp_kvm_posted_intr_ipi 74 66 -8,+0 smp_kvm_posted_intr_wakeup_ipi 86 78 -8,+0 smp_apic_timer_interrupt 292 284 -8,+0 smp_kvm_posted_intr_nested_ipi 74 66 -8,+0 do_IRQ 195 187 -8,+0 Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17x86/percpu: Relax smp_processor_id()Peter Zijlstra
Nadav reported that since this_cpu_read() became asm-volatile, many smp_processor_id() users generated worse code due to the extra constraints. However since smp_processor_id() is reading a stable value, we can use __this_cpu_read(). While this does reduce text size somewhat, this mostly results in code movement to .text.unlikely as a result of more/larger .cold. subfunctions. Less text on the hotpath is good for I$. $ ./compare.sh defconfig-build1 defconfig-build2 vmlinux.o setup_APIC_ibs 90 98 -12,+20 force_ibs_eilvt_setup 400 413 -57,+70 pci_serr_error 109 104 -54,+49 pci_serr_error 109 104 -54,+49 unknown_nmi_error 125 120 -76,+71 unknown_nmi_error 125 120 -76,+71 io_check_error 125 132 -97,+104 intel_thermal_interrupt 730 822 +92,+0 intel_init_thermal 951 945 -6,+0 generic_get_mtrr 301 294 -7,+0 generic_get_mtrr 301 294 -7,+0 generic_set_all 749 754 -44,+49 get_fixed_ranges 352 360 -41,+49 x86_acpi_suspend_lowlevel 369 363 -6,+0 check_tsc_sync_source 412 412 -71,+71 irq_migrate_all_off_this_cpu 662 674 -14,+26 clocksource_watchdog 748 748 -113,+113 __perf_event_account_interrupt 204 197 -7,+0 attempt_merge 1748 1741 -7,+0 intel_guc_send_ct 1424 1409 -15,+0 __fini_doorbell 235 231 -4,+0 bdw_set_cdclk 928 923 -5,+0 gen11_dsi_disable 1571 1556 -15,+0 gmbus_wait 493 488 -5,+0 md_make_request 376 369 -7,+0 __split_and_process_bio 543 536 -7,+0 delay_tsc 96 89 -7,+0 hsw_disable_pc8 696 691 -5,+0 tsc_verify_tsc_adjust 215 228 -22,+35 cpuidle_driver_unref 56 49 -7,+0 blk_account_io_completion 159 148 -11,+0 mtrr_wrmsr 95 99 -29,+33 __intel_wait_for_register_fw 401 419 +18,+0 cpuidle_driver_ref 43 36 -7,+0 cpuidle_get_driver 15 8 -7,+0 blk_account_io_done 535 528 -7,+0 irq_migrate_all_off_this_cpu 662 674 -14,+26 check_tsc_sync_source 412 412 -71,+71 irq_wait_for_poll 170 163 -7,+0 generic_end_io_acct 329 322 -7,+0 x86_acpi_suspend_lowlevel 369 363 -6,+0 nohz_balance_enter_idle 198 191 -7,+0 generic_start_io_acct 254 247 -7,+0 blk_account_io_start 341 334 -7,+0 perf_event_task_tick 682 675 -7,+0 intel_init_thermal 951 945 -6,+0 amd_e400_c1e_apic_setup 47 51 -28,+32 setup_APIC_eilvt 350 328 -22,+0 hsw_enable_pc8 1611 1605 -6,+0 total 12985947 12985892 -994,+939 Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()Peter Zijlstra
Nadav Amit reported that commit: b59167ac7baf ("x86/percpu: Fix this_cpu_read()") added a bunch of constraints to all sorts of code; and while some of that was correct and desired, some of that seems superfluous. The thing is, the this_cpu_*() operations are defined IRQ-safe, this means the values are subject to change from IRQs, and thus must be reloaded. Also, the generic form: local_irq_save() __this_cpu_read() local_irq_restore() would not allow the re-use of previous values; if by nothing else, then the barrier()s implied by local_irq_*(). Which raises the point that percpu_from_op() and the others also need that volatile. OTOH __this_cpu_*() operations are not IRQ-safe and assume external preempt/IRQ disabling and could thus be allowed more room for optimization. This makes the this_cpu_*() vs __this_cpu_*() behaviour more consistent with other architectures. $ ./compare.sh defconfig-build defconfig-build1 vmlinux.o x86_pmu_cancel_txn 80 71 -9,+0 __text_poke 919 964 +45,+0 do_user_addr_fault 1082 1058 -24,+0 __do_page_fault 1194 1178 -16,+0 do_exit 2995 3027 -43,+75 process_one_work 1008 989 -67,+48 finish_task_switch 524 505 -19,+0 __schedule_bug 103 98 -59,+54 __schedule_bug 103 98 -59,+54 __sched_setscheduler 2015 2030 +15,+0 freeze_processes 203 230 +31,-4 rcu_gp_kthread_wake 106 99 -7,+0 rcu_core 1841 1834 -7,+0 call_timer_fn 298 286 -12,+0 can_stop_idle_tick 146 139 -31,+24 perf_pending_event 253 239 -14,+0 shmem_alloc_page 209 213 +4,+0 __alloc_pages_slowpath 3284 3269 -15,+0 umount_tree 671 694 +23,+0 advance_transaction 803 798 -5,+0 con_put_char 71 51 -20,+0 xhci_urb_enqueue 1302 1295 -7,+0 xhci_urb_enqueue 1302 1295 -7,+0 tcp_sacktag_write_queue 2130 2075 -55,+0 tcp_try_undo_loss 229 208 -21,+0 tcp_v4_inbound_md5_hash 438 411 -31,+4 tcp_v4_inbound_md5_hash 438 411 -31,+4 tcp_v6_inbound_md5_hash 469 411 -33,-25 tcp_v6_inbound_md5_hash 469 411 -33,-25 restricted_pointer 434 420 -14,+0 irq_exit 162 154 -8,+0 get_perf_callchain 638 624 -14,+0 rt_mutex_trylock 169 156 -13,+0 avc_has_extended_perms 1092 1089 -3,+0 avc_has_perm_noaudit 309 306 -3,+0 __perf_sw_event 138 122 -16,+0 perf_swevent_get_recursion_context 116 102 -14,+0 __local_bh_enable_ip 93 72 -21,+0 xfrm_input 4175 4161 -14,+0 avc_has_perm 446 443 -3,+0 vm_events_fold_cpu 57 56 -1,+0 vfree 68 61 -7,+0 freeze_processes 203 230 +31,-4 _local_bh_enable 44 30 -14,+0 ip_do_fragment 1982 1944 -38,+0 do_exit 2995 3027 -43,+75 __do_softirq 742 724 -18,+0 cpu_init 1510 1489 -21,+0 account_system_time 80 79 -1,+0 total 12985281 12984819 -742,+280 Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20181206112433.GB13675@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17MAINTAINERS: Update my email address to use @kernel.orgWill Deacon
My @arm.com address will stop working at the end of August, so update to my @kernel.org address where you'll still be able to reach me. When I say "stop working" I really mean "will go to my line manager", so send patches there at your peril because they may reply with roadmaps and spreadsheets. You have been warned. Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: arm-soc <arm@kernel.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-06-17locking/rwsem: Guard against making count negativeWaiman Long
The upper bits of the count field is used as reader count. When sufficient number of active readers are present, the most significant bit will be set and the count becomes negative. If the number of active readers keep on piling up, we may eventually overflow the reader counts. This is not likely to happen unless the number of bits reserved for reader count is reduced because those bits are need for other purpose. To prevent this count overflow from happening, the most significant bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock attempts will now fail for both the fast and slow paths whenever this bit is set. So all those extra readers will be put to sleep in the wait list. Wakeup will not happen until the reader count reaches 0. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: huang ying <huang.ying.caritas@gmail.com> Link: https://lkml.kernel.org/r/20190520205918.22251-17-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17locking/rwsem: Adaptive disabling of reader optimistic spinningWaiman Long
Reader optimistic spinning is helpful when the reader critical section is short and there aren't that many readers around. It makes readers relatively more preferred than writers. When a writer times out spinning on a reader-owned lock and set the nospinnable bits, there are two main reasons for that. 1) The reader critical section is long, perhaps the task sleeps after acquiring the read lock. 2) There are just too many readers contending the lock causing it to take a while to service all of them. In the former case, long reader critical section will impede the progress of writers which is usually more important for system performance. In the later case, reader optimistic spinning tends to make the reader groups that contain readers that acquire the lock together smaller leading to more of them. That may hurt performance in some cases. In other words, the setting of nonspinnable bits indicates that reader optimistic spinning may not be helpful for those workloads that cause it. Therefore, any writers that have observed the setting of the writer nonspinnable bit for a given rwsem after they fail to acquire the lock via optimistic spinning will set the reader nonspinnable bit once they acquire the write lock. Similarly, readers that observe the setting of reader nonspinnable bit at slowpath entry will also set the reader nonspinnable bit when they acquire the read lock via the wakeup path. Once the reader nonspinnable bit is on, it will only be reset when a writer is able to acquire the rwsem in the fast path or somehow a reader or writer in the slowpath doesn't observe the nonspinable bit. This is to discourage reader optmistic spinning on that particular rwsem and make writers more preferred. This adaptive disabling of reader optimistic spinning will alleviate some of the negative side effect of this feature. In addition, this patch tries to make readers in the spinning queue follow the phase-fair principle after quitting optimistic spinning by checking if another reader has somehow acquired a read lock after this reader enters the optimistic spinning queue. If so and the rwsem is still reader-owned, this reader is in the right read-phase and can attempt to acquire the lock. On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of the will-it-scale benchmark was run with various number of threads. The number of operations done before reader optimistic spinning patches, this patch and after this patch were: Threads Before rspin Before patch After patch %change ------- ------------ ------------ ----------- ------- 20 5541068 5345484 5455667 -3.5%/ +2.1% 40 10185150 7292313 9219276 -28.5%/+26.4% 60 8196733 6460517 7181209 -21.2%/+11.2% 80 9508864 6739559 8107025 -29.1%/+20.3% This patch doesn't recover all the lost performance, but it is more than half. Given the fact that reader optimistic spinning does benefit some workloads, this is a good compromise. Using the rwsem locking microbenchmark with very short critical section, this patch doesn't have too much impact on locking performance as shown by the locking rates (kops/s) below with equal numbers of readers and writers before and after this patch: # of Threads Pre-patch Post-patch ------------ --------- ---------- 2 4,730 4,969 4 4,814 4,786 8 4,866 4,815 16 4,715 4,511 32 3,338 3,500 64 3,212 3,389 80 3,110 3,044 When running the locking microbenchmark with 40 dedicated reader and writer threads, however, the reader performance is curtailed to favor the writer. Before patch: 40 readers, Iterations Min/Mean/Max = 204,026/234,309/254,816 40 writers, Iterations Min/Mean/Max = 88,515/95,884/115,644 After patch: 40 readers, Iterations Min/Mean/Max = 33,813/35,260/36,791 40 writers, Iterations Min/Mean/Max = 95,368/96,565/97,798 Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: huang ying <huang.ying.caritas@gmail.com> Link: https://lkml.kernel.org/r/20190520205918.22251-16-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17locking/rwsem: Enable time-based spinning on reader-owned rwsemWaiman Long
When the rwsem is owned by reader, writers stop optimistic spinning simply because there is no easy way to figure out if all the readers are actively running or not. However, there are scenarios where the readers are unlikely to sleep and optimistic spinning can help performance. This patch provides a simple mechanism for spinning on a reader-owned rwsem by a writer. It is a time threshold based spinning where the allowable spinning time can vary from 10us to 25us depending on the condition of the rwsem. When the time threshold is exceeded, the nonspinnable bits will be set in the owner field to indicate that no more optimistic spinning will be allowed on this rwsem until it becomes writer owned again. Not even readers is allowed to acquire the reader-locked rwsem by optimistic spinning for fairness. We also want a writer to acquire the lock after the readers hold the lock for a relatively long time. In order to give preference to writers under such a circumstance, the single RWSEM_NONSPINNABLE bit is now split into two - one for reader and one for writer. When optimistic spinning is disabled, both bits will be set. When the reader count drop down to 0, the writer nonspinnable bit will be cleared to allow writers to spin on the lock, but not the readers. When a writer acquires the lock, it will write its own task structure pointer into sem->owner and clear the reader nonspinnable bit in the process. The time taken for each iteration of the reader-owned rwsem spinning loop varies. Below are sample minimum elapsed times for 16 iterations of the loop. System Time for 16 Iterations ------ ---------------------- 1-socket Skylake ~800ns 4-socket Broadwell ~300ns 2-socket ThunderX2 (arm64) ~250ns When the lock cacheline is contended, we can see up to almost 10X increase in elapsed time. So 25us will be at most 500, 1300 and 1600 iterations for each of the above systems. With a locking microbenchmark running on 5.1 based kernel, the total locking rates (in kops/s) on a 8-socket IvyBridge-EX system with equal numbers of readers and writers before and after this patch were as follows: # of Threads Pre-patch Post-patch ------------ --------- ---------- 2 1,759 6,684 4 1,684 6,738 8 1,074 7,222 16 900 7,163 32 458 7,316 64 208 520 128 168 425 240 143 474 This patch gives a big boost in performance for mixed reader/writer workloads. With 32 locking threads, the rwsem lock event data were: rwsem_opt_fail=79850 rwsem_opt_nospin=5069 rwsem_opt_rlock=597484 rwsem_opt_wlock=957339 rwsem_sleep_reader=57782 rwsem_sleep_writer=55663 With 64 locking threads, the data looked like: rwsem_opt_fail=346723 rwsem_opt_nospin=6293 rwsem_opt_rlock=1127119 rwsem_opt_wlock=1400628 rwsem_sleep_reader=308201 rwsem_sleep_writer=72281 So a lot more threads acquired the lock in the slowpath and more threads went to sleep. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: huang ying <huang.ying.caritas@gmail.com> Link: https://lkml.kernel.org/r/20190520205918.22251-15-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17locking/rwsem: Make rwsem->owner an atomic_long_tWaiman Long
The rwsem->owner contains not just the task structure pointer, it also holds some flags for storing the current state of the rwsem. Some of the flags may have to be atomically updated. To reflect the new reality, the owner is now changed to an atomic_long_t type. New helper functions are added to properly separate out the task structure pointer and the embedded flags. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: huang ying <huang.ying.caritas@gmail.com> Link: https://lkml.kernel.org/r/20190520205918.22251-14-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17locking/rwsem: Enable readers spinning on writerWaiman Long
This patch enables readers to optimistically spin on a rwsem when it is owned by a writer instead of going to sleep directly. The rwsem_can_spin_on_owner() function is extracted out of rwsem_optimistic_spin() and is called directly by rwsem_down_read_slowpath() and rwsem_down_write_slowpath(). With a locking microbenchmark running on 5.1 based kernel, the total locking rates (in kops/s) on a 8-socket IvyBrige-EX system with equal numbers of readers and writers before and after the patch were as follows: # of Threads Pre-patch Post-patch ------------ --------- ---------- 4 1,674 1,684 8 1,062 1,074 16 924 900 32 300 458 64 195 208 128 164 168 240 149 143 The performance change wasn't significant in this case, but this change is required by a follow-on patch. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: huang ying <huang.ying.caritas@gmail.com> Link: https://lkml.kernel.org/r/20190520205918.22251-13-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17locking/rwsem: Clarify usage of owner's nonspinaable bitWaiman Long
Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an anonymous owner - readers or an anonymous writer. The setting of this anonymous bit is used as an indicator that optimistic spinning cannot be done on this rwsem. With the upcoming reader optimistic spinning patches, a reader-owned rwsem can be spinned on for a limit period of time. We still need this bit to indicate a rwsem is nonspinnable, but not setting this bit loses its meaning that the owner is known. So rename the bit to RWSEM_NONSPINNABLE to clarify its meaning. This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write(). Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: huang ying <huang.ying.caritas@gmail.com> Link: https://lkml.kernel.org/r/20190520205918.22251-12-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17locking/rwsem: Wake up almost all readers in wait queueWaiman Long
When the front of the wait queue is a reader, other readers immediately following the first reader will also be woken up at the same time. However, if there is a writer in between. Those readers behind the writer will not be woken up. Because of optimistic spinning, the lock acquisition order is not FIFO anyway. The lock handoff mechanism will ensure that lock starvation will not happen. Assuming that the lock hold times of the other readers still in the queue will be about the same as the readers that are being woken up, there is really not much additional cost other than the additional latency due to the wakeup of additional tasks by the waker. Therefore all the readers up to a maximum of 256 in the queue are woken up when the first waiter is a reader to improve reader throughput. This is somewhat similar in concept to a phase-fair R/W lock. With a locking microbenchmark running on 5.1 based kernel, the total locking rates (in kops/s) on a 8-socket IvyBridge-EX system with equal numbers of readers and writers before and after this patch were as follows: # of Threads Pre-Patch Post-patch ------------ --------- ---------- 4 1,641 1,674 8 731 1,062 16 564 924 32 78 300 64 38 195 240 50 149 There is no performance gain at low contention level. At high contention level, however, this patch gives a pretty decent performance boost. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: huang ying <huang.ying.caritas@gmail.com> Link: https://lkml.kernel.org/r/20190520205918.22251-11-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17locking/rwsem: More optimal RT task handling of null ownerWaiman Long
An RT task can do optimistic spinning only if the lock holder is actually running. If the state of the lock holder isn't known, there is a possibility that high priority of the RT task may block forward progress of the lock holder if it happens to reside on the same CPU. This will lead to deadlock. So we have to make sure that an RT task will not spin on a reader-owned rwsem. When the owner is temporarily set to NULL, there are two cases where we may want to continue spinning: 1) The lock owner is in the process of releasing the lock, sem->owner is cleared but the lock has not been released yet. 2) The lock was free and owner cleared, but another task just comes in and acquire the lock before we try to get it. The new owner may be a spinnable writer. So an RT task is now made to retry one more time to see if it can acquire the lock or continue spinning on the new owning writer. When testing on a 8-socket IvyBridge-EX system, the one additional retry seems to improve locking performance of RT write locking threads under heavy contentions. The table below shows the locking rates (in kops/s) with various write locking threads before and after the patch. Locking threads Pre-patch Post-patch --------------- --------- ----------- 4 2,753 2,608 8 2,529 2,520 16 1,727 1,918 32 1,263 1,956 64 889 1,343 Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: huang ying <huang.ying.caritas@gmail.com> Link: https://lkml.kernel.org/r/20190520205918.22251-10-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17locking/rwsem: Always release wait_lock before waking up tasksWaiman Long
With the use of wake_q, we can do task wakeups without holding the wait_lock. There is one exception in the rwsem code, though. It is when the writer in the slowpath detects that there are waiters ahead but the rwsem is not held by a writer. This can lead to a long wait_lock hold time especially when a large number of readers are to be woken up. Remediate this situation by releasing the wait_lock before waking up tasks and re-acquiring it afterward. The rwsem_try_write_lock() function is also modified to read the rwsem count directly to avoid stale count value. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: huang ying <huang.ying.caritas@gmail.com> Link: https://lkml.kernel.org/r/20190520205918.22251-9-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17locking/rwsem: Implement lock handoff to prevent lock starvationWaiman Long
Because of writer lock stealing, it is possible that a constant stream of incoming writers will cause a waiting writer or reader to wait indefinitely leading to lock starvation. This patch implements a lock handoff mechanism to disable lock stealing and force lock handoff to the first waiter or waiters (for readers) in the queue after at least a 4ms waiting period unless it is a RT writer task which doesn't need to wait. The waiting period is used to avoid discouraging lock stealing too much to affect performance. The setting and clearing of the handoff bit is serialized by the wait_lock. So racing is not possible. A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core 80-thread Skylake system with a v5.1 based kernel and 240 write_lock threads with 5us sleep critical section. Before the patch, the min/mean/max numbers of locking operations for the locking threads were 1/7,792/173,696. After the patch, the figures became 5,842/6,542/7,458. It can be seen that the rwsem became much more fair, though there was a drop of about 16% in the mean locking operations done which was a tradeoff of having better fairness. Making the waiter set the handoff bit right after the first wakeup can impact performance especially with a mixed reader/writer workload. With the same microbenchmark with short critical section and equal number of reader and writer threads (40/40), the reader/writer locking operation counts with the current patch were: 40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796 40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081 By making waiter set handoff bit immediately after wakeup: 40 readers, Iterations Min/Mean/Max = 43/44/46 40 writers, Iterations Min/Mean/Max = 43/1,263/3,191 Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: huang ying <huang.ying.caritas@gmail.com> Link: https://lkml.kernel.org/r/20190520205918.22251-8-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>