summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-11-29pata_falcon: Avoid type warnings from sparseFinn Thain
The zero day bot reported some sparse complaints in pata_falcon.c. E.g. drivers/ata/pata_falcon.c:58:41: warning: cast removes address space '__iomem' of expression drivers/ata/pata_falcon.c:58:41: warning: incorrect type in argument 1 (different address spaces) drivers/ata/pata_falcon.c:58:41: expected unsigned short volatile [noderef] [usertype] __iomem *port drivers/ata/pata_falcon.c:58:41: got unsigned short [usertype] * The same thing shows up in 8 places, all told. Avoid this by removing unnecessary type casts. Cc: Jens Axboe <axboe@kernel.dk> Cc: Michael Schmitz <schmitzmic@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Reported-by: kernel test robot <lkp@intel.com> Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Finn Thain <fthain@linux-m68k.org> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2021-11-29rt2x00: do not mark device gone on EPROTO errors during startStanislaw Gruszka
As reported by Exuvo is possible that we have lot's of EPROTO errors during device start i.e. firmware load. But after that device works correctly. Hence marking device gone by few EPROTO errors done by commit e383c70474db ("rt2x00: check number of EPROTO errors") caused regression - Exuvo device stop working after kernel update. To fix disable the check during device start. Link: https://lore.kernel.org/linux-wireless/bff7d309-a816-6a75-51b6-5928ef4f7a8c@exuvo.se/ Reported-and-tested-by: Exuvo <exuvo@exuvo.se> Fixes: e383c70474db ("rt2x00: check number of EPROTO errors") Cc: stable@vger.kernel.org Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20211111141003.GA134627@wp.pl
2021-11-29rtlwifi: rtl8192de: Style clean-upsKees Cook
Clean up some style issues: - Use ARRAY_SIZE() even though it's a u8 array. - Remove redundant CHANNEL_MAX_NUMBER_2G define. Additionally fix some dead code WARNs. Acked-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://lore.kernel.org/lkml/57d0d1b6064342309f680f692192556c@realtek.com/ Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20211119192233.1021063-1-keescook@chromium.org
2021-11-29drm/virtio: use drm_poll(..) instead of virtio_gpu_poll(..)Gurchetan Singh
With the use of dummy events, we can drop virtgpu specific behavior. Fixes: cd7f5ca33585 ("drm/virtio: implement context init: add virtio_gpu_fence_event") Reported-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Gurchetan Singh <gurchetansingh@chromium.org> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: http://patchwork.freedesktop.org/patch/msgid/20211122232210.602-3-gurchetansingh@google.com Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
2021-11-29drm/virtgpu api: define a dummy fence signaled eventGurchetan Singh
The current virtgpu implementation of poll(..) drops events when VIRTGPU_CONTEXT_PARAM_POLL_RINGS_MASK is enabled (otherwise it's like a normal DRM driver). This is because paravirtualized userspaces receives responses in a buffer of type BLOB_MEM_GUEST, not by read(..). To be in line with other DRM drivers and avoid specialized behavior, it is possible to define a dummy event for virtgpu. Paravirtualized userspace will now have to call read(..) on the DRM fd to receive the dummy event. Fixes: b10790434cf2 ("drm/virtgpu api: create context init feature") Reported-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Gurchetan Singh <gurchetansingh@chromium.org> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: http://patchwork.freedesktop.org/patch/msgid/20211122232210.602-2-gurchetansingh@google.com Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
2021-11-29mwl8k: Use named struct for memcpy() regionKees Cook
In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. Use named struct in struct mwl8k_cmd_set_key around members key_material, tkip_tx_mic_key, and tkip_rx_mic_key so they can be referenced together. This will allow memcpy() and sizeof() to more easily reason about sizes, improve readability, and avoid future warnings about writing beyond the end of key_material. "pahole" shows no size nor member offset changes to struct mwl8k_cmd_set_key. "objdump -d" shows no object code changes. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20211119004905.2348143-1-keescook@chromium.org
2021-11-29intersil: Use struct_group() for memcpy() regionKees Cook
In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. Use struct_group() in struct hfa384x_tx_frame around members frame_control, duration_id, addr1, addr2, addr3, and seq_ctrl, so they can be referenced together. This will allow memcpy() and sizeof() to more easily reason about sizes, improve readability, and avoid future warnings about writing beyond the end of frame_control. "pahole" shows no size nor member offset changes to struct hfa384x_tx_frame. "objdump -d" shows no object code changes. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20211119004646.2347920-1-keescook@chromium.org
2021-11-29libertas_tf: Use struct_group() for memcpy() regionKees Cook
In preparation for FORTIFY_SOURCE performing compile-time and run-time field array bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. Use struct_group() in struct txpd around members tx_dest_addr_high and tx_dest_addr_low so they can be referenced together. This will allow memcpy() and sizeof() to more easily reason about sizes, improve readability, and avoid future warnings about writing beyond the end of tx_dest_addr_high. "pahole" shows no size nor member offset changes to struct txpd. "objdump -d" shows no object code changes. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20211118184121.1283821-1-keescook@chromium.org
2021-11-29libertas: Use struct_group() for memcpy() regionKees Cook
In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. Use struct_group() in struct txpd around members tx_dest_addr_high and tx_dest_addr_low so they can be referenced together. This will allow memcpy() and sizeof() to more easily reason about sizes, improve readability, and avoid future warnings about writing beyond the end of queue_id. "pahole" shows no size nor member offset changes to struct txpd. "objdump -d" shows no object code changes. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20211118184104.1283637-1-keescook@chromium.org
2021-11-29wlcore: no need to initialise statics to falseJason Wang
Static variables do not need to be initialized to false. The compiler will do that. Signed-off-by: Jason Wang <wangborong@cdjrlc.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20211113063551.257804-1-wangborong@cdjrlc.com
2021-11-29rsi: Fix out-of-bounds read in rsi_read_pkt()Zekun Shen
rsi_get_* functions rely on an offset variable from usb input. The size of usb input is RSI_MAX_RX_USB_PKT_SIZE(3000), while 2-byte offset can be up to 0xFFFF. Thus a large offset can cause out-of-bounds read. The patch adds a bound checking condition when rcv_pkt_len is 0, indicating it's USB. It's unclear whether this is triggerable from other type of bus. The following check might help in that case. offset > rcv_pkt_len - FRAME_DESC_SZ The bug is trigerrable with conpromised/malfunctioning USB devices. I tested the patch with the crashing input and got no more bug report. Attached is the KASAN report from fuzzing. BUG: KASAN: slab-out-of-bounds in rsi_read_pkt+0x42e/0x500 [rsi_91x] Read of size 2 at addr ffff888019439fdb by task RX-Thread/227 CPU: 0 PID: 227 Comm: RX-Thread Not tainted 5.6.0 #66 Call Trace: dump_stack+0x76/0xa0 print_address_description.constprop.0+0x16/0x200 ? rsi_read_pkt+0x42e/0x500 [rsi_91x] ? rsi_read_pkt+0x42e/0x500 [rsi_91x] __kasan_report.cold+0x37/0x7c ? rsi_read_pkt+0x42e/0x500 [rsi_91x] kasan_report+0xe/0x20 rsi_read_pkt+0x42e/0x500 [rsi_91x] rsi_usb_rx_thread+0x1b1/0x2fc [rsi_usb] ? rsi_probe+0x16a0/0x16a0 [rsi_usb] ? _raw_spin_lock_irqsave+0x7b/0xd0 ? _raw_spin_trylock_bh+0x120/0x120 ? __wake_up_common+0x10b/0x520 ? rsi_probe+0x16a0/0x16a0 [rsi_usb] kthread+0x2b5/0x3b0 ? kthread_create_on_node+0xd0/0xd0 ret_from_fork+0x22/0x40 Reported-by: Brendan Dolan-Gavitt <brendandg@nyu.edu> Signed-off-by: Zekun Shen <bruceshenzk@gmail.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/YXxXS4wgu2OsmlVv@10-18-43-117.dynapool.wireless.nyu.edu
2021-11-29rsi: Fix use-after-free in rsi_rx_done_handler()Zekun Shen
When freeing rx_cb->rx_skb, the pointer is not set to NULL, a later rsi_rx_done_handler call will try to read the freed address. This bug will very likley lead to double free, although detected early as use-after-free bug. The bug is triggerable with a compromised/malfunctional usb device. After applying the patch, the same input no longer triggers the use-after-free. Attached is the kasan report from fuzzing. BUG: KASAN: use-after-free in rsi_rx_done_handler+0x354/0x430 [rsi_usb] Read of size 4 at addr ffff8880188e5930 by task modprobe/231 Call Trace: <IRQ> dump_stack+0x76/0xa0 print_address_description.constprop.0+0x16/0x200 ? rsi_rx_done_handler+0x354/0x430 [rsi_usb] ? rsi_rx_done_handler+0x354/0x430 [rsi_usb] __kasan_report.cold+0x37/0x7c ? dma_direct_unmap_page+0x90/0x110 ? rsi_rx_done_handler+0x354/0x430 [rsi_usb] kasan_report+0xe/0x20 rsi_rx_done_handler+0x354/0x430 [rsi_usb] __usb_hcd_giveback_urb+0x1e4/0x380 usb_giveback_urb_bh+0x241/0x4f0 ? __usb_hcd_giveback_urb+0x380/0x380 ? apic_timer_interrupt+0xa/0x20 tasklet_action_common.isra.0+0x135/0x330 __do_softirq+0x18c/0x634 ? handle_irq_event+0xcd/0x157 ? handle_edge_irq+0x1eb/0x7b0 irq_exit+0x114/0x140 do_IRQ+0x91/0x1e0 common_interrupt+0xf/0xf </IRQ> Reported-by: Brendan Dolan-Gavitt <brendandg@nyu.edu> Signed-off-by: Zekun Shen <bruceshenzk@gmail.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/YXxQL/vIiYcZUu/j@10-18-43-117.dynapool.wireless.nyu.edu
2021-11-29brcmfmac: Configure keep-alive packet on suspendLoic Poulain
When entering suspend as a client station with wowlan enabled, the Wi-Fi link is supposed to be maintained. In that state, no more data is generated from client side, and the link stays idle as long the station is suspended and as long the AP as no data to transmit. However, most of the APs kick-off such 'inactive' stations after few minutes, causing unexpected disconnect (reconnect, etc...). The usual way to prevent this is to submit a Null function frame periodically as a keep-alive. This is something that can be host /software generated (e.g. wpa_supplicant), but that needs to be offloaded to the Wi-Fi controller in case of suspended host. This change enables firmware generated keep-alive frames when entering wowlan suspend, using the 'mkeep_alive' IOVAR. Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/1637596046-21651-1-git-send-email-loic.poulain@linaro.org
2021-11-29HID: quirks: Add quirk for the Microsoft Surface 3 type-coverHans de Goede
Add a HID_QUIRK_NO_INIT_REPORTS quirk for the Microsoft Surface 3 (non pro) type-cover. Trying to init the reports seems to confuse the type-cover and causes 2 issues: 1. Despite hid-multitouch sending the command to switch the touchpad to multitouch mode, it keeps sending events on the mouse emulation interface. 2. The touchpad completely stops sending events after a reboot. Adding the HID_QUIRK_NO_INIT_REPORTS quirk fixes both issues. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2021-11-29i2c: cbus-gpio: set atomic transfer callbackAaro Koskinen
CBUS transfers have always been atomic, but after commit 63b96983a5dd ("i2c: core: introduce callbacks for atomic transfers") we started to see warnings during e.g. poweroff as the atomic callback is not explicitly set. Fix that. Fixes the following WARNING seen during Nokia N810 power down: [ 786.570617] reboot: Power down [ 786.573913] ------------[ cut here ]------------ [ 786.578826] WARNING: CPU: 0 PID: 672 at drivers/i2c/i2c-core.h:40 i2c_smbus_xfer+0x100/0x110 [ 786.587799] No atomic I2C transfer handler for 'i2c-2' Fixes: 63b96983a5dd ("i2c: core: introduce callbacks for atomic transfers") Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: Wolfram Sang <wsa@kernel.org>
2021-11-29s390/pci: move pseudo-MMIO to prevent MIO overlapNiklas Schnelle
When running without MIO support, with pci=nomio or for devices which are not MIO-capable the zPCI subsystem generates pseudo-MMIO addresses to allow access to PCI BARs via MMIO based Linux APIs even though the platform uses function handles and BAR numbers. This is done by stashing an index into our global IOMAP array which contains the function handle in the 16 most significant bits of the addresses returned by ioremap() always setting the most significant bit. On the other hand the MIO addresses assigned by the platform for use, while requiring special instructions, allow PCI access with virtually mapped physical addresses. Now the problem is that these MIO addresses and our own pseudo-MMIO addresses may overlap, while functionally this would not be a problem by itself this overlap is detected by common code as both address types are added as resources in the iomem_resource tree. This leads to the overlapping resource claim of either the MIO capable or non-MIO capable devices with being rejected. Since PCI is tightly coupled to the use of the iomem_resource tree, see for example the code for request_mem_region(), we can't reasonably get rid of the overlap being detected by keeping our pseudo-MMIO addresses out of the iomem_resource tree. Instead let's move the range used by our own pseudo-MMIO addresses by starting at (1UL << 62) and only using addresses below (1UL << 63) thus avoiding the range currently used for MIO addresses. Fixes: c7ff0e918a7c ("s390/pci: deal with devices that have no support for MIO instructions") Cc: stable@vger.kernel.org # 5.3+ Reviewed-by: Pierre Morel <pmorel@linux.ibm.com> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2021-11-29ALSA: hda/cs8409: Set PMSG_ON earlier inside cs8409 driverStefan Binding
For cs8409, it is required to run Jack Detect on resume. Jack Detect on cs8409+cs42l42 requires an interrupt from cs42l42 to be sent to cs8409 which is propogated to the driver via an unsolicited event. However, the hda_codec drops unsolicited events if the power_state is not set to PMSG_ON. Which is set at the end of the resume call. This means there is a race condition between setting power_state to PMSG_ON and receiving the interrupt. To solve this, we can add an API to set the power_state earlier and call that before we start Jack Detect. This does not cause issues, since we know inside our driver that we are already initialized, and ready to handle the unsolicited events. Signed-off-by: Stefan Binding <sbinding@opensource.cirrus.com> Signed-off-by: Vitaly Rodionov <vitalyr@opensource.cirrus.com> Cc: <stable@vger.kernel.org> # v5.15+ Link: https://lore.kernel.org/r/20211128115558.71683-1-vitalyr@opensource.cirrus.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2021-11-28Merge branch 'Support static initialization of BPF_MAP_TYPE_PROG_ARRAY'Andrii Nakryiko
Hengqi Chen says: ==================== Make libbpf support static initialization of BPF_MAP_TYPE_PROG_ARRAY with a syntax similar to map-in-map initialization: SEC("socket") int tailcall_1(void *ctx) { return 0; } struct { __uint(type, BPF_MAP_TYPE_PROG_ARRAY); __uint(max_entries, 2); __uint(key_size, sizeof(__u32)); __array(values, int (void *)); } prog_array_init SEC(".maps") = { .values = { [1] = (void *)&tailcall_1, }, }; v1->v2: - Add stricter checks on relos collect, some renamings (Andrii) - Update selftest to check tailcall result (Andrii) ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2021-11-28selftests/bpf: Test BPF_MAP_TYPE_PROG_ARRAY static initializationHengqi Chen
Add testcase for BPF_MAP_TYPE_PROG_ARRAY static initialization. Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211128141633.502339-3-hengqi.chen@gmail.com
2021-11-28libbpf: Support static initialization of BPF_MAP_TYPE_PROG_ARRAYHengqi Chen
Support static initialization of BPF_MAP_TYPE_PROG_ARRAY with a syntax similar to map-in-map initialization ([0]): SEC("socket") int tailcall_1(void *ctx) { return 0; } struct { __uint(type, BPF_MAP_TYPE_PROG_ARRAY); __uint(max_entries, 2); __uint(key_size, sizeof(__u32)); __array(values, int (void *)); } prog_array_init SEC(".maps") = { .values = { [1] = (void *)&tailcall_1, }, }; Here's the relevant part of libbpf debug log showing what's going on with prog-array initialization: libbpf: sec '.relsocket': collecting relocation for section(3) 'socket' libbpf: sec '.relsocket': relo #0: insn #2 against 'prog_array_init' libbpf: prog 'entry': found map 0 (prog_array_init, sec 4, off 0) for insn #0 libbpf: .maps relo #0: for 3 value 0 rel->r_offset 32 name 53 ('tailcall_1') libbpf: .maps relo #0: map 'prog_array_init' slot [1] points to prog 'tailcall_1' libbpf: map 'prog_array_init': created successfully, fd=5 libbpf: map 'prog_array_init': slot [1] set to prog 'tailcall_1' fd=6 [0] Closes: https://github.com/libbpf/libbpf/issues/354 Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211128141633.502339-2-hengqi.chen@gmail.com
2021-11-28Linux 5.16-rc3Linus Torvalds
2021-11-28Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull vhost,virtio,vdpa bugfixes from Michael Tsirkin: "Misc fixes all over the place. Revert of virtio used length validation series: the approach taken does not seem to work, breaking too many guests in the process. We'll need to do length validation using some other approach" [ This merge also ends up reverting commit f7a36b03a732 ("vsock/virtio: suppress used length validation"), which came in through the networking tree in the meantime, and was part of that whole used length validation series - Linus ] * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vdpa_sim: avoid putting an uninitialized iova_domain vhost-vdpa: clean irqs before reseting vdpa device virtio-blk: modify the value type of num in virtio_queue_rq() vhost/vsock: cleanup removing `len` variable vhost/vsock: fix incorrect used length reported to the guest Revert "virtio_ring: validate used buffer length" Revert "virtio-net: don't let virtio core to validate used length" Revert "virtio-blk: don't let virtio core to validate used length" Revert "virtio-scsi: don't let virtio core to validate used buffer length"
2021-11-28Merge tag 'x86-urgent-2021-11-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 build fix from Thomas Gleixner: "A single fix for a missing __init annotation of prepare_command_line()" * tag 'x86-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Mark prepare_command_line() __init
2021-11-28Merge tag 'sched-urgent-2021-11-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "A single scheduler fix to ensure that there is no stale KASAN shadow state left on the idle task's stack when a CPU is brought up after it was brought down before" * tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/scs: Reset task stack state in bringup_cpu()
2021-11-28Merge tag 'perf-urgent-2021-11-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Thomas Gleixner: "A single fix for perf to prevent it from sending SIGTRAP to another task from a trace point event as it's not possible to deliver a synchronous signal to a different task from there" * tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Ignore sigtrap for tracepoints destined for other tasks
2021-11-28Merge tag 'locking-urgent-2021-11-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Thomas Gleixner: "Two regression fixes for reader writer semaphores: - Plug a race in the lock handoff which is caused by inconsistency of the reader and writer path and can lead to corruption of the underlying counter. - down_read_trylock() is suboptimal when the lock is contended and multiple readers trylock concurrently. That's due to the initial value being read non-atomically which results in at least two compare exchange loops. Making the initial readout atomic reduces this significantly. Whith 40 readers by 11% in a benchmark which enforces contention on mmap_sem" * tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rwsem: Optimize down_read_trylock() under highly contended case locking/rwsem: Make handoff bit handling more consistent
2021-11-28Merge tag 'trace-v5.16-rc2-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull another tracing fix from Steven Rostedt: "Fix the fix of pid filtering The setting of the pid filtering flag tested the "trace only this pid" case twice, and ignored the "trace everything but this pid" case. The 5.15 kernel does things a little differently due to the new sparse pid mask introduced in 5.16, and as the bug was discovered running the 5.15 kernel, and the first fix was initially done for that kernel, that fix handled both cases (only pid and all but pid), but the forward port to 5.16 created this bug" * tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Test the 'Do not trace this pid' case in create event
2021-11-28Merge tag 'iommu-fixes-v5.16-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu fixes from Joerg Roedel: - Intel VT-d fixes: - Remove unused PASID_DISABLED - Fix RCU locking - Fix for the unmap_pages call-back - Rockchip RK3568 address mask fix - AMD IOMMUv2 log message clarification * tag 'iommu-fixes-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/vt-d: Fix unmap_pages support iommu/vt-d: Fix an unbalanced rcu_read_lock/rcu_read_unlock() iommu/rockchip: Fix PAGE_DESC_HI_MASKs for RK3568 iommu/amd: Clarify AMD IOMMUv2 initialization messages iommu/vt-d: Remove unused PASID_DISABLED
2021-11-27Merge tag '5.16-rc2-ksmbd-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull ksmbd fixes from Steve French: "Five ksmbd server fixes, four of them for stable: - memleak fix - fix for default data stream on filesystems that don't support xattr - error logging fix - session setup fix - minor doc cleanup" * tag '5.16-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: fix memleak in get_file_stream_info() ksmbd: contain default data stream even if xattr is empty ksmbd: downgrade addition info error msg to debug in smb2_get_info_sec() docs: filesystem: cifs: ksmbd: Fix small layout issues ksmbd: Fix an error handling path in 'smb2_sess_setup()'
2021-11-27vmxnet3: Use generic Kconfig option for page size limitGuenter Roeck
Use the architecture independent Kconfig option PAGE_SIZE_LESS_THAN_64KB to indicate that VMXNET3 requires a page size smaller than 64kB. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-27fs: ntfs: Limit NTFS_RW to page sizes smaller than 64kGuenter Roeck
NTFS_RW code allocates page size dependent arrays on the stack. This results in build failures if the page size is 64k or larger. fs/ntfs/aops.c: In function 'ntfs_write_mst_block': fs/ntfs/aops.c:1311:1: error: the frame size of 2240 bytes is larger than 2048 bytes Since commit f22969a66041 ("powerpc/64s: Default to 64K pages for 64 bit book3s") this affects ppc:allmodconfig builds, but other architectures supporting page sizes of 64k or larger are also affected. Increasing the maximum frame size for affected architectures just to silence this error does not really help. The frame size would have to be set to a really large value for 256k pages. Also, a large frame size could potentially result in stack overruns in this code and elsewhere and is therefore not desirable. Make NTFS_RW dependent on page sizes smaller than 64k instead. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Cc: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-27arch: Add generic Kconfig option indicating page size smaller than 64kGuenter Roeck
NTFS_RW and VMXNET3 require a page size smaller than 64kB. Add generic Kconfig option for use outside architecture code to avoid architecture specific Kconfig options in that code. Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Cc: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-27tracing: Test the 'Do not trace this pid' case in create eventSteven Rostedt (VMware)
When creating a new event (via a module, kprobe, eprobe, etc), the descriptors that are created must add flags for pid filtering if an instance has pid filtering enabled, as the flags are used at the time the event is executed to know if pid filtering should be done or not. The "Only trace this pid" case was added, but a cut and paste error made that case checked twice, instead of checking the "Trace all but this pid" case. Link: https://lore.kernel.org/all/202111280401.qC0z99JB-lkp@intel.com/ Fixes: 6cb206508b62 ("tracing: Check pid filtering when creating events") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-27Merge tag 'xfs-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Darrick Wong: "Fixes for a resource leak and a build robot complaint about totally dead code: - Fix buffer resource leak that could lead to livelock on corrupt fs. - Remove unused function xfs_inew_wait to shut up the build robots" * tag 'xfs-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: remove xfs_inew_wait xfs: Fix the free logic of state in xfs_attr_node_hasname
2021-11-27Merge tag 'iomap-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap fixes from Darrick Wong: "A single iomap bug fix and a cleanup for 5.16-rc2. The bug fix changes how iomap deals with reading from an inline data region -- whereas the current code (incorrectly) lets the iomap read iter try for more bytes after reading the inline region (which zeroes the rest of the page!) and hopes the next iteration terminates, we surveyed the inlinedata implementations and realized that all inlinedata implementations also require that the inlinedata region end at EOF, so we can simply terminate the read. The second patch documents these assumptions in the code so that they're not subtle implications anymore, and cleans up some of the grosser parts of that function. Summary: - Fix an accounting problem where unaligned inline data reads can run off the end of the read iomap iterator. iomap has historically required that inline data mappings only exist at the end of a file, though this wasn't documented anywhere. - Document iomap_read_inline_data and change its return type to be appropriate for the information that it's actually returning" * tag 'iomap-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: iomap_read_inline_data cleanup iomap: Fix inline extent handling in iomap_readpage
2021-11-27Merge tag 'trace-v5.16-rc2-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Two fixes to event pid filtering: - Make sure newly created events reflect the current state of pid filtering - Take pid filtering into account when recording trigger events. (Also clean up the if statement to be cleaner)" * tag 'trace-v5.16-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix pid filtering when triggers are attached tracing: Check pid filtering when creating events
2021-11-27Merge tag 'io_uring-5.16-2021-11-27' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more io_uring fixes from Jens Axboe: "The locking fixup that was applied earlier this rc has both a deadlock and IRQ safety issue, let's get that ironed out before -rc3. This contains: - Link traversal locking fix (Pavel) - Cancelation fix (Pavel) - Relocate cond_resched() for huge buffer chain freeing, avoiding a softlockup warning (Ye) - Fix timespec validation (Ye)" * tag 'io_uring-5.16-2021-11-27' of git://git.kernel.dk/linux-block: io_uring: Fix undefined-behaviour in io_issue_sqe io_uring: fix soft lockup when call __io_remove_buffers io_uring: fix link traversal locking io_uring: fail cancellation for EXITING tasks
2021-11-27Merge tag 'block-5.16-2021-11-27' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block fixes from Jens Axboe: "Turns out that the flushing out of pending fixes before the Thanksgiving break didn't quite work out in terms of timing, so here's a followup set of fixes: - rq_qos_done() should be called regardless of whether or not we're the final put of the request, it's not related to the freeing of the state. This fixes an IO stall with wbt that a few users have reported, a regression in this release. - Only define zram_wb_devops if it's used, fixing a compilation warning for some compilers" * tag 'block-5.16-2021-11-27' of git://git.kernel.dk/linux-block: zram: only make zram_wb_devops for CONFIG_ZRAM_WRITEBACK block: call rq_qos_done() before ref check in batch completions
2021-11-27Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Twelve fixes, eleven in drivers (target, qla2xx, scsi_debug, mpt3sas, ufs). The core fix is a minor correction to the previous state update fix for the iscsi daemons" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: scsi_debug: Zero clear zones at reset write pointer scsi: core: sysfs: Fix setting device state to SDEV_RUNNING scsi: scsi_debug: Sanity check block descriptor length in resp_mode_select() scsi: target: configfs: Delete unnecessary checks for NULL scsi: target: core: Use RCU helpers for INQUIRY t10_alua_tg_pt_gp scsi: mpt3sas: Fix incorrect system timestamp scsi: mpt3sas: Fix system going into read-only mode scsi: mpt3sas: Fix kernel panic during drive powercycle test scsi: ufs: ufs-mediatek: Add put_device() after of_find_device_by_node() scsi: scsi_debug: Fix type in min_t to avoid stack OOB scsi: qla2xxx: edif: Fix off by one bug in qla_edif_app_getfcinfo() scsi: ufs: ufshpb: Fix warning in ufshpb_set_hpb_read_to_upiu()
2021-11-27Merge tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client fixes from Trond Myklebust: "Highlights include: Stable fixes: - NFSv42: Fix pagecache invalidation after COPY/CLONE Bugfixes: - NFSv42: Don't fail clone() just because the server failed to return post-op attributes - SUNRPC: use different lockdep keys for INET6 and LOCAL - NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION - SUNRPC: fix header include guard in trace header" * tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: use different lock keys for INET6 and LOCAL sunrpc: fix header include guard in trace header NFSv4.1: handle NFS4ERR_NOSPC by CREATE_SESSION NFSv42: Fix pagecache invalidation after COPY/CLONE NFS: Add a tracepoint to show the results of nfs_set_cache_invalid() NFSv42: Don't fail clone() unless the OP_CLONE operation failed
2021-11-27Merge tag 'erofs-for-5.16-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fix from Gao Xiang: "Fix an ABBA deadlock introduced by XArray conversion" * tag 'erofs-for-5.16-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix deadlock when shrink erofs slab
2021-11-27Merge tag 'powerpc-5.16-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "Fix KVM using a Power9 instruction on earlier CPUs, which could lead to the host SLB being incorrectly invalidated and a subsequent host crash. Fix kernel hardlockup on vmap stack overflow on 32-bit. Thanks to Christophe Leroy, Nicholas Piggin, and Fabiano Rosas" * tag 'powerpc-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/32: Fix hardlockup on vmap stack overflow KVM: PPC: Book3S HV: Prevent POWER7/8 TLB flush flushing SLB
2021-11-27Merge tag 'mips-fixes_5.16_2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull MIPS fixes from Thomas Bogendoerfer: - build fix for ZSTD enabled configs - fix for preempt warning - fix for loongson FTLB detection - fix for page table level selection * tag 'mips-fixes_5.16_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: MIPS: use 3-level pgtable for 64KB page size on MIPS_VA_BITS_48 MIPS: loongson64: fix FTLB configuration MIPS: Fix using smp_processor_id() in preemptible in show_cpuinfo() MIPS: boot/compressed/: add __ashldi3 to target for ZSTD compression
2021-11-27io_uring: Fix undefined-behaviour in io_issue_sqeYe Bin
We got issue as follows: ================================================================================ UBSAN: Undefined behaviour in ./include/linux/ktime.h:42:14 signed integer overflow: -4966321760114568020 * 1000000000 cannot be represented in type 'long long int' CPU: 1 PID: 2186 Comm: syz-executor.2 Not tainted 4.19.90+ #12 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78 show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x170/0x1dc lib/dump_stack.c:118 ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161 handle_overflow+0x188/0x1dc lib/ubsan.c:192 __ubsan_handle_mul_overflow+0x34/0x44 lib/ubsan.c:213 ktime_set include/linux/ktime.h:42 [inline] timespec64_to_ktime include/linux/ktime.h:78 [inline] io_timeout fs/io_uring.c:5153 [inline] io_issue_sqe+0x42c8/0x4550 fs/io_uring.c:5599 __io_queue_sqe+0x1b0/0xbc0 fs/io_uring.c:5988 io_queue_sqe+0x1ac/0x248 fs/io_uring.c:6067 io_submit_sqe fs/io_uring.c:6137 [inline] io_submit_sqes+0xed8/0x1c88 fs/io_uring.c:6331 __do_sys_io_uring_enter fs/io_uring.c:8170 [inline] __se_sys_io_uring_enter fs/io_uring.c:8129 [inline] __arm64_sys_io_uring_enter+0x490/0x980 fs/io_uring.c:8129 invoke_syscall arch/arm64/kernel/syscall.c:53 [inline] el0_svc_common+0x374/0x570 arch/arm64/kernel/syscall.c:121 el0_svc_handler+0x190/0x260 arch/arm64/kernel/syscall.c:190 el0_svc+0x10/0x218 arch/arm64/kernel/entry.S:1017 ================================================================================ As ktime_set only judge 'secs' if big than KTIME_SEC_MAX, but if we pass negative value maybe lead to overflow. To address this issue, we must check if 'sec' is negative. Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20211118015907.844807-1-yebin10@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-27io_uring: fix soft lockup when call __io_remove_buffersYe Bin
I got issue as follows: [ 567.094140] __io_remove_buffers: [1]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680 [ 594.360799] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108] [ 594.364987] Modules linked in: [ 594.365405] irq event stamp: 604180238 [ 594.365906] hardirqs last enabled at (604180237): [<ffffffff93fec9bd>] _raw_spin_unlock_irqrestore+0x2d/0x50 [ 594.367181] hardirqs last disabled at (604180238): [<ffffffff93fbbadb>] sysvec_apic_timer_interrupt+0xb/0xc0 [ 594.368420] softirqs last enabled at (569080666): [<ffffffff94200654>] __do_softirq+0x654/0xa9e [ 594.369551] softirqs last disabled at (569080575): [<ffffffff913e1d6a>] irq_exit_rcu+0x1ca/0x250 [ 594.370692] CPU: 2 PID: 108 Comm: kworker/u32:5 Tainted: G L 5.15.0-next-20211112+ #88 [ 594.371891] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 [ 594.373604] Workqueue: events_unbound io_ring_exit_work [ 594.374303] RIP: 0010:_raw_spin_unlock_irqrestore+0x33/0x50 [ 594.375037] Code: 48 83 c7 18 53 48 89 f3 48 8b 74 24 10 e8 55 f5 55 fd 48 89 ef e8 ed a7 56 fd 80 e7 02 74 06 e8 43 13 7b fd fb bf 01 00 00 00 <e8> f8 78 474 [ 594.377433] RSP: 0018:ffff888101587a70 EFLAGS: 00000202 [ 594.378120] RAX: 0000000024030f0d RBX: 0000000000000246 RCX: 1ffffffff2f09106 [ 594.379053] RDX: 0000000000000000 RSI: ffffffff9449f0e0 RDI: 0000000000000001 [ 594.379991] RBP: ffffffff9586cdc0 R08: 0000000000000001 R09: fffffbfff2effcab [ 594.380923] R10: ffffffff977fe557 R11: fffffbfff2effcaa R12: ffff8881b8f3def0 [ 594.381858] R13: 0000000000000246 R14: ffff888153a8b070 R15: 0000000000000000 [ 594.382787] FS: 0000000000000000(0000) GS:ffff888399c00000(0000) knlGS:0000000000000000 [ 594.383851] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 594.384602] CR2: 00007fcbe71d2000 CR3: 00000000b4216000 CR4: 00000000000006e0 [ 594.385540] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 594.386474] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 594.387403] Call Trace: [ 594.387738] <TASK> [ 594.388042] find_and_remove_object+0x118/0x160 [ 594.389321] delete_object_full+0xc/0x20 [ 594.389852] kfree+0x193/0x470 [ 594.390275] __io_remove_buffers.part.0+0xed/0x147 [ 594.390931] io_ring_ctx_free+0x342/0x6a2 [ 594.392159] io_ring_exit_work+0x41e/0x486 [ 594.396419] process_one_work+0x906/0x15a0 [ 594.399185] worker_thread+0x8b/0xd80 [ 594.400259] kthread+0x3bf/0x4a0 [ 594.401847] ret_from_fork+0x22/0x30 [ 594.402343] </TASK> Message from syslogd@localhost at Nov 13 09:09:54 ... kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108] [ 596.793660] __io_remove_buffers: [2099199]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680 We can reproduce this issue by follow syzkaller log: r0 = syz_io_uring_setup(0x401, &(0x7f0000000300), &(0x7f0000003000/0x2000)=nil, &(0x7f0000ff8000/0x4000)=nil, &(0x7f0000000280)=<r1=>0x0, &(0x7f0000000380)=<r2=>0x0) sendmsg$ETHTOOL_MSG_FEATURES_SET(0xffffffffffffffff, &(0x7f0000003080)={0x0, 0x0, &(0x7f0000003040)={&(0x7f0000000040)=ANY=[], 0x18}}, 0x0) syz_io_uring_submit(r1, r2, &(0x7f0000000240)=@IORING_OP_PROVIDE_BUFFERS={0x1f, 0x5, 0x0, 0x401, 0x1, 0x0, 0x100, 0x0, 0x1, {0xfffd}}, 0x0) io_uring_enter(r0, 0x3a2d, 0x0, 0x0, 0x0, 0x0) The reason above issue is 'buf->list' has 2,100,000 nodes, occupied cpu lead to soft lockup. To solve this issue, we need add schedule point when do while loop in '__io_remove_buffers'. After add schedule point we do regression, get follow data. [ 240.141864] __io_remove_buffers: [1]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00 [ 268.408260] __io_remove_buffers: [1]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180 [ 275.899234] __io_remove_buffers: [2099199]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00 [ 296.741404] __io_remove_buffers: [1]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380 [ 305.090059] __io_remove_buffers: [2099199]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180 [ 325.415746] __io_remove_buffers: [1]start ctx=0xffff8881b92d1000 bgid=65533 buf=0xffff8881a17d8f00 [ 333.160318] __io_remove_buffers: [2099199]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380 ... Fixes:8bab4c09f24e("io_uring: allow conditional reschedule for intensive iterators") Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20211122024737.2198530-1-yebin10@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-26Merge branch 'af_unix-replace-unix_table_lock-with-per-hash-locks'Jakub Kicinski
Kuniyuki Iwashima says: ==================== af_unix: Replace unix_table_lock with per-hash locks. The hash table of AF_UNIX sockets is protected by a single big lock, unix_table_lock. This series replaces it with small per-hash locks. 1st - 2nd : Misc refactoring 3rd - 8th : Separate BSD/abstract address logics 9th - 11th : Prep to save a hash in each socket 12th : Replace the big lock 13th : Speed up autobind() Note to maintainers: The 12th patch adds two kinds of Sparse warnings on patchwork: about unix_table_double_lock/unlock() We can avoid this by adding two apparent acquires/releases annotations, but there are the same kinds of warnings about unix_state_double_lock(). about unix_next_socket() and unix_seq_stop() (/proc/net/unix) This is because Sparse does not understand logic in unix_next_socket(), which leaves a spin lock held until it returns NULL. Also, tcp_seq_stop() causes a warning for the same reason. These warnings seem reasonable, but let me know if there is any better way. Please see [0] for details. [0]: https://lore.kernel.org/netdev/20211117001611.74123-1-kuniyu@amazon.co.jp/ ==================== Link: https://lore.kernel.org/r/20211124021431.48956-1-kuniyu@amazon.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-26af_unix: Relax race in unix_autobind().Kuniyuki Iwashima
When we bind an AF_UNIX socket without a name specified, the kernel selects an available one from 0x00000 to 0xFFFFF. unix_autobind() starts searching from a number in the 'static' variable and increments it after acquiring two locks. If multiple processes try autobind, they obtain the same lock and check if a socket in the hash list has the same name. If not, one process uses it, and all except one end up retrying the _next_ number (actually not, it may be incremented by the other processes). The more we autobind sockets in parallel, the longer the latency gets. We can avoid such a race by searching for a name from a random number. These show latency in unix_autobind() while 64 CPUs are simultaneously autobind-ing 1024 sockets for each. Without this patch: usec : count distribution 0 : 1176 |*** | 2 : 3655 |*********** | 4 : 4094 |************* | 6 : 3831 |************ | 8 : 3829 |************ | 10 : 3844 |************ | 12 : 3638 |*********** | 14 : 2992 |********* | 16 : 2485 |******* | 18 : 2230 |******* | 20 : 2095 |****** | 22 : 1853 |***** | 24 : 1827 |***** | 26 : 1677 |***** | 28 : 1473 |**** | 30 : 1573 |***** | 32 : 1417 |**** | 34 : 1385 |**** | 36 : 1345 |**** | 38 : 1344 |**** | 40 : 1200 |*** | With this patch: usec : count distribution 0 : 1855 |****** | 2 : 6464 |********************* | 4 : 9936 |******************************** | 6 : 12107 |****************************************| 8 : 10441 |********************************** | 10 : 7264 |*********************** | 12 : 4254 |************** | 14 : 2538 |******** | 16 : 1596 |***** | 18 : 1088 |*** | 20 : 800 |** | 22 : 670 |** | 24 : 601 |* | 26 : 562 |* | 28 : 525 |* | 30 : 446 |* | 32 : 378 |* | 34 : 337 |* | 36 : 317 |* | 38 : 314 |* | 40 : 298 | | Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-26af_unix: Replace the big lock with small locks.Kuniyuki Iwashima
The hash table of AF_UNIX sockets is protected by the single lock. This patch replaces it with per-hash locks. The effect is noticeable when we handle multiple sockets simultaneously. Here is a test result on an EC2 c5.24xlarge instance. It shows latency (under 10us only) in unix_insert_unbound_socket() while 64 CPUs creating 1024 sockets for each in parallel. Without this patch: nsec : count distribution 0 : 179 | | 500 : 3021 |********* | 1000 : 6271 |******************* | 1500 : 6318 |******************* | 2000 : 5828 |***************** | 2500 : 5124 |*************** | 3000 : 4426 |************* | 3500 : 3672 |*********** | 4000 : 3138 |********* | 4500 : 2811 |******** | 5000 : 2384 |******* | 5500 : 2023 |****** | 6000 : 1954 |***** | 6500 : 1737 |***** | 7000 : 1749 |***** | 7500 : 1520 |**** | 8000 : 1469 |**** | 8500 : 1394 |**** | 9000 : 1232 |*** | 9500 : 1138 |*** | 10000 : 994 |*** | With this patch: nsec : count distribution 0 : 1634 |**** | 500 : 13170 |****************************************| 1000 : 13156 |*************************************** | 1500 : 9010 |*************************** | 2000 : 6363 |******************* | 2500 : 4443 |************* | 3000 : 3240 |********* | 3500 : 2549 |******* | 4000 : 1872 |***** | 4500 : 1504 |**** | 5000 : 1247 |*** | 5500 : 1035 |*** | 6000 : 889 |** | 6500 : 744 |** | 7000 : 634 |* | 7500 : 498 |* | 8000 : 433 |* | 8500 : 355 |* | 9000 : 336 |* | 9500 : 284 | | 10000 : 243 | | Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-26af_unix: Save hash in sk_hash.Kuniyuki Iwashima
To replace unix_table_lock with per-hash locks in the next patch, we need to save a hash in each socket because /proc/net/unix or BPF prog iterate sockets while holding a hash table lock and release it later in a different function. Currently, we store a real/pseudo hash in struct unix_address. However, we do not allocate it to unbound sockets, nor should we do just for that. For this purpose, we can use sk_hash. Then, we no longer use the hash field in struct unix_address and can remove it. Also, this patch does - rename unix_insert_socket() to unix_insert_unbound_socket() - remove the redundant list argument from __unix_insert_socket() and unix_insert_unbound_socket() - use 'unsigned int' instead of 'unsigned' in __unix_set_addr_hash() - remove 'inline' from unix_remove_socket() and unix_insert_unbound_socket(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-26af_unix: Add helpers to calculate hashes.Kuniyuki Iwashima
This patch adds three helper functions that calculate hashes for unbound sockets and bound sockets with BSD/abstract addresses. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>