2021-07-08  pwm: spear: Ensure configuring period and duty_cycle isn't wrongly skipped  (Uwe Kleine-König)
As the last call to spear_pwm_apply() might have exited early if state->enabled was false, the values for period and duty_cycle stored in pwm->state might not have been written to hardware and it must be ensured that they are configured before enabling the PWM. Fixes: 98761ce4b91b ("pwm: spear: Implement .apply() callback") Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Thierry Reding <thierry.reding@gmail.com>
2021-07-08  pwm: sprd: Ensure configuring period and duty_cycle isn't wrongly skipped  (Uwe Kleine-König)
As the last call to sprd_pwm_apply() might have exited early if state->enabled was false, the values for period and duty_cycle stored in pwm->state might not have been written to hardware and it must be ensured that they are configured before enabling the PWM. Fixes: 8aae4b02e8a6 ("pwm: sprd: Add Spreadtrum PWM support") Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Thierry Reding <thierry.reding@gmail.com>
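Both fixes establish the same apply-path pattern; a minimal sketch follows (helper names such as pwm_config_hw() are illustrative stand-ins for each driver's real functions, not the literal patches):

  static int pwm_apply_sketch(struct pwm_chip *chip, struct pwm_device *pwm,
			      const struct pwm_state *state)
  {
	int ret;

	if (!state->enabled) {
		/* Early exit: the hardware keeps whatever was last written. */
		if (pwm->state.enabled)
			pwm_disable_hw(chip, pwm);
		return 0;
	}

	/*
	 * Always program period/duty_cycle before enabling: a previous
	 * apply may have taken the early exit above, so pwm->state does
	 * not prove the values ever reached the hardware.
	 */
	ret = pwm_config_hw(chip, pwm, state->duty_cycle, state->period);
	if (ret)
		return ret;

	if (!pwm->state.enabled)
		ret = pwm_enable_hw(chip, pwm);

	return ret;
  }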
2021-07-08  powerpc/preempt: Don't touch the idle task's preempt_count during hotplug  (Valentin Schneider)
Powerpc currently resets a CPU's idle task preempt_count to 0 before said task starts executing the secondary startup routine (and becomes an idle task proper). This conflicts with commit f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled"), which initializes all of the idle tasks' preempt_count to PREEMPT_DISABLED during smp_init(). Note that this was superfluous before said commit, as back then the hotplug machinery would invoke init_idle() via idle_thread_get(), which would have already reset the CPU's idle task's preempt_count to PREEMPT_ENABLED. Get rid of this preempt_count write. Fixes: f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled") Reported-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210707183831.2106509-1-valentin.schneider@arm.com
2021-07-08  s390/vdso: add minimal compat vdso  (Sven Schnelle)
Add a small vdso for 31 bit compat applications that provides trampolines for calls to sigreturn, rt_sigreturn and syscall_restart. This is required for moving these syscalls away from the signal frame to the vdso. Note that this patch effectively disables CONFIG_COMPAT when using clang to compile the kernel, as clang doesn't support 31 bit mode. We want to redirect sigreturn and restart_syscall to the vdso. However, the kernel cannot parse the ELF vdso file, so we need to generate header files which contain the offsets of the syscall instructions in the vdso page. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-07-08  s390/vdso: rename VDSO64_LBASE to VDSO_LBASE  (Sven Schnelle)
Will be used by both vdso32 and vdso64, so change the name. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-07-08  s390/vdso64: add sigreturn,rt_sigreturn and restart_syscall  (Sven Schnelle)
Add minimalistic trampolines to vdso64 so we can return from a signal without using the stack, which would otherwise require pgm check handler hacks when NX is enabled. restart_syscall will be called from the vdso to work around the architectural limitation that the syscall number might be encoded in the svc instruction and therefore cannot be changed. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-07-08  s390/vdso: always enable vdso  (Sven Schnelle)
With the upcoming move of the svc sigreturn instruction from the signal frame to vdso we need to have vdso always enabled. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-07-08  s390/ap: get rid of register asm  (Heiko Carstens)
Using register asm statements has been proven to be very error prone, especially when using code instrumentation where gcc may add function calls, which clobbers register contents in an unexpected way. Therefore get rid of register asm statements in ap code. There are also potential bugs depending on the inline decisions of the compiler, e.g. for:

  static inline struct ap_queue_status ap_tapq(ap_qid_t qid, unsigned long *info)
  {
	register unsigned long reg0 asm ("0") = qid;
	register struct ap_queue_status reg1 asm ("1");
	register unsigned long reg2 asm ("2");

	asm volatile(".long 0xb2af0000"		/* PQAP(TAPQ) */
		     : "=d" (reg1), "=d" (reg2)
		     : "d" (reg0)
		     : "cc");
	if (info)
		*info = reg2;
	return reg1;
  }

In case of KCOV the "if (info)" line could cause a generated function call, which could clobber the contents of both reg2 and reg1. Something similar can happen in case of KASAN for the "*info = reg2" line. Even though compilers will likely inline the function and optimize things away, this is not guaranteed. To get rid of this bug class, simply get rid of register asm constructs. Note: the inline function ap_dqap() will be handled in a separate patch because it requires addressing the odd register of a register pair (which is done with %N[xxx] in the assembler code) and that's currently not supported by clang. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
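For reference, the replacement follows the pattern below (a close paraphrase of the reworked helper; treat the exact operands and constraints as a sketch rather than the literal upstream code): the asm body itself moves values between the fixed registers the instruction uses and compiler-chosen operands, and the fixed registers are listed as clobbers, so instrumentation-inserted calls can no longer corrupt them silently.

  static inline struct ap_queue_status ap_tapq(ap_qid_t qid, unsigned long *info)
  {
	struct ap_queue_status reg1;
	unsigned long reg2;

	asm volatile(
		"	lgr	0,%[qid]\n"	/* qid into gr0 */
		"	.long	0xb2af0000\n"	/* PQAP(TAPQ) */
		"	lgr	%[reg1],1\n"	/* gr1 (status) into reg1 */
		"	lgr	%[reg2],2\n"	/* gr2 (info) into reg2 */
		: [reg1] "=&d" (reg1), [reg2] "=&d" (reg2)
		: [qid] "d" (qid)
		: "cc", "0", "1", "2");
	if (info)
		*info = reg2;
	return reg1;
  }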
2021-07-08  s390/irq: remove HAVE_IRQ_EXIT_ON_IRQ_STACK  (Sven Schnelle)
This is no longer true since we switched to generic entry. The code switches to the IRQ stack before calling do_IRQ, but switches back to the kernel stack before calling irq_exit(). Fixes: 56e62a737028 ("s390: convert to generic entry") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-07-08  s390/traps: do not test MONITOR CALL without CONFIG_BUG  (Ilya Leoshkevich)
tinyconfig fails to boot, because without CONFIG_BUG report_bug() always returns BUG_TRAP_TYPE_BUG, which causes mc 0,0 in test_monitor_call() to panic. Fix by skipping the test without CONFIG_BUG. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-07-08  s390/ap: Rework ap_dqap to deal with messages greater than recv buffer  (Harald Freudenberger)
Rework the ap_dqap() inline function, its dqap inline assembler invocation and the caller code in ap_queue.c to be able to handle replies which exceed the receive buffer size. ap_dqap() now provides two additional parameters to handle, together with the caller, the case where a reply in the firmware queue entry exceeds the given message buffer size. It depends on the caller how exactly to handle this. The behavior now implemented by ap_sm_recv() in ap_queue.c is to simply purge this entry from the firmware queue and let the caller 'receive' a -EMSGSIZE for the request without delivering any reply data - not even a truncated reply message. However, the reworked ap_dqap() could now be invoked in a way that the message is received in multiple parts and the caller assembles the parts into one reply message. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Suggested-by: Juergen Christ <jchrist@linux.ibm.com> Reviewed-by: Juergen Christ <jchrist@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-07-08  ext4: inline jbd2_journal_[un]register_shrinker()  (Theodore Ts'o)
The function jbd2_journal_unregister_shrinker() was getting called twice when the file system was getting unmounted. On Power and ARM platforms this was causing a kernel crash when unmounting the file system, because a percpu_counter was destroyed twice. Fix this by removing the jbd2_journal_[un]register_shrinker() functions and inlining the shrinker setup and teardown into journal_init_common() and jbd2_journal_destroy(). This means that ext4 and ocfs2 now no longer need to know about registering and unregistering jbd2's shrinker. Also, while we're at it, rename the percpu counter from j_jh_shrink_count to j_checkpoint_jh_count, since this makes it clearer what this counter is intended to track. Link: https://lore.kernel.org/r/20210705145025.3363130-1-tytso@mit.edu Fixes: 4ba3fcdde7e3 ("jbd2,ext4: add a shrinker to release checkpointed buffers") Reported-by: Jon Hunter <jonathanh@nvidia.com> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-07-08  ext4: fix flags validity checking for EXT4_IOC_CHECKPOINT  (Theodore Ts'o)
Use the correct bitmask when checking for any not-yet-supported flags. Link: https://lore.kernel.org/r/20210702173425.1276158-1-tytso@mit.edu Fixes: 351a0a3fbc35 ("ext4: add ioctl EXT4_IOC_CHECKPOINT") Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Leah Rumancik <leah.rumancik@gmail.com>
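The fix boils down to masking against the set of flags the ioctl actually implements; a sketch of the pattern (the mask definition follows our reading of the fix and fs/ext4 naming, not guaranteed verbatim):

  #define EXT4_IOC_CHECKPOINT_FLAG_VALID (EXT4_IOC_CHECKPOINT_FLAG_DISCARD | \
					  EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT | \
					  EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN)

	/* reject any flag bit that is not yet supported */
	if (flags & ~EXT4_IOC_CHECKPOINT_FLAG_VALID)
		return -EINVAL;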
2021-07-08  ext4: fix possible UAF when remounting r/o a mmp-protected file system  (Theodore Ts'o)
After commit 618f003199c6 ("ext4: fix memory leak in ext4_fill_super"), once the file system is remounted read-only there is a race where the kmmpd thread can exit, causing sbi->s_mmp_tsk to point at freed memory, which the call to ext4_stop_mmpd() can trip over. Fix this by only allowing kmmpd() to exit when it is stopped via ext4_stop_mmpd(). Link: https://lore.kernel.org/r/20210707002433.3719773-1-tytso@mit.edu Reported-by: Ye Bin <yebin10@huawei.com> Bug-Report-Link: <20210629143603.2166962-1-yebin10@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
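A minimal sketch of the fixed lifetime rule (details of the MMP update elided): the thread no longer exits on its own when the fs goes read-only, so sbi->s_mmp_tsk can never dangle; it only exits when kthread_stop() is called from ext4_stop_mmpd().

  static int kmmpd(void *data)
  {
	struct super_block *sb = data;

	while (!kthread_should_stop()) {
		if (!sb_rdonly(sb)) {
			/* refresh the MMP block as before */
		}
		/* read-only: stay alive, just idle until stopped */
		schedule_timeout_interruptible(HZ);
	}
	return 0;
  }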
2021-07-08  virtio-mem: prioritize unplug from ZONE_MOVABLE in Big Block Mode  (David Hildenbrand)
Let's handle unplug in Big Block Mode similar to Sub Block Mode -- prioritize memory blocks onlined to ZONE_MOVABLE. We won't care further about big blocks with mixed zones, as it's rather a corner case that won't matter in practice. Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20210602185720.31821-8-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-08  virtio-mem: simplify high-level unplug handling in Big Block Mode  (David Hildenbrand)
Let's simplify high-level big block selection when unplugging in Big Block Mode. Combine handling of offline and online blocks. We can get rid of virtio_mem_bbm_bb_is_offline() and simply use virtio_mem_bbm_offline_remove_and_unplug_bb(), as that already tolerates offline parts. We can race with concurrent onlining/offlining either way, so we don't have to be super correct by failing if an offline big block we'd like to unplug just got (partially) onlined. Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20210602185720.31821-7-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-08  virtio-mem: prioritize unplug from ZONE_MOVABLE in Sub Block Mode  (David Hildenbrand)
Until now, memory provided by a single virtio-mem device was usually either onlined completely to ZONE_MOVABLE (online_movable) or to ZONE_NORMAL (online_kernel); however, that will change in the future. There are two reasons why we want to track which zone a memory block belongs to and prioritize ZONE_MOVABLE blocks: 1) Memory managed by ZONE_MOVABLE can more likely get unplugged, resulting in a faster memory hotunplug process. Further, we can more reliably unplug and remove complete memory blocks, removing metadata allocated for the whole memory block. 2) We want to avoid corner cases where unplugging with the current scheme (highest to lowest address) could result in accidental zone imbalances, whereby we remove too much ZONE_NORMAL memory for ZONE_MOVABLE memory of the same device. Let's track the zone via memory block states and try to unplug from ZONE_MOVABLE first. Rename VIRTIO_MEM_SBM_MB_ONLINE* to VIRTIO_MEM_SBM_MB_KERNEL* to avoid even longer state names. In commit 27f852795a06 ("virtio-mem: don't special-case ZONE_MOVABLE"), we removed slightly similar tracking for fully plugged memory blocks to support unplugging from ZONE_MOVABLE at all -- as we didn't allow partially plugged memory blocks in ZONE_MOVABLE before that. That commit already mentioned "In the future, we might want to remember the zone again and use the information when (un)plugging memory." Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20210602185720.31821-6-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-08  virtio-mem: simplify high-level unplug handling in Sub Block Mode  (David Hildenbrand)
Let's simplify by introducing a new virtio_mem_sbm_unplug_any_sb(), similar to virtio_mem_sbm_plug_any_sb(), to simplify high-level memory block selection when unplugging in Sub Block Mode. Rename existing virtio_mem_sbm_unplug_any_sb() to virtio_mem_sbm_unplug_any_sb_raw(). The only change is that we now temporarily unlock the hotplug mutex around cond_resched() when processing offline memory blocks, which doesn't make a real difference as we already have to temporarily unlock in virtio_mem_sbm_unplug_any_sb_offline() when removing a memory block. Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20210602185720.31821-5-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-08  virtio-mem: simplify high-level plug handling in Sub Block Mode  (David Hildenbrand)
Let's simplify high-level memory block selection when plugging in Sub Block Mode. No need for two separate loops when selecting memory blocks for plugging memory. Avoid passing the "online" state by simply obtaining the state in virtio_mem_sbm_plug_any_sb(). Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20210602185720.31821-4-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-08  virtio-mem: use page_zonenum() in virtio_mem_fake_offline()  (David Hildenbrand)
Let's use page_zonenum() instead of zone_idx(page_zone()). Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20210602185720.31821-3-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
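The two spellings are equivalent; page_zonenum() just skips the detour through the zone pointer:

	enum zone_type zid;

	zid = zone_idx(page_zone(page));	/* old: page -> zone -> index */
	zid = page_zonenum(page);		/* new: decoded from page->flags */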
2021-07-08  virtio-mem: don't read big block size in Sub Block Mode  (David Hildenbrand)
We are reading a Big Block Mode value while in Sub Block Mode when initializing. Fortunately, vm->bbm.bb_size maps to some counter in the vm->sbm.mb_count array, which is 0 at that point in time. No harm done; still, this was unintended and is not future-proof. Fixes: 4ba50cd3355d ("virtio-mem: Big Block Mode (BBM) memory hotplug") Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20210602185720.31821-2-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-08  virtio/vdpa: clear the virtqueue state during probe  (Eli Cohen)
Clear the available index as part of the initialization process to clear any values that might be left from previous usage of the device. For example, if the device was previously used by vhost_vdpa and is now probed by virtio_vdpa, you want to start with fresh indices. Fixes: c043b4a8cf3b ("virtio: introduce a vDPA based transport") Signed-off-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20210602021536.39525-5-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Eli Cohen <elic@nvidia.com>
2021-07-08  vp_vdpa: allow set vq state to initial state after reset  (Jason Wang)
We used to fail the set_vq_state() since it was not supported yet by the virtio spec. But if the bus tries to set the state which is equal to the device initial state after reset, we can let it go. This is a must for virtio_vdpa() to set vq state during probe which is required for some vDPA parents. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20210602021536.39525-4-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Eli Cohen <elic@nvidia.com>
2021-07-08  virtio-pci library: introduce vp_modern_get_driver_features()  (Jason Wang)
This patch introduces a helper to get the driver/guest features from the device. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20210602021536.39525-3-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Eli Cohen <elic@nvidia.com>
2021-07-08  vdpa: support packed virtqueue for set/get_vq_state()  (Jason Wang)
This patch extends the vdpa_vq_state to support packed virtqueue state, which basically consists of the device/driver ring wrap counters and the avail and used indices. This will be used for the virtio-vdpa support for the packed virtqueue and the future vhost/vhost-vdpa support for the packed virtqueue. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20210602021536.39525-2-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Eli Cohen <elic@nvidia.com>
2021-07-08  virtio-ring: store DMA metadata in desc_extra for split virtqueue  (Jason Wang)
For the split virtqueue, we used to depend on the address, length and flags stored in the descriptor ring for DMA unmapping. This is unsafe since the device can manipulate the behavior of the virtio driver, IOMMU drivers and swiotlb. For safety, maintain the DMA address, DMA length, descriptor flags and next field of the non-indirect descriptors in vring_desc_state_extra when the DMA API is used for virtio, as we did for the packed virtqueue, and use that metadata for performing DMA operations. Indirect descriptors should be safe since they are using streaming mappings. With this, the descriptor ring is write-only from the view of the driver. This slightly increases the footprint of the driver, but it's not noticeable in pktgen (64B) and netperf tests with virtio-net. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20210604055350.58753-8-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
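The per-descriptor shadow state built up across this and the following virtio_ring patches looks roughly like this (field comments ours; the struct carries the name from the rename further down this log):

  struct vring_desc_extra {
	dma_addr_t addr;	/* DMA address of the buffer */
	u32 len;		/* DMA length */
	u16 flags;		/* descriptor flags */
	u16 next;		/* linked next descriptor */
  };

The driver unmaps and chains using these fields instead of re-reading the device-writable descriptor ring.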
2021-07-08  virtio: use err label in __vring_new_virtqueue()  (Jason Wang)
Use an error label for unwinding in __vring_new_virtqueue(). This is useful for future refactoring. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20210604055350.58753-7-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-08  virtio_ring: introduce virtqueue_desc_add_split()  (Jason Wang)
This patch introduces a helper for storing a descriptor in the descriptor table for the split virtqueue. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20210604055350.58753-6-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-08  virtio_ring: secure handling of mapping errors  (Jason Wang)
We should not depend on the DMA address, length and flags of the descriptor table since they could be written with arbitrary values by the device. So this patch switches to using the copies stored in desc_extra. Note that the indirect descriptors are fine since they are read-only streaming mappings. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20210604055350.58753-5-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-08  virtio-ring: factor out desc_extra allocation  (Jason Wang)
A helper is introduced for the logic of allocating the descriptor extra data. This will be reused by split virtqueue. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20210604055350.58753-4-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-08  virtio_ring: rename vring_desc_extra_packed  (Jason Wang)
Rename vring_desc_extra_packed to vring_desc_extra since the structure is pretty generic and could be reused by the split virtqueue as well. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20210604055350.58753-3-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-08  virtio-ring: maintain next in extra state for packed virtqueue  (Jason Wang)
This patch moves next from vring_desc_state_packed to vring_desc_extra_packed. This makes it simpler to let the extra state be reused by the split virtqueue. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20210604055350.58753-2-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-08  vdpa/mlx5: Clear vq ready indication upon device reset  (Eli Cohen)
After device reset, the virtqueues are not ready, so clear the ready field. Failing to do so can result in virtio_vdpa failing to load if the device was previously used by vhost_vdpa and the old ready values are still set. virtio_vdpa expects to find VQs in the "not ready" state. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Eli Cohen <elic@nvidia.com> Link: https://lore.kernel.org/r/20210606053128.170399-1-elic@nvidia.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2021-07-08  vdpa/mlx5: Add support for doorbell bypassing  (Eli Cohen)
Implement mlx5_get_vq_notification() to return the doorbell address. Since the notification area is mapped to userspace, make sure that the BAR size is at least PAGE_SIZE large. Signed-off-by: Eli Cohen <elic@nvidia.com> Link: https://lore.kernel.org/r/20210603081153.5750-1-elic@nvidia.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2021-07-08  virtio_net: disable cb aggressively  (Michael S. Tsirkin)
There are currently two cases where we poll the TX vq not in response to a callback: start xmit and rx napi. We currently do this with callbacks enabled, which can cause extra interrupts from the card. This used not to be a big issue as we ran with interrupts disabled, but that is no longer the case, and in some cases the rate of spurious interrupts is so high that Linux detects this and actually kills the interrupt. Fix up by disabling the callbacks before polling the tx vq. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
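A hedged sketch of the polling pattern (free_old_xmit() is an illustrative stand-in for the driver's reclaim helper; virtqueue_disable_cb()/virtqueue_enable_cb_delayed() are the real virtio_ring API):

  static void virtnet_poll_tx_quiet(struct send_queue *sq)
  {
	virtqueue_disable_cb(sq->vq);		/* suppress device interrupts while polling */
	free_old_xmit(sq);			/* reclaim completed tx buffers */
	virtqueue_enable_cb_delayed(sq->vq);	/* re-arm callbacks when done */
  }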
2021-07-08  ALSA: intel8x0: Fix breakage at ac97 clock measurement  (Takashi Iwai)
The recent workaround for the wild interrupts in commit c1f0616124c4 ("ALSA: intel8x0: Don't update period unless prepared") led to a regression, causing an interrupt storm during the ac97 clock measurement at driver probe. We need to handle interrupts during the clock measurement as well as for the proper PCM streams. Fixes: c1f0616124c4 ("ALSA: intel8x0: Don't update period unless prepared") Reported-and-tested-by: Max Filippov <jcmvbkbc@gmail.com> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/CAMo8BfKKMQkcsbOQaeEjq_FsJhdK=fn598dvh7YOcZshUSOH=g@mail.gmail.com Link: https://lore.kernel.org/r/20210708090738.1569-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2021-07-08  drm/i915/gvt: Clear d3_entered on elsp cmd submission.  (Colin Xu)
The d3_entered flag is used to mark for vgpu_reset a previous power transition from D3->D0, typically for VM resume from S3, so that gvt can skip PPGTT invalidation in the current vgpu_reset during resuming. In the case of S0ix exit, although there is a D3->D0 transition, the guest driver continues to use the vgpu as normal, with d3_entered set, until the next shutdown/reboot or power transition. If a reboot follows an S0ix exit, the device power state transitions as D0->D3->D0->D0(reboot), while the system power state transitions as S0->S0 (reboot). There is no vgpu_reset until D0(reboot), thus d3_entered won't be cleared, and the vgpu_reset will skip PPGTT invalidation even though those PPGTT entries are no longer valid. Errors appear like: gvt: vgpu 2: vfio_pin_pages failed for gfn 0xxxxx, ret -22 gvt: vgpu 2: fail: spt xxxx guest entry 0xxxxx type 2 gvt: vgpu 2: fail: shadow page xxxx guest entry 0xxxxx type 2. Give gvt a chance to clear d3_entered on elsp cmd submission so that the states before & after S0ix enter/exit are consistent. Fixes: ba25d977571e ("drm/i915/gvt: Do not destroy ppgtt_mm during vGPU D3->D0.") Reviewed-by: Zhenyu Wang <zhenyuw@linux.intel.com> Signed-off-by: Colin Xu <colin.xu@intel.com> Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20210707004531.4873-1-colin.xu@intel.com
2021-07-08  skbuff: Fix build with SKB extensions disabled  (Florian Fainelli)
We will fail to build with CONFIG_SKB_EXTENSIONS disabled after 8550ff8d8c75 ("skbuff: Release nfct refcount on napi stolen or re-used skbs") since there is an unconditional use of skb_ext_find() without an appropriate stub. Simply build the code conditionally and properly guard against both CONFIG_SKB_EXTENSIONS as well as CONFIG_NET_TC_SKB_EXT being disabled. Fixes: 8550ff8d8c75 ("skbuff: Release nfct refcount on napi stolen or re-used skbs") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-07  ipmr: Fix indentation issue  (Roy, UjjaL)
Fixed indentation by removing extra spaces. Signed-off-by: Roy, UjjaL <royujjal@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-07  sock: unlock on error in sock_setsockopt()  (Dan Carpenter)
If copy_from_sockptr() fails then we need to unlock before returning. Fixes: d463126e23f1 ("net: sock: extend SO_TIMESTAMPING for PHC binding") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
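The shape of the fix, sketched (case label and surrounding declarations are our reading of the SO_TIMESTAMPING path, not guaranteed verbatim): within sock_setsockopt()'s big switch, error paths must break to the common release_sock() rather than return directly.

	lock_sock(sk);
	switch (optname) {
	/* ... */
	case SO_TIMESTAMPING_NEW:
		if (copy_from_sockptr(&timestamping, optval,
				      sizeof(timestamping))) {
			ret = -EFAULT;	/* was "return -EFAULT;", leaking the lock */
			break;		/* break reaches release_sock() below */
		}
		/* ... */
		break;
	}
	release_sock(sk);
	return ret;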
2021-07-07  samples/bpf: xdp_redirect_cpu_user: Cpumap qsize set larger default  (Jesper Dangaard Brouer)
Experience from production shows a queue size of 192 is too small, as this caused packet drops during cpumap-enqueue on the RX-CPU. This can be diagnosed with the xdp_monitor sample program. This bpftrace program was used to diagnose the problem in more detail: bpftrace -e ' tracepoint:xdp:xdp_cpumap_kthread { @deq_bulk = lhist(args->processed,0,10,1); @drop_net = lhist(args->drops,0,10,1) } tracepoint:xdp:xdp_cpumap_enqueue { @enq_bulk = lhist(args->processed,0,10,1); @enq_drops = lhist(args->drops,0,10,1); }' Watch out for the @enq_drops counter. The @drop_net counter can happen when the netstack gets invalid packets, so don't despair, it can be natural, and that counter will likely disappear in newer kernels as it was a source of confusion (look at the netstat info for the reason behind the netstack @drop_net counters). The production system was configured with the CPU power-saving C6 state. Learn more in this blogpost[1]. The wakeup latencies in usec for the states are: # grep -H . /sys/devices/system/cpu/cpu0/cpuidle/*/latency /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0 /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2 /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10 /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:133 The deepest state takes 133 usec to wake up from (133/10^6 s). The link speed is 25Gbit/s ((25*10^9/8) in bytes/sec). How many bytes can arrive within 133 usec at this speed: (25*10^9/8)*(133/10^6) = 415625 bytes. With MTU-size packets this is 275 packets, and with minimum Ethernet frames (incl. intergap overhead) of 84 bytes it is 4948 packets. Clearly the default queue size is too small. Set the default cpumap queue to 2048: the worst case (small packets) at 10Gbit/s is 1979 packets within the 133 usec wakeup time, plus 64 packets before the kthread wakeup call (due to xdp_do_flush), for a worst case of 2043 packets. Thus, if a packet burst on the RX-CPU enqueues packets to a remote cpumap CPU that is in a deep-sleep state, it can overrun the cpumap queue. The production system was also configured to avoid deep sleep via: tuned-adm profile network-latency [1] https://jeremyeder.com/2013/08/30/oh-did-you-expect-the-cpu/ Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/162523477604.786243.13372630844944530891.stgit@firesoul
2021-07-07  Merge branch 'Generic XDP improvements'  (Alexei Starovoitov)
Kumar Kartikeya says: ==================== This small series makes some improvements to generic XDP mode and brings it closer to native XDP. Patch 1 splits out generic XDP processing into reusable parts, patch 2 adds pointer friendly wrappers for bitops (not have to cast back and forth the address of local pointer to unsigned long *), patch 3 implements generic cpumap support (details in commit) and patch 4 allows devmap bpf prog execution before generic_xdp_tx is called. Patch 5 just updates a couple of selftests to adapt to changes in behavior (in that specifying devmap/cpumap prog fd in generic mode is now allowed). Changelog: ---------- v5 -> v6 v5: https://lore.kernel.org/bpf/20210701002759.381983-1-memxor@gmail.com * Put rcpu->prog check before RCU-bh section to avoid do_softirq (Jesper) v4 -> v5 v4: https://lore.kernel.org/bpf/20210628114746.129669-1-memxor@gmail.com * Add comments and examples for new bitops macros (Alexei) v3 -> v4 v3: https://lore.kernel.org/bpf/20210622202835.1151230-1-memxor@gmail.com * Add detach now that attach of XDP program succeeds (Toke) * Clean up the test to use new ASSERT macros v2 -> v3 v2: https://lore.kernel.org/bpf/20210622195527.1110497-1-memxor@gmail.com * list_for_each_entry -> list_for_each_entry_safe (due to deletion of skb) v1 -> v2 v1: https://lore.kernel.org/bpf/20210620233200.855534-1-memxor@gmail.com * Move __ptr_{set,clear,test}_bit to bitops.h (Toke) Also changed argument order to match the bit op they wrap. * Remove map value size checking functions for cpumap/devmap (Toke) * Rework prog run for skb in cpu_map_kthread_run (Toke) * Set skb->dev to dst->dev after devmap prog has run * Don't set xdp rxq that will be overwritten in cpumap prog run ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-07-07  bpf: Tidy xdp attach selftests  (Kumar Kartikeya Dwivedi)
Support for cpumap and devmap entry progs in previous commits means the test needs to be updated for the new semantics. Also take this opportunity to convert it from CHECK macros to the new ASSERT macros. Since xdp_cpumap_attach has no subtest, put the sole test inside the test_xdp_cpumap_attach function. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-6-memxor@gmail.com
2021-07-07  bpf: devmap: Implement devmap prog execution for generic XDP  (Kumar Kartikeya Dwivedi)
This lifts the restriction on running devmap BPF progs in generic redirect mode. To match native XDP behavior, it is invoked right before generic_xdp_tx is called, and only supports XDP_PASS/XDP_ABORTED/ XDP_DROP actions. We also return 0 even if devmap program drops the packet, as semantically redirect has already succeeded and the devmap prog is the last point before TX of the packet to device where it can deliver a verdict on the packet. This also means it must take care of freeing the skb, as xdp_do_generic_redirect callers only do that in case an error is returned. Since devmap entry prog is supported, remove the check in generic_xdp_install entirely. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-5-memxor@gmail.com
2021-07-07  bpf: cpumap: Implement generic cpumap  (Kumar Kartikeya Dwivedi)
This change implements CPUMAP redirect support for generic XDP programs. The idea is to reuse the cpu map entry's queue that is used to push native xdp frames for redirecting skb to a different CPU. This will match native XDP behavior (in that RPS is invoked again for packet reinjected into networking stack). To be able to determine whether the incoming skb is from the driver or cpumap, we reuse skb->redirected bit that skips generic XDP processing when it is set. To always make use of this, CONFIG_NET_REDIRECT guard on it has been lifted and it is always available. From the redirect side, we add the skb to ptr_ring with its lowest bit set to 1. This should be safe as skb is not 1-byte aligned. This allows kthread to discern between xdp_frames and sk_buff. On consumption of the ptr_ring item, the lowest bit is unset. In the end, the skb is simply added to the list that kthread is anyway going to maintain for xdp_frames converted to skb, and then received again by using netif_receive_skb_list. Bulking optimization for generic cpumap is left as an exercise for a future patch for now. Since cpumap entry progs are now supported, also remove check in generic_xdp_install for the cpumap. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-4-memxor@gmail.com
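The tagging trick, condensed into a sketch (rcpu->queue and the frames[] loop follow the existing cpumap naming; the __ptr_*_bit() helpers come from the bitops patch below in this log):

	/* enqueue side: mark the pointer as carrying an skb, not an xdp_frame */
	__ptr_set_bit(0, &skb);
	ret = ptr_ring_produce(rcpu->queue, skb);

	/* kthread side: skbs and xdp_frames share the same ring */
	if (__ptr_test_bit(0, &frames[i])) {
		struct sk_buff *skb = frames[i];

		__ptr_clear_bit(0, &skb);
		list_add_tail(&skb->list, &list);	/* netif_receive_skb_list() later */
		continue;
	}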
2021-07-07  bitops: Add non-atomic bitops for pointers  (Kumar Kartikeya Dwivedi)
cpumap needs to set, clear, and test the lowest bit in the skb pointer in various places. To make these checks less noisy, add pointer-friendly bitop macros that also do some typechecking to sanitize the argument. These wrap the non-atomic bitops __set_bit, __clear_bit, and test_bit but for pointer arguments. The pointer's address has to be passed in, and it is treated as an unsigned long *, since the width and representation of pointer and unsigned long match on the targets Linux supports. They are prefixed with a double underscore to indicate lack of atomicity. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-3-memxor@gmail.com
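The wrappers have roughly this shape (slightly abridged; __ptr_clear_bit() is analogous, and typecheck_pointer() is the typecheck helper added alongside):

  #define __ptr_set_bit(nr, addr)                         \
	({                                                \
		typecheck_pointer(*(addr));               \
		__set_bit(nr, (unsigned long *)(addr));   \
	})

  #define __ptr_test_bit(nr, addr)                        \
	({                                                \
		typecheck_pointer(*(addr));               \
		test_bit(nr, (unsigned long *)(addr));    \
	})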
2021-07-07  net: core: Split out code to run generic XDP prog  (Kumar Kartikeya Dwivedi)
This helper can later be utilized in code that runs cpumap and devmap programs in generic redirect mode and adjusts the skb based on changes made to the xdp_buff. When returning XDP_REDIRECT/XDP_TX, it invokes __skb_push, so whenever a generic redirect path invokes a devmap/cpumap prog (if set), it must __skb_pull again, as we expect the mac header to be pulled. It also drops the skb_reset_mac_len call after do_xdp_generic, as the mac_header and network_header are advanced by the same offset, so the difference (mac_len) remains constant. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-2-memxor@gmail.com
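The split-out helper reads roughly as follows from a caller's perspective (prototype per our reading of the patch; treat it as a sketch):

	u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
				     struct bpf_prog *xdp_prog);

	act = bpf_prog_run_generic_xdp(skb, &xdp, xdp_prog);
	/* for XDP_REDIRECT/XDP_TX the mac header is now pushed; redirect
	 * paths that keep processing the skb must __skb_pull() it again */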
2021-07-07  Merge branch 'bpf: support input xdp_md context in BPF_PROG_TEST_RUN'  (Alexei Starovoitov)
Zvi Effron says: ==================== This patchset adds support for passing an xdp_md via ctx_in/ctx_out in bpf_attr for BPF_PROG_TEST_RUN of XDP programs. Patch 1 adds a function to validate XDP meta data lengths. Patch 2 adds initial support for passing XDP meta data in addition to packet data. Patch 3 adds support for also specifying the ingress interface and rx queue. Patch 4 adds selftests to ensure functionality is correct. Changelog: ---------- v7->v8 v7: https://lore.kernel.org/bpf/20210624211304.90807-1-zeffron@riotgames.com/ * Fix too long comment line in patch 3 v6->v7 v6: https://lore.kernel.org/bpf/20210617232904.1899-1-zeffron@riotgames.com/ * Add Yonghong Song's Acked-by to commit message in patch 1 * Add Yonghong Song's Acked-by to commit message in patch 2 * Extracted the post-update of the xdp_md context into a function (again) * Validate that the rx queue was registered with XDP info * Decrement the reference count on a found netdevice on failure to find a valid rx queue * Decrement the reference count on a found netdevice after the XDP program is run * Drop Yonghong Song's Acked-By for patch 3 because of patch changes * Improve a comment in the selftests * Drop Yonghong Song's Acked-By for patch 4 because of patch changes v5->v6 v5: https://lore.kernel.org/bpf/20210616224712.3243-1-zeffron@riotgames.com/ * Correct commit messages in patches 1 and 3 * Add Acked-by to commit message in patch 4 * Use gotos instead of returns to correctly free resources in bpf_prog_test_run_xdp * Rename xdp_metalen_valid to xdp_metalen_invalid * Improve the function signature for xdp_metalen_invalid * Merged declaration of ingress_ifindex and rx_queue_index into one line v4->v5 v4: https://lore.kernel.org/bpf/20210604220235.6758-1-zeffron@riotgames.com/ * Add new patch to introduce xdp_metalen_valid inline function to avoid duplicated code from net/core/filter.c * Correct size of bad_ctx in selftests * Make all declarations reverse Christmas tree * Move data check from xdp_convert_md_to_buff to bpf_prog_test_run_xdp * Merge xdp_convert_buff_to_md into bpf_prog_test_run_xdp * Fix line too long * Extracted common checks in selftests to a helper function * Removed redundant assignment in selftests * Reordered test cases in selftests * Check data against 0 instead of data_meta in selftests * Made selftests use EINVAL instead of hardcoded 22 * Dropped "_" from XDP function name * Changed casts in XDP program from unsigned long to long * Added a comment explaining the use of the loopback interface in selftests * Change parameter order in xdp_convert_md_to_buff to be input first * Assigned xdp->ingress_ifindex and xdp->rx_queue_index to local variables in xdp_convert_md_to_buff * Made use of "meta data" versus "metadata" consistent in comments and commit messages v3->v4 v3: https://lore.kernel.org/bpf/20210602190815.8096-1-zeffron@riotgames.com/ * Clean up nits * Validate xdp_md->data_end in bpf_prog_test_run_xdp * Remove intermediate metalen variables v2 -> v3 v2: https://lore.kernel.org/bpf/20210527201341.7128-1-zeffron@riotgames.com/ * Check errno first in selftests * Use DECLARE_LIBBPF_OPTS * Rename tattr to opts in selftests * Remove extra new line * Rename convert_xdpmd_to_xdpb to xdp_convert_md_to_buff * Rename convert_xdpb_to_xdpmd to xdp_convert_buff_to_md * Move declaration of device and rxqueue in xdp_convert_md_to_buff to patch 2 * Reorder the kfree calls in bpf_prog_test_run_xdp v1 -> v2 v1: https://lore.kernel.org/bpf/20210524220555.251473-1-zeffron@riotgames.com * Fix null pointer dereference with no context * Use the BPF skeleton and replace CHECK with ASSERT macros ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-07-07  selftests/bpf: Add test for xdp_md context in BPF_PROG_TEST_RUN  (Zvi Effron)
Add a test for using xdp_md as a context to BPF_PROG_TEST_RUN for XDP programs. The test uses a BPF program that takes in a return value from XDP meta data, then reduces the size of the XDP meta data by 4 bytes. Test cases validate the possible failure cases for passing in invalid xdp_md contexts, that the return value is successfully passed in, and that the adjusted meta data is successfully copied out. Co-developed-by: Cody Haas <chaas@riotgames.com> Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Cody Haas <chaas@riotgames.com> Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Zvi Effron <zeffron@riotgames.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210707221657.3985075-5-zeffron@riotgames.com
2021-07-07  bpf: Support specifying ingress via xdp_md context in BPF_PROG_TEST_RUN  (Zvi Effron)
Support specifying the ingress_ifindex and rx_queue_index of xdp_md contexts for BPF_PROG_TEST_RUN. The intended use case is to allow testing XDP programs that make decisions based on the ingress interface or RX queue. If ingress_ifindex is specified, look up the device by the provided index in the current namespace and use its xdp_rxq for the xdp_buff. If the rx_queue_index is out of range, or is non-zero when the ingress_ifindex is 0, return -EINVAL. Co-developed-by: Cody Haas <chaas@riotgames.com> Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Cody Haas <chaas@riotgames.com> Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Zvi Effron <zeffron@riotgames.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210707221657.3985075-4-zeffron@riotgames.com
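From userspace, the new capability is exercised roughly like this (a libbpf sketch; pkt and prog_fd are assumed to be set up elsewhere, and ifindex 1 assumes loopback, as the selftest uses):

	/* pkt[] is a prepared packet buffer; prog_fd a loaded XDP program */
	struct xdp_md ctx_in = {
		.data_end = sizeof(pkt),	/* no metadata, so data stays 0 */
		.ingress_ifindex = 1,		/* assumption: loopback */
		.rx_queue_index = 0,		/* must be valid for that device */
	};
	DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
		.data_in = pkt,
		.data_size_in = sizeof(pkt),
		.ctx_in = &ctx_in,
		.ctx_size_in = sizeof(ctx_in),
	);

	err = bpf_prog_test_run_opts(prog_fd, &opts);
	/* opts.retval now holds the program's XDP verdict */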