summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-01-18pinctrl: qcom: Don't clear pending interrupts when enablingDouglas Anderson
In Linux, if a driver does disable_irq() and later does enable_irq() on its interrupt, I believe it's expecting these properties: * If an interrupt was pending when the driver disabled then it will still be pending after the driver re-enables. * If an edge-triggered interrupt comes in while an interrupt is disabled it should assert when the interrupt is re-enabled. If you think that the above sounds a lot like the disable_irq() and enable_irq() are supposed to be masking/unmasking the interrupt instead of disabling/enabling it then you've made an astute observation. Specifically when talking about interrupts, "mask" usually means to stop posting interrupts but keep tracking them and "disable" means to fully shut off interrupt detection. It's unfortunate that this is so confusing, but presumably this is all the way it is for historical reasons. Perhaps more confusing than the above is that, even though clients of IRQs themselves don't have a way to request mask/unmask vs. disable/enable calls, IRQ chips themselves can implement both. ...and yet more confusing is that if an IRQ chip implements disable/enable then they will be called when a client driver calls disable_irq() / enable_irq(). It does feel like some of the above could be cleared up. However, without any other core interrupt changes it should be clear that when an IRQ chip gets a request to "disable" an IRQ that it has to treat it like a mask of that IRQ. In any case, after that long interlude you can see that the "unmask and clear" can break things. Maulik tried to fix it so that we no longer did "unmask and clear" in commit 71266d9d3936 ("pinctrl: qcom: Move clearing pending IRQ to .irq_request_resources callback"), but it only handled the PDC case and it had problems (it caused sc7180-trogdor devices to fail to suspend). Let's fix. >From my understanding the source of the phantom interrupt in the were these two things: 1. One that could have been introduced in msm_gpio_irq_set_type() (only for the non-PDC case). 2. Edges could have been detected when a GPIO was muxed away. Fixing case #1 is easy. We can just add a clear in msm_gpio_irq_set_type(). Fixing case #2 is harder. Let's use a concrete example. In sc7180-trogdor.dtsi we configure the uart3 to have two pinctrl states, sleep and default, and mux between the two during runtime PM and system suspend (see geni_se_resources_{on,off}() for more details). The difference between the sleep and default state is that the RX pin is muxed to a GPIO during sleep and muxed to the UART otherwise. As per Qualcomm, when we mux the pin over to the UART function the PDC (or the non-PDC interrupt detection logic) is still watching it / latching edges. These edges don't cause interrupts because the current code masks the interrupt unless we're entering suspend. However, as soon as we enter suspend we unmask the interrupt and it's counted as a wakeup. Let's deal with the problem like this: * When we mux away, we'll mask our interrupt. This isn't necessary in the above case since the client already masked us, but it's a good idea in general. * When we mux back will clear any interrupts and unmask. Fixes: 4b7618fdc7e6 ("pinctrl: qcom: Add irq_enable callback for msm gpio") Fixes: 71266d9d3936 ("pinctrl: qcom: Move clearing pending IRQ to .irq_request_resources callback") Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Maulik Shah <mkshah@codeaurora.org> Tested-by: Maulik Shah <mkshah@codeaurora.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Link: https://lore.kernel.org/r/20210114191601.v7.4.I7cf3019783720feb57b958c95c2b684940264cd1@changeid Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-01-18pinctrl: qcom: Properly clear "intr_ack_high" interrupts when unmaskingDouglas Anderson
In commit 4b7618fdc7e6 ("pinctrl: qcom: Add irq_enable callback for msm gpio") we tried to Ack interrupts during unmask. However, that patch forgot to check "intr_ack_high" so, presumably, it only worked for a certain subset of SoCs. Let's add a small accessor so we don't need to open-code the logic in both places. This was found by code inspection. I don't have any access to the hardware in question nor software that needs the Ack during unmask. Fixes: 4b7618fdc7e6 ("pinctrl: qcom: Add irq_enable callback for msm gpio") Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Maulik Shah <mkshah@codeaurora.org> Tested-by: Maulik Shah <mkshah@codeaurora.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Link: https://lore.kernel.org/r/20210114191601.v7.3.I32d0f4e174d45363b49ab611a13c3da8f1e87d0f@changeid Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-01-18pinctrl: qcom: No need to read-modify-write the interrupt statusDouglas Anderson
When the Qualcomm pinctrl driver wants to Ack an interrupt, it does a read-modify-write on the interrupt status register. On some SoCs it makes sure that the status bit is 1 to "Ack" and on others it makes sure that the bit is 0 to "Ack". Presumably the first type of interrupt controller is a "write 1 to clear" type register and the second just let you directly set the interrupt status register. As far as I can tell from scanning structure definitions, the interrupt status bit is always in a register by itself. Thus with both types of interrupt controllers it is safe to "Ack" interrupts without doing a read-modify-write. We can do a simple write. It should be noted that if the interrupt status bit _was_ ever in a register with other things (like maybe status bits for other GPIOs): a) For "write 1 clear" type controllers then read-modify-write would be totally wrong because we'd accidentally end up clearing interrupts we weren't looking at. b) For "direct set" type controllers then read-modify-write would also be wrong because someone setting one of the other bits in the register might accidentally clear (or set) our interrupt. I say this simply to show that the current read-modify-write doesn't provide any sort of "future proofing" of the code. In fact (for "write 1 clear" controllers) the new code is slightly more "future proof" since it would allow more than one interrupt status bits to share a register. NOTE: this code fixes no bugs--it simply avoids an extra register read. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Maulik Shah <mkshah@codeaurora.org> Tested-by: Maulik Shah <mkshah@codeaurora.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Link: https://lore.kernel.org/r/20210114191601.v7.2.I3635de080604e1feda770591c5563bd6e63dd39d@changeid Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-01-18pinctrl: qcom: Allow SoCs to specify a GPIO function that's not 0Douglas Anderson
There's currently a comment in the code saying function 0 is GPIO. Instead of hardcoding it, let's add a member where an SoC can specify it. No known SoCs use a number other than 0, but this just makes the code clearer. NOTE: no SoC code needs to be updated since we can rely on zero-initialization. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Reviewed-by: Maulik Shah <mkshah@codeaurora.org> Tested-by: Maulik Shah <mkshah@codeaurora.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Link: https://lore.kernel.org/r/20210114191601.v7.1.I3ad184e3423d8e479bc3e86f5b393abb1704a1d1@changeid Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-01-18btrfs: don't clear ret in btrfs_start_dirty_block_groupsJosef Bacik
If we fail to update a block group item in the loop we'll break, however we'll do btrfs_run_delayed_refs and lose our error value in ret, and thus not clean up properly. Fix this by only running the delayed refs if there was no failure. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-18btrfs: fix lockdep splat in btrfs_recover_relocationJosef Bacik
While testing the error paths of relocation I hit the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.10.0-rc6+ #217 Not tainted ------------------------------------------------------ mount/779 is trying to acquire lock: ffffa0e676945418 (&fs_info->balance_mutex){+.+.}-{3:3}, at: btrfs_recover_balance+0x2f0/0x340 but task is already holding lock: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-root-00){++++}-{3:3}: down_read_nested+0x43/0x130 __btrfs_tree_read_lock+0x27/0x100 btrfs_read_lock_root_node+0x31/0x40 btrfs_search_slot+0x462/0x8f0 btrfs_update_root+0x55/0x2b0 btrfs_drop_snapshot+0x398/0x750 clean_dirty_subvols+0xdf/0x120 btrfs_recover_relocation+0x534/0x5a0 btrfs_start_pre_rw_mount+0xcb/0x170 open_ctree+0x151f/0x1726 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 path_mount+0x433/0xc10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (sb_internal#2){.+.+}-{0:0}: start_transaction+0x444/0x700 insert_balance_item.isra.0+0x37/0x320 btrfs_balance+0x354/0xf40 btrfs_ioctl_balance+0x2cf/0x380 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&fs_info->balance_mutex){+.+.}-{3:3}: __lock_acquire+0x1120/0x1e10 lock_acquire+0x116/0x370 __mutex_lock+0x7e/0x7b0 btrfs_recover_balance+0x2f0/0x340 open_ctree+0x1095/0x1726 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 path_mount+0x433/0xc10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: &fs_info->balance_mutex --> sb_internal#2 --> btrfs-root-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-root-00); lock(sb_internal#2); lock(btrfs-root-00); lock(&fs_info->balance_mutex); *** DEADLOCK *** 2 locks held by mount/779: #0: ffffa0e60dc040e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380 #1: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100 stack backtrace: CPU: 0 PID: 779 Comm: mount Not tainted 5.10.0-rc6+ #217 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb0 check_noncircular+0xcf/0xf0 ? trace_call_bpf+0x139/0x260 __lock_acquire+0x1120/0x1e10 lock_acquire+0x116/0x370 ? btrfs_recover_balance+0x2f0/0x340 __mutex_lock+0x7e/0x7b0 ? btrfs_recover_balance+0x2f0/0x340 ? btrfs_recover_balance+0x2f0/0x340 ? rcu_read_lock_sched_held+0x3f/0x80 ? kmem_cache_alloc_trace+0x2c4/0x2f0 ? btrfs_get_64+0x5e/0x100 btrfs_recover_balance+0x2f0/0x340 open_ctree+0x1095/0x1726 btrfs_mount_root.cold+0x12/0xea ? rcu_read_lock_sched_held+0x3f/0x80 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 ? __kmalloc_track_caller+0x2f2/0x320 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 ? capable+0x3a/0x60 path_mount+0x433/0xc10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This is straightforward to fix, simply release the path before we setup the balance_ctl. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-18btrfs: do not double free backref nodes on errorJosef Bacik
Zygo reported the following KASAN splat: BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420 Read of size 8 at addr ffff888112402950 by task btrfs/28836 CPU: 0 PID: 28836 Comm: btrfs Tainted: G W 5.10.0-e35f27394290-for-next+ #23 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0xbc/0xf9 ? btrfs_backref_cleanup_node+0x18a/0x420 print_address_description.constprop.8+0x21/0x210 ? record_print_text.cold.34+0x11/0x11 ? btrfs_backref_cleanup_node+0x18a/0x420 ? btrfs_backref_cleanup_node+0x18a/0x420 kasan_report.cold.10+0x20/0x37 ? btrfs_backref_cleanup_node+0x18a/0x420 __asan_load8+0x69/0x90 btrfs_backref_cleanup_node+0x18a/0x420 btrfs_backref_release_cache+0x83/0x1b0 relocate_block_group+0x394/0x780 ? merge_reloc_roots+0x4a0/0x4a0 btrfs_relocate_block_group+0x26e/0x4c0 btrfs_relocate_chunk+0x52/0x120 btrfs_balance+0xe2e/0x1900 ? check_flags.part.50+0x6c/0x1e0 ? btrfs_relocate_chunk+0x120/0x120 ? kmem_cache_alloc_trace+0xa06/0xcb0 ? _copy_from_user+0x83/0xc0 btrfs_ioctl_balance+0x3a7/0x460 btrfs_ioctl+0x24c8/0x4360 ? __kasan_check_read+0x11/0x20 ? check_chain_key+0x1f4/0x2f0 ? __asan_loadN+0xf/0x20 ? btrfs_ioctl_get_supported_features+0x30/0x30 ? kvm_sched_clock_read+0x18/0x30 ? check_chain_key+0x1f4/0x2f0 ? lock_downgrade+0x3f0/0x3f0 ? handle_mm_fault+0xad6/0x2150 ? do_vfs_ioctl+0xfc/0x9d0 ? ioctl_file_clone+0xe0/0xe0 ? check_flags.part.50+0x6c/0x1e0 ? check_flags.part.50+0x6c/0x1e0 ? check_flags+0x26/0x30 ? lock_is_held_type+0xc3/0xf0 ? syscall_enter_from_user_mode+0x1b/0x60 ? do_syscall_64+0x13/0x80 ? rcu_read_lock_sched_held+0xa1/0xd0 ? __kasan_check_read+0x11/0x20 ? __fget_light+0xae/0x110 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f4c4bdfe427 Allocated by task 28836: kasan_save_stack+0x21/0x50 __kasan_kmalloc.constprop.18+0xbe/0xd0 kasan_kmalloc+0x9/0x10 kmem_cache_alloc_trace+0x410/0xcb0 btrfs_backref_alloc_node+0x46/0xf0 btrfs_backref_add_tree_node+0x60d/0x11d0 build_backref_tree+0xc5/0x700 relocate_tree_blocks+0x2be/0xb90 relocate_block_group+0x2eb/0x780 btrfs_relocate_block_group+0x26e/0x4c0 btrfs_relocate_chunk+0x52/0x120 btrfs_balance+0xe2e/0x1900 btrfs_ioctl_balance+0x3a7/0x460 btrfs_ioctl+0x24c8/0x4360 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 28836: kasan_save_stack+0x21/0x50 kasan_set_track+0x20/0x30 kasan_set_free_info+0x1f/0x30 __kasan_slab_free+0xf3/0x140 kasan_slab_free+0xe/0x10 kfree+0xde/0x200 btrfs_backref_error_cleanup+0x452/0x530 build_backref_tree+0x1a5/0x700 relocate_tree_blocks+0x2be/0xb90 relocate_block_group+0x2eb/0x780 btrfs_relocate_block_group+0x26e/0x4c0 btrfs_relocate_chunk+0x52/0x120 btrfs_balance+0xe2e/0x1900 btrfs_ioctl_balance+0x3a7/0x460 btrfs_ioctl+0x24c8/0x4360 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This occurred because we freed our backref node in btrfs_backref_error_cleanup(), but then tried to free it again in btrfs_backref_release_cache(). This is because btrfs_backref_release_cache() will cycle through all of the cache->leaves nodes and free them up. However btrfs_backref_error_cleanup() freed the backref node with btrfs_backref_free_node(), which simply kfree()d the backref node without unlinking it from the cache. Change this to a btrfs_backref_drop_node(), which does the appropriate cleanup and removes the node from the cache->leaves list, so when we go to free the remaining cache we don't trip over items we've already dropped. Fixes: 75bfb9aff45e ("Btrfs: cleanup error handling in build_backref_tree") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-18btrfs: don't get an EINTR during drop_snapshot for relocJosef Bacik
This was partially fixed by f3e3d9cc3525 ("btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree"), however it missed a spot when we restart a trans handle because we need to end the transaction. The fix is the same, simply use btrfs_join_transaction() instead of btrfs_start_transaction() when deleting reloc roots. Fixes: f3e3d9cc3525 ("btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-18tools: gpio: fix %llu warning in gpio-watch.cKent Gibson
Some platforms, such as mips64, don't map __u64 to long long unsigned int so using %llu produces a warning: gpio-watch.c: In function ‘main’: gpio-watch.c:89:30: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 4 has type ‘__u64’ {aka ‘long unsigned int’} [-Wformat=] 89 | printf("line %u: %s at %llu\n", | ~~~^ | | | long long unsigned int | %lu 90 | chg.info.offset, event, chg.timestamp_ns); | ~~~~~~~~~~~~~~~~ | | | __u64 {aka long unsigned int} Replace the %llu with PRIu64 and cast the argument to uint64_t. Fixes: 33f0c47b8fb4 ("tools: gpio: implement gpio-watch") Signed-off-by: Kent Gibson <warthog618@gmail.com> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
2021-01-18tools: gpio: fix %llu warning in gpio-event-mon.cKent Gibson
Some platforms, such as mips64, don't map __u64 to long long unsigned int so using %llu produces a warning: gpio-event-mon.c:110:37: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘__u64’ {aka ‘long unsigned int’} [-Wformat=] 110 | fprintf(stdout, "GPIO EVENT at %llu on line %d (%d|%d) ", | ~~~^ | | | long long unsigned int | %lu 111 | event.timestamp_ns, event.offset, event.line_seqno, | ~~~~~~~~~~~~~~~~~~ | | | __u64 {aka long unsigned int} Replace the %llu with PRIu64 and cast the argument to uint64_t. Fixes: 03fd11b03362 ("tools/gpio/gpio-event-mon: fix warning") Signed-off-by: Kent Gibson <warthog618@gmail.com> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
2021-01-18x86/xen: fix 'nopvspin' build errorRandy Dunlap
Fix build error in x86/xen/ when PARAVIRT_SPINLOCKS is not enabled. Fixes this build error: ../arch/x86/xen/smp_hvm.c: In function ‘xen_hvm_smp_init’: ../arch/x86/xen/smp_hvm.c:77:3: error: ‘nopvspin’ undeclared (first use in this function) nopvspin = true; Fixes: 3d7746bea925 ("x86/xen: Fix xen_hvm_smp_init() when vector callback not available") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20210115191123.27572-1-rdunlap@infradead.org Signed-off-by: Juergen Gross <jgross@suse.com>
2021-01-17Linux 5.11-rc4v5.11-rc4Linus Torvalds
2021-01-17Merge tag 'perf-tools-fixes-2021-01-17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux Pull perf tools fixes from Arnaldo Carvalho de Melo: - Fix 'CPU too large' error in Intel PT - Correct event attribute sizes in 'perf inject' - Sync build_bug.h and kvm.h kernel copies - Fix bpf.h header include directive in 5sec.c 'perf trace' bpf example - libbpf tests fixes - Fix shadow stat 'perf test' for non-bash shells - Take cgroups into account for shadow stats in 'perf stat' * tag 'perf-tools-fixes-2021-01-17' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: perf inject: Correct event attribute sizes perf intel-pt: Fix 'CPU too large' error perf stat: Take cgroups into account for shadow stats perf stat: Introduce struct runtime_stat_data libperf tests: Fail when failing to get a tracepoint id libperf tests: If a test fails return non-zero libperf tests: Avoid uninitialized variable warning perf test: Fix shadow stat test for non-bash shells tools headers: Syncronize linux/build_bug.h with the kernel sources tools headers UAPI: Sync kvm.h headers with the kernel sources perf bpf examples: Fix bpf.h header include directive in 5sec.c example
2021-01-17Merge tag 'powerpc-5.11-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "One fix for a lack of alignment in our linker script, that can lead to crashes depending on configuration etc. One fix for the 32-bit VDSO after the C VDSO conversion. Thanks to Andreas Schwab, Ariel Marcovitch, and Christophe Leroy" * tag 'powerpc-5.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/vdso: Fix clock_gettime_fallback for vdso32 powerpc: Fix alignment bug within the init sections
2021-01-17Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull misc vfs fixes from Al Viro: "Several assorted fixes. I still think that audit ->d_name race is better fixed this way for the benefit of backports, with any possibly fancier variants done on top of it" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: dump_common_audit_data(): fix racy accesses to ->d_name iov_iter: fix the uaccess area in copy_compat_iovec_from_user umount(2): move the flag validity checks first
2021-01-17mm: don't put pinned pages into the swap cacheLinus Torvalds
So technically there is nothing wrong with adding a pinned page to the swap cache, but the pinning obviously means that the page can't actually be free'd right now anyway, so it's a bit pointless. However, the real problem is not with it being a bit pointless: the real issue is that after we've added it to the swap cache, we'll try to unmap the page. That will succeed, because the code in mm/rmap.c doesn't know or care about pinned pages. Even the unmapping isn't fatal per se, since the page will stay around in memory due to the pinning, and we do hold the connection to it using the swap cache. But when we then touch it next and take a page fault, the logic in do_swap_page() will map it back into the process as a possibly read-only page, and we'll then break the page association on the next COW fault. Honestly, this issue could have been fixed in any of those other places: (a) we could refuse to unmap a pinned page (which makes conceptual sense), or (b) we could make sure to re-map a pinned page writably in do_swap_page(), or (c) we could just make do_wp_page() not COW the pinned page (which was what we historically did before that "mm: do_wp_page() simplification" commit). But while all of them are equally valid models for breaking this chain, not putting pinned pages into the swap cache in the first place is the simplest one by far. It's also the safest one: the reason why do_wp_page() was changed in the first place was that getting the "can I re-use this page" wrong is so fraught with errors. If you do it wrong, you end up with an incorrectly shared page. As a result, using "page_maybe_dma_pinned()" in either do_wp_page() or do_swap_page() would be a serious bug since it is only a (very good) heuristic. Re-using the page requires a hard black-and-white rule with no room for ambiguity. In contrast, saying "this page is very likely dma pinned, so let's not add it to the swap cache and try to unmap it" is an obviously safe thing to do, and if the heuristic might very rarely be a false positive, no harm is done. Fixes: 09854ba94c6a ("mm: do_wp_page() simplification") Reported-and-tested-by: Martin Raiber <martin@urbackup.org> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-17x86/hyperv: Initialize clockevents after LAPIC is initializedDexuan Cui
With commit 4df4cb9e99f8, the Hyper-V direct-mode STIMER is actually initialized before LAPIC is initialized: see apic_intr_mode_init() x86_platform.apic_post_init() hyperv_init() hv_stimer_alloc() apic_bsp_setup() setup_local_APIC() setup_local_APIC() temporarily disables LAPIC, initializes it and re-eanble it. The direct-mode STIMER depends on LAPIC, and when it's registered, it can be programmed immediately and the timer can fire very soon: hv_stimer_init clockevents_config_and_register clockevents_register_device tick_check_new_device tick_setup_device tick_setup_periodic(), tick_setup_oneshot() clockevents_program_event When the timer fires in the hypervisor, if the LAPIC is in the disabled state, new versions of Hyper-V ignore the event and don't inject the timer interrupt into the VM, and hence the VM hangs when it boots. Note: when the VM starts/reboots, the LAPIC is pre-enabled by the firmware, so the window of LAPIC being temporarily disabled is pretty small, and the issue can only happen once out of 100~200 reboots for a 40-vCPU VM on one dev host, and on another host the issue doesn't reproduce after 2000 reboots. The issue is more noticeable for kdump/kexec, because the LAPIC is disabled by the first kernel, and stays disabled until the kdump/kexec kernel enables it. This is especially an issue to a Generation-2 VM (for which Hyper-V doesn't emulate the PIT timer) when CONFIG_HZ=1000 (rather than CONFIG_HZ=250) is used. Fix the issue by moving hv_stimer_alloc() to a later place where the LAPIC timer is initialized. Fixes: 4df4cb9e99f8 ("x86/hyperv: Initialize clockevents earlier in CPU onlining") Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20210116223136.13892-1-decui@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-01-17ia64: fix build failure caused by memory model changesMike Rapoport
The change of ia64's default memory model to SPARSEMEM causes defconfig build to fail: CC kernel/async.o In file included from include/linux/numa.h:25, from include/linux/async.h:13, from kernel/async.c:47: arch/ia64/include/asm/sparsemem.h:14:40: warning: "PAGE_SHIFT" is not defined, evaluates to 0 [-Wundef] 14 | #if ((CONFIG_FORCE_MAX_ZONEORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS) | ^~~~~~~~~~ In file included from include/linux/gfp.h:6, from include/linux/xarray.h:14, from include/linux/radix-tree.h:19, from include/linux/idr.h:15, from include/linux/kernfs.h:13, from include/linux/sysfs.h:16, from include/linux/kobject.h:20, from include/linux/energy_model.h:7, from include/linux/device.h:16, from include/linux/async.h:14, from kernel/async.c:47: include/linux/mmzone.h:1156:2: error: #error Allocator MAX_ORDER exceeds SECTION_SIZE 1156 | #error Allocator MAX_ORDER exceeds SECTION_SIZE | ^~~~~ The error cause is the missing definition of PAGE_SHIFT in the calculation of SECTION_SIZE_BITS. Add include of <asm/page.h> to arch/ia64/include/asm/sparsemem.h to solve the problem. Fixes: 214496cb1870 ("ia64: make SPARSEMEM default and disable DISCONTIGMEM") Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
2021-01-16skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() tooAlexander Lobakin
Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs") ensured that skbs with data size lower than 1025 bytes will be kmalloc'ed to avoid excessive page cache fragmentation and memory consumption. However, the fix adressed only __napi_alloc_skb() (primarily for virtio_net and napi_get_frags()), but the issue can still be achieved through __netdev_alloc_skb(), which is still used by several drivers. Drivers often allocate a tiny skb for headers and place the rest of the frame to frags (so-called copybreak). Mirror the condition to __netdev_alloc_skb() to handle this case too. Since v1 [0]: - fix "Fixes:" tag; - refine commit message (mention copybreak usecase). [0] https://lore.kernel.org/netdev/20210114235423.232737-1-alobakin@pm.me Fixes: a1c7fff7e18f ("net: netdev_alloc_skb() use build_skb()") Signed-off-by: Alexander Lobakin <alobakin@pm.me> Link: https://lore.kernel.org/r/20210115150354.85967-1-alobakin@pm.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-16Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Nine minor fixes, seven in drivers and two in the core SCSI disk driver (sd) which should be harmless involving removing an unused variable and quietening a spurious warning" Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: sd: Remove obsolete variable in sd_remove() scsi: sd: Suppress spurious errors when WRITE SAME is being disabled scsi: scsi_debug: Fix memleak in scsi_debug_init() scsi: mpt3sas: Fix spelling mistake in Kconfig "compatiblity" -> "compatibility" scsi: qedi: Correct max length of CHAP secret scsi: ufs: Correct the LUN used in eh_device_reset_handler() callback scsi: ufs: Relocate flush of exceptional event scsi: ufs: Relax the condition of UFSHCI_QUIRK_SKIP_MANUAL_WB_FLUSH_CTRL scsi: ufs: Fix possible power drain during system suspend
2021-01-16dump_common_audit_data(): fix racy accesses to ->d_nameAl Viro
We are not guaranteed the locking environment that would prevent dentry getting renamed right under us. And it's possible for old long name to be freed after rename, leading to UAF here. Cc: stable@kernel.org # v2.6.2+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-16Merge tag 'block-5.11-2021-01-16' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Just an nvme pull request via Christoph: - don't initialize hwmon for discover controllers (Sagi Grimberg) - fix iov_iter handling in nvme-tcp (Sagi Grimberg) - fix a preempt warning in nvme-tcp (Sagi Grimberg) - fix a possible NULL pointer dereference in nvme (Israel Rukshin)" * tag 'block-5.11-2021-01-16' of git://git.kernel.dk/linux-block: nvme: don't intialize hwmon for discovery controllers nvme-tcp: fix possible data corruption with bio merges nvme-tcp: Fix warning with CONFIG_DEBUG_PREEMPT nvmet-rdma: Fix NULL deref when setting pi_enable and traddr INADDR_ANY
2021-01-16Merge tag 'io_uring-5.11-2021-01-16' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "We still have a pending fix for a cancelation issue, but it's still being investigated. In the meantime: - Dead mm handling fix (Pavel) - SQPOLL setup error handling (Pavel) - Flush timeout sequence fix (Marcelo) - Missing finish_wait() for one exit case" * tag 'io_uring-5.11-2021-01-16' of git://git.kernel.dk/linux-block: io_uring: ensure finish_wait() is always called in __io_uring_task_cancel() io_uring: flush timeouts that should already have expired io_uring: do sqo disable on install_fd error io_uring: fix null-deref in io_disable_sqo_submit io_uring: don't take files/mm for a dead task io_uring: drop mm and files after task_work_run
2021-01-16Merge tag 'riscv-for-linus-5.11-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: "There are a few more fixes than a normal rc4, largely due to the bubble introduced by the holiday break: - return -ENOSYS for syscall number -1, which previously returned an uninitialized value. - ensure of_clk_init() has been called in time_init(), without which clock drivers may not be initialized. - fix sifive,uart0 driver to properly display the baud rate. A fix to initialize MPIE that allows interrupts to be processed during system calls. - avoid erronously begin tracing IRQs when interrupts are disabled, which at least triggers suprious lockdep failures. - workaround for a warning related to calling smp_processor_id() while preemptible. The warning itself is suprious on currently availiable systems. - properly include the generic time VDSO calls. A fix to our kasan address mapping. A fix to the HiFive Unleashed device tree, which allows the Ethernet PHY to be properly initialized by Linux (as opposed to relying on the bootloader). - defconfig update to include SiFive's GPIO driver, which is present on the HiFive Unleashed and necessary to initialize the PHY. - avoid allocating memory while initializing reserved memory. - avoid allocating the last 4K of memory, as pointers there alias with syscall errors. There are also two cleanups that should have no functional effect but do fix build warnings: - drop a duplicated definition of PAGE_KERNEL_EXEC. - properly declare the asm register SP shim. - cleanup the rv32 memory size Kconfig entry, to reflect the actual size of memory availiable" * tag 'riscv-for-linus-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: RISC-V: Fix maximum allowed phsyical memory for RV32 RISC-V: Set current memblock limit RISC-V: Do not allocate memblock while iterating reserved memblocks riscv: stacktrace: Move register keyword to beginning of declaration riscv: defconfig: enable gpio support for HiFive Unleashed dts: phy: add GPIO number and active state used for phy reset dts: phy: fix missing mdio device and probe failure of vsc8541-01 device riscv: Fix KASAN memory mapping. riscv: Fixup CONFIG_GENERIC_TIME_VSYSCALL riscv: cacheinfo: Fix using smp_processor_id() in preemptible riscv: Trace irq on only interrupt is enabled riscv: Drop a duplicated PAGE_KERNEL_EXEC riscv: Enable interrupts during syscalls with M-Mode riscv: Fix sifive serial driver riscv: Fix kernel time_init() riscv: return -ENOSYS for syscall -1
2021-01-16mm: don't play games with pinned pages in clear_page_refsLinus Torvalds
Turning a pinned page read-only breaks the pinning after COW. Don't do it. The whole "track page soft dirty" state doesn't work with pinned pages anyway, since the page might be dirtied by the pinning entity without ever being noticed in the page tables. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-16mm: fix clear_refs_write lockingLinus Torvalds
Turning page table entries read-only requires the mmap_sem held for writing. So stop doing the odd games with turning things from read locks to write locks and back. Just get the write lock. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-15RISC-V: Fix maximum allowed phsyical memory for RV32Atish Patra
Linux kernel can only map 1GB of address space for RV32 as the page offset is set to 0xC0000000. The current description in the Kconfig is confusing as it indicates that RV32 can support 2GB of physical memory. That is simply not true for current kernel. In future, a 2GB split support can be added to allow 2GB physical address space. Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Atish Patra <atish.patra@wdc.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-01-15RISC-V: Set current memblock limitAtish Patra
Currently, linux kernel can not use last 4k bytes of addressable space because IS_ERR_VALUE macro treats those as an error. This will be an issue for RV32 as any memblock allocator potentially allocate chunk of memory from the end of DRAM (2GB) leading bad address error even though the address was technically valid. Fix this issue by limiting the memblock if available memory spans the entire address space. Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Atish Patra <atish.patra@wdc.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-01-15RISC-V: Do not allocate memblock while iterating reserved memblocksAtish Patra
Currently, resource tree allocates memory blocks while iterating on the list. It leads to following kernel warning because memblock allocation also invokes memory block reservation API. [ 0.000000] ------------[ cut here ]------------ [ 0.000000] WARNING: CPU: 0 PID: 0 at kernel/resource.c:795 __insert_resource+0x8e/0xd0 [ 0.000000] Modules linked in: [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0-00022-ge20097fb37e2-dirty #549 [ 0.000000] epc: c00125c2 ra : c001262c sp : c1c01f50 [ 0.000000] gp : c1d456e0 tp : c1c0a980 t0 : ffffcf20 [ 0.000000] t1 : 00000000 t2 : 00000000 s0 : c1c01f60 [ 0.000000] s1 : ffffcf00 a0 : ffffff00 a1 : c1c0c0c4 [ 0.000000] a2 : 80c12b15 a3 : 80402000 a4 : 80402000 [ 0.000000] a5 : c1c0c0c4 a6 : 80c12b15 a7 : f5faf600 [ 0.000000] s2 : c1c0c0c4 s3 : c1c0e000 s4 : c1009a80 [ 0.000000] s5 : c1c0c000 s6 : c1d48000 s7 : c1613b4c [ 0.000000] s8 : 00000fff s9 : 80000200 s10: c1613b40 [ 0.000000] s11: 00000000 t3 : c1d4a000 t4 : ffffffff This is also unnecessary as we can pre-compute the total memblocks required for each memory region and allocate it before the loop. It save precious boot time not going through memblock allocation code every time. Fixes: 00ab027a3b82 ("RISC-V: Add kernel image sections to the resource tree") Reviewed-by: Anup Patel <anup@brainfault.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Atish Patra <atish.patra@wdc.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-01-15iov_iter: fix the uaccess area in copy_compat_iovec_from_userChristoph Hellwig
sizeof needs to be called on the compat pointer, not the native one. Fixes: 89cd35c58bc2 ("iov_iter: transparently handle compat iovecs in import_iovec") Reported-by: David Laight <David.Laight@ACULAB.COM> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-15net_sched: avoid shift-out-of-bounds in tcindex_set_parms()Eric Dumazet
tc_index being 16bit wide, we need to check that TCA_TCINDEX_SHIFT attribute is not silly. UBSAN: shift-out-of-bounds in net/sched/cls_tcindex.c:260:29 shift exponent 255 is too large for 32-bit type 'int' CPU: 0 PID: 8516 Comm: syz-executor228 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 valid_perfect_hash net/sched/cls_tcindex.c:260 [inline] tcindex_set_parms.cold+0x1b/0x215 net/sched/cls_tcindex.c:425 tcindex_change+0x232/0x340 net/sched/cls_tcindex.c:546 tc_new_tfilter+0x13fb/0x21b0 net/sched/cls_api.c:2127 rtnetlink_rcv_msg+0x8b6/0xb80 net/core/rtnetlink.c:5555 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0x907/0xe40 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2336 ___sys_sendmsg+0xf3/0x170 net/socket.c:2390 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2423 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20210114185229.1742255-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15net_sched: gen_estimator: support large ewma logEric Dumazet
syzbot report reminded us that very big ewma_log were supported in the past, even if they made litle sense. tc qdisc replace dev xxx root est 1sec 131072sec ... While fixing the bug, also add boundary checks for ewma_log, in line with range supported by iproute2. UBSAN: shift-out-of-bounds in net/core/gen_estimator.c:83:38 shift exponent -1 is negative CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 est_timer.cold+0xbb/0x12d net/core/gen_estimator.c:83 call_timer_fn+0x1a5/0x710 kernel/time/timer.c:1417 expire_timers kernel/time/timer.c:1462 [inline] __run_timers.part.0+0x692/0xa80 kernel/time/timer.c:1731 __run_timers kernel/time/timer.c:1712 [inline] run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1744 __do_softirq+0x2bc/0xa77 kernel/softirq.c:343 asm_call_irq_on_stack+0xf/0x20 </IRQ> __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline] run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline] do_softirq_own_stack+0xaa/0xd0 arch/x86/kernel/irq_64.c:77 invoke_softirq kernel/softirq.c:226 [inline] __irq_exit_rcu+0x17f/0x200 kernel/softirq.c:420 irq_exit_rcu+0x5/0x20 kernel/softirq.c:432 sysvec_apic_timer_interrupt+0x4d/0x100 arch/x86/kernel/apic/apic.c:1096 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:628 RIP: 0010:native_save_fl arch/x86/include/asm/irqflags.h:29 [inline] RIP: 0010:arch_local_save_flags arch/x86/include/asm/irqflags.h:79 [inline] RIP: 0010:arch_irqs_disabled arch/x86/include/asm/irqflags.h:169 [inline] RIP: 0010:acpi_safe_halt drivers/acpi/processor_idle.c:111 [inline] RIP: 0010:acpi_idle_do_entry+0x1c9/0x250 drivers/acpi/processor_idle.c:516 Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate estimators") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20210114181929.1717985-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15net_sched: reject silly cell_log in qdisc_get_rtab()Eric Dumazet
iproute2 probably never goes beyond 8 for the cell exponent, but stick to the max shift exponent for signed 32bit. UBSAN reported: UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22 shift exponent 130 is too large for 32-bit type 'int' CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x183/0x22e lib/dump_stack.c:120 ubsan_epilogue lib/ubsan.c:148 [inline] __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg net/socket.c:672 [inline] ____sys_sendmsg+0x5a2/0x900 net/socket.c:2345 ___sys_sendmsg net/socket.c:2399 [inline] __sys_sendmsg+0x319/0x400 net/socket.c:2432 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Cong Wang <cong.wang@bytedance.com> Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15Merge tag 'for-5.11/dm-fixes-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM-raid's raid1 discard limits so discards work. - Select missing Kconfig dependencies for DM integrity and zoned targets. - Four fixes for DM crypt target's support to optionally bypass kcryptd workqueues. - Fix DM snapshot merge supports missing data flushes before committing metadata. - Fix DM integrity data device flushing when external metadata is used. - Fix DM integrity's maximum number of supported constructor arguments that user can request when creating an integrity device. - Eliminate DM core ioctl logging noise when an ioctl is issued without required CAP_SYS_RAWIO permission. * tag 'for-5.11/dm-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm crypt: defer decryption to a tasklet if interrupts disabled dm integrity: fix the maximum number of arguments dm crypt: do not call bio_endio() from the dm-crypt tasklet dm integrity: fix flush with external metadata device dm: eliminate potential source of excessive kernel log noise dm snapshot: flush merged data before committing metadata dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq dm crypt: do not wait for backlogged crypto request completion in softirq dm zoned: select CONFIG_CRC32 dm integrity: select CRYPTO_SKCIPHER dm raid: fix discard limits for raid1
2021-01-15Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf 2021-01-16 1) Fix a double bpf_prog_put() for BPF_PROG_{TYPE_EXT,TYPE_TRACING} types in link creation's error path causing a refcount underflow, from Jiri Olsa. 2) Fix BTF validation errors for the case where kernel modules don't declare any new types and end up with an empty BTF, from Andrii Nakryiko. 3) Fix BPF local storage helpers to first check their {task,inode} owners for being NULL before access, from KP Singh. 4) Fix a memory leak in BPF setsockopt handling for the case where optlen is zero and thus temporary optval buffer should be freed, from Stanislav Fomichev. 5) Fix a syzbot memory allocation splat in BPF_PROG_TEST_RUN infra for raw_tracepoint caused by too big ctx_size_in, from Song Liu. 6) Fix LLVM code generation issues with verifier where PTR_TO_MEM{,_OR_NULL} registers were spilled to stack but not recognized, from Gilad Reti. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: MAINTAINERS: Update my email address selftests/bpf: Add verifier test for PTR_TO_MEM spill bpf: Support PTR_TO_MEM{,_OR_NULL} register spilling bpf: Reject too big ctx_size_in for raw_tp test run libbpf: Allow loading empty BTFs bpf: Allow empty module BTFs bpf: Don't leak memory in bpf getsockopt when optlen == 0 bpf: Update local storage test to check handling of null ptrs bpf: Fix typo in bpf_inode_storage.c bpf: Local storage helpers should check nullness of owner ptr passed bpf: Prevent double bpf_prog_put call from bpf_tracing_prog_attach ==================== Link: https://lore.kernel.org/r/20210116002025.15706-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15cls_flower: call nla_ok() before nla_next()Cong Wang
fl_set_enc_opt() simply checks if there are still bytes left to parse, but this is not sufficent as syzbot seems to be able to generate malformatted netlink messages. nla_ok() is more strict so should be used to validate the next nlattr here. And nla_validate_nested_deprecated() has less strict check too, it is probably too late to switch to the strict version, but we can just call nla_ok() too after it. Reported-and-tested-by: syzbot+2624e3778b18fc497c92@syzkaller.appspotmail.com Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options") Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Link: https://lore.kernel.org/r/20210115185024.72298-1-xiyou.wangcong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "10 patches. Subsystems affected by this patch series: MAINTAINERS and mm (slub, pagealloc, memcg, kasan, vmalloc, migration, hugetlb, memory-failure, and process_vm_access)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/process_vm_access.c: include compat.h mm,hwpoison: fix printing of page flags MAINTAINERS: add Vlastimil as slab allocators maintainer mm/hugetlb: fix potential missing huge page size info mm: migrate: initialize err in do_migrate_pages mm/vmalloc.c: fix potential memory leak arm/kasan: fix the array size of kasan_early_shadow_pte[] mm/memcontrol: fix warning in mem_cgroup_page_lruvec() mm/page_alloc: add a missing mm_page_alloc_zone_locked() tracepoint mm, slub: consider rest of partial list if acquire_slab() fails
2021-01-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma fixes from Jason Gunthorpe: "A fairly modest set of bug fixes, nothing abnormal from the merge window The ucma patch is a bit on the larger side, but given the regression was recently added I've opted to forward it to the rc stream. - Fix a ucma memory leak introduced in v5.9 while fixing the Syzkaller bugs - Don't fail when the xarray wraps for user verbs objects - User triggerable oops regression from the umem page size rework - Error unwind bugs in usnic, ocrdma, mlx5 and cma" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: RDMA/cma: Fix error flow in default_roce_mode_store RDMA/mlx5: Fix wrong free of blue flame register on error IB/mlx5: Fix error unwinding when set_has_smi_cap fails RDMA/umem: Avoid undefined behavior of rounddown_pow_of_two() RDMA/ocrdma: Fix use after free in ocrdma_dealloc_ucontext_pd() RDMA/usnic: Fix memleak in find_free_vf_and_create_qp_grp RDMA/restrack: Don't treat as an error allocation ID wrapping RDMA/ucma: Do not miss ctx destruction steps in some cases
2021-01-15io_uring: ensure finish_wait() is always called in __io_uring_task_cancel()Jens Axboe
If we enter with requests pending and performm cancelations, we'll have a different inflight count before and after calling prepare_to_wait(). This causes the loop to restart. If we actually ended up canceling everything, or everything completed in-between, then we'll break out of the loop without calling finish_wait() on the waitqueue. This can trigger a warning on exit_signals(), as we leave the task state in TASK_UNINTERRUPTIBLE. Put a finish_wait() after the loop to catch that case. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-15MAINTAINERS: Update my email addressBjörn Töpel
My Intel email will stop working in a not too distant future. Move my MAINTAINERS entries to my kernel.org address. Signed-off-by: Björn Töpel <bjorn@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210115104337.7751-1-bjorn.topel@gmail.com
2021-01-15Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "A number of bug fixes for ext4: - Fix for the new fast_commit feature - Fix some error handling codepaths in whiteout handling and mountpoint sampling - Fix how we write ext4_error information so it goes through the journal when journalling is active, to avoid races that can lead to lost error information, superblock checksum failures, or DIF/DIX features" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: remove expensive flush on fast commit ext4: fix bug for rename with RENAME_WHITEOUT ext4: fix wrong list_splice in ext4_fc_cleanup ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR ext4: don't leak old mountpoint samples ext4: drop ext4_handle_dirty_super() ext4: fix superblock checksum failure when setting password salt ext4: use sbi instead of EXT4_SB(sb) in ext4_update_super() ext4: save error info to sb through journal if available ext4: protect superblock modifications with a buffer lock ext4: drop sync argument of ext4_commit_super() ext4: combine ext4_handle_error() and save_error_info()
2021-01-15Merge tag '5.11-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Two small cifs fixes for stable (including an important handle leak fix) and three small cleanup patches" * tag '5.11-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6: cifs: style: replace one-element array with flexible-array cifs: connect: style: Simplify bool comparison fs: cifs: remove unneeded variable in smb3_fs_context_dup cifs: fix interrupted close commands cifs: check pointer before freeing
2021-01-15Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Catalin Marinas: - Set the minimum GCC version to 5.1 for arm64 due to earlier compiler bugs. - Make atomic helpers __always_inline to avoid a section mismatch when compiling with clang. - Fix the CMA and crashkernel reservations to use ZONE_DMA (remove the arm64_dma32_phys_limit variable, no longer needed with a dynamic ZONE_DMA sizing in 5.11). - Remove redundant IRQ flag tracing that was leaving lockdep inconsistent with the hardware state. - Revert perf events based hard lockup detector that was causing smp_processor_id() to be called in preemptible context. - Some trivial cleanups - spelling fix, renaming S_FRAME_SIZE to PT_REGS_SIZE, function prototypes added. * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: selftests: Fix spelling of 'Mismatch' arm64: syscall: include prototype for EL0 SVC functions compiler.h: Raise minimum version of GCC to 5.1 for arm64 arm64: make atomic helpers __always_inline arm64: rename S_FRAME_SIZE to PT_REGS_SIZE Revert "arm64: Enable perf events based hard lockup detector" arm64: entry: remove redundant IRQ flag tracing arm64: Remove arm64_dma32_phys_limit and its uses
2021-01-15Merge tag 'mips_fixes_5.11.1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull MIPS fixes from Thomas Bogendoerfer: - fix coredumps on 64bit kernels - fix for alignment bugs preventing booting - fix checking for failed irq_alloc_desc calls * tag 'mips_fixes_5.11.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: MIPS: OCTEON: fix unreachable code in octeon_irq_init_ciu MIPS: relocatable: fix possible boot hangup with KASLR enabled MIPS: Fix malformed NT_FILE and NT_SIGINFO in 32bit coredumps MIPS: boot: Fix unaligned access with CONFIG_MIPS_RAW_APPENDED_DTB
2021-01-15perf inject: Correct event attribute sizesAl Grant
When 'perf inject' reads a perf.data file from an older version of perf, it writes event attributes into the output with the original size field, but lays them out as if they had the size currently used. Readers see a corrupt file. Update the size field to match the layout. Signed-off-by: Al Grant <al.grant@foss.arm.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20201124195818.30603-1-al.grant@arm.com Signed-off-by: Denis Nikitin <denik@chromium.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2021-01-15perf intel-pt: Fix 'CPU too large' errorAdrian Hunter
In some cases, the number of cpus (nr_cpus_online) is confused with the maximum cpu number (nr_cpus_avail), which results in the error in the example below: Example on system with 8 cpus: Before: # echo 0 > /sys/devices/system/cpu/cpu2/online # ./perf record --kcore -e intel_pt// taskset --cpu-list 7 uname Linux [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.147 MB perf.data ] # ./perf script --itrace=e Requested CPU 7 too large. Consider raising MAX_NR_CPUS 0x25908 [0x8]: failed to process type: 68 [Invalid argument] After: # ./perf script --itrace=e # Fixes: 8c7274691f0d ("perf machine: Replace MAX_NR_CPUS with perf_env::nr_cpus_online") Fixes: 7df4e36a4785 ("perf session: Replace MAX_NR_CPUS with perf_env::nr_cpus_online") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Tested-by: Kan Liang <kan.liang@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: stable@vger.kernel.org Link: http://lore.kernel.org/lkml/20210107174159.24897-1-adrian.hunter@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2021-01-15perf stat: Take cgroups into account for shadow statsNamhyung Kim
As of now it doesn't consider cgroups when collecting shadow stats and metrics so counter values from different cgroups will be saved in a same slot. This resulted in incorrect numbers when those cgroups have different workloads. For example, let's look at the scenario below: cgroups A and C runs same workload which burns a cpu while cgroup B runs a light workload. $ perf stat -a -e cycles,instructions --for-each-cgroup A,B,C sleep 1 Performance counter stats for 'system wide': 3,958,116,522 cycles A 6,722,650,929 instructions A # 2.53 insn per cycle 1,132,741 cycles B 571,743 instructions B # 0.00 insn per cycle 4,007,799,935 cycles C 6,793,181,523 instructions C # 2.56 insn per cycle 1.001050869 seconds time elapsed When I run 'perf stat' with single workload, it usually shows IPC around 1.7. We can verify it (6,722,650,929.0 / 3,958,116,522 = 1.698) for cgroup A. But in this case, since cgroups are ignored, cycles are averaged so it used the lower value for IPC calculation and resulted in around 2.5. avg cycle: (3958116522 + 1132741 + 4007799935) / 3 = 2655683066 IPC (A) : 6722650929 / 2655683066 = 2.531 IPC (B) : 571743 / 2655683066 = 0.0002 IPC (C) : 6793181523 / 2655683066 = 2.557 We can simply compare cgroup pointers in the evsel and it'll be NULL when cgroups are not specified. With this patch, I can see correct numbers like below: $ perf stat -a -e cycles,instructions --for-each-cgroup A,B,C sleep 1 Performance counter stats for 'system wide': 4,171,051,687 cycles A 7,219,793,922 instructions A # 1.73 insn per cycle 1,051,189 cycles B 583,102 instructions B # 0.55 insn per cycle 4,171,124,710 cycles C 7,192,944,580 instructions C # 1.72 insn per cycle 1.007909814 seconds time elapsed Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20210115071139.257042-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2021-01-15perf stat: Introduce struct runtime_stat_dataNamhyung Kim
To pass more info to the saved_value in the runtime_stat, add a new struct runtime_stat_data. Currently it only has 'ctx' field but later patch will add more. Note that we intentionally pass 0 as ctx to clock-related events for compatibility. It was already there in a few places. So move the code into the saved_value_lookup() explicitly and add a comment. Suggested-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20210115071139.257042-1-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2021-01-15libperf tests: Fail when failing to get a tracepoint idIan Rogers
Permissions are necessary to get a tracepoint id. Fail the test when the read fails. Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20210114180250.3853825-2-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2021-01-15libperf tests: If a test fails return non-zeroIan Rogers
If a test fails return -1 rather than 0. This is consistent with the return value in test-cpumap.c Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20210114180250.3853825-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>