summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2021-03-21signal: don't allow STOP on PF_IO_WORKER threadsEric W. Biederman
Just like we don't allow normal signals to IO threads, don't deliver a STOP to a task that has PF_IO_WORKER set. The IO threads don't take signals in general, and have no means of flushing out a stop either. Longer term, we may want to look into allowing stop of these threads, as it relates to eg process freezing. For now, this prevents a spin issue if a SIGSTOP is delivered to the parent task. Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-03-21signal: don't allow sending any signals to PF_IO_WORKER threadsJens Axboe
They don't take signals individually, and even if they share signals with the parent task, don't allow them to be delivered through the worker thread. Linux does allow this kind of behavior for regular threads, but it's really a compatability thing that we need not care about for the IO threads. Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-21lockdep: Add a missing initialization hint to the "INFO: Trying to register ↵Tetsuo Handa
non-static key" message Since this message is printed when dynamically allocated spinlocks (e.g. kzalloc()) are used without initialization (e.g. spin_lock_init()), suggest to developers to check whether initialization functions for objects were called, before making developers wonder what annotation is missing. [ mingo: Minor tweaks to the message. ] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210321064913.4619-1-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-21genirq: Disable interrupts for force threaded handlersThomas Gleixner
With interrupt force threading all device interrupt handlers are invoked from kernel threads. Contrary to hard interrupt context the invocation only disables bottom halfs, but not interrupts. This was an oversight back then because any code like this will have an issue: thread(irq_A) irq_handler(A) spin_lock(&foo->lock); interrupt(irq_B) irq_handler(B) spin_lock(&foo->lock); This has been triggered with networking (NAPI vs. hrtimers) and console drivers where printk() happens from an interrupt which interrupted the force threaded handler. Now people noticed and started to change the spin_lock() in the handler to spin_lock_irqsave() which affects performance or add IRQF_NOTHREAD to the interrupt request which in turn breaks RT. Fix the root cause and not the symptom and disable interrupts before invoking the force threaded handler which preserves the regular semantics and the usefulness of the interrupt force threading as a general debugging tool. For not RT this is not changing much, except that during the execution of the threaded handler interrupts are delayed until the handler returns. Vs. scheduling and softirq processing there is no difference. For RT kernels there is no issue. Fixes: 8d32a307e4fa ("genirq: Provide forced interrupt threading") Reported-by: Johan Hovold <johan@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Johan Hovold <johan@kernel.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210317143859.513307808@linutronix.de
2021-03-19Merge tag 'io_uring-5.12-2021-03-19' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "Quieter week this time, which was both expected and desired. About half of the below is fixes for this release, the other half are just fixes in general. In detail: - Fix the freezing of IO threads, by making the freezer not send them fake signals. Make them freezable by default. - Like we did for personalities, move the buffer IDR to xarray. Kills some code and avoids a use-after-free on teardown. - SQPOLL cleanups and fixes (Pavel) - Fix linked timeout race (Pavel) - Fix potential completion post use-after-free (Pavel) - Cleanup and move internal structures outside of general kernel view (Stefan) - Use MSG_SIGNAL for send/recv from io_uring (Stefan)" * tag 'io_uring-5.12-2021-03-19' of git://git.kernel.dk/linux-block: io_uring: don't leak creds on SQO attach error io_uring: use typesafe pointers in io_uring_task io_uring: remove structures from include/linux/io_uring.h io_uring: imply MSG_NOSIGNAL for send[msg]()/recv[msg]() calls io_uring: fix sqpoll cancellation via task_work io_uring: add generic callback_head helpers io_uring: fix concurrent parking io_uring: halt SQO submission on ctx exit io_uring: replace sqd rw_semaphore with mutex io_uring: fix complete_post use ctx after free io_uring: fix ->flags races by linked timeouts io_uring: convert io_buffer_idr to XArray io_uring: allow IO worker threads to be frozen kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing
2021-03-19bpf: Remove insn_buf[] declaration in inner blockJianlin Lv
Two insn_buf[16] variables are declared in the function which acts on function scope and block scope respectively. The statement in the inner block is redundant, so remove it. Signed-off-by: Jianlin Lv <Jianlin.Lv@arm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210318024851.49693-1-Jianlin.Lv@arm.com
2021-03-19genirq/matrix: Prevent allocation counter corruptionVitaly Kuznetsov
When irq_matrix_free() is called for an unallocated vector the managed_allocated and total_allocated counters get out of sync with the real state of the matrix. Later, when the last interrupt is freed, these counters will underflow resulting in UINTMAX because the counters are unsigned. While this is certainly a problem of the calling code, this can be catched in the allocator by checking the allocation bit for the to be freed vector which simplifies debugging. An example of the problem described above: https://lore.kernel.org/lkml/20210318192819.636943062@linutronix.de/ Add the missing sanity check and emit a warning when it triggers. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210319111823.1105248-1-vkuznets@redhat.com
2021-03-19bpf: Fix umd memory leak in copy_process()Zqiang
The syzbot reported a memleak as follows: BUG: memory leak unreferenced object 0xffff888101b41d00 (size 120): comm "kworker/u4:0", pid 8, jiffies 4294944270 (age 12.780s) backtrace: [<ffffffff8125dc56>] alloc_pid+0x66/0x560 [<ffffffff81226405>] copy_process+0x1465/0x25e0 [<ffffffff81227943>] kernel_clone+0xf3/0x670 [<ffffffff812281a1>] kernel_thread+0x61/0x80 [<ffffffff81253464>] call_usermodehelper_exec_work [<ffffffff81253464>] call_usermodehelper_exec_work+0xc4/0x120 [<ffffffff812591c9>] process_one_work+0x2c9/0x600 [<ffffffff81259ab9>] worker_thread+0x59/0x5d0 [<ffffffff812611c8>] kthread+0x178/0x1b0 [<ffffffff8100227f>] ret_from_fork+0x1f/0x30 unreferenced object 0xffff888110ef5c00 (size 232): comm "kworker/u4:0", pid 8414, jiffies 4294944270 (age 12.780s) backtrace: [<ffffffff8154a0cf>] kmem_cache_zalloc [<ffffffff8154a0cf>] __alloc_file+0x1f/0xf0 [<ffffffff8154a809>] alloc_empty_file+0x69/0x120 [<ffffffff8154a8f3>] alloc_file+0x33/0x1b0 [<ffffffff8154ab22>] alloc_file_pseudo+0xb2/0x140 [<ffffffff81559218>] create_pipe_files+0x138/0x2e0 [<ffffffff8126c793>] umd_setup+0x33/0x220 [<ffffffff81253574>] call_usermodehelper_exec_async+0xb4/0x1b0 [<ffffffff8100227f>] ret_from_fork+0x1f/0x30 After the UMD process exits, the pipe_to_umh/pipe_from_umh and tgid need to be released. Fixes: d71fa5c9763c ("bpf: Add kernel module with user mode driver that populates bpffs.") Reported-by: syzbot+44908bb56d2bfe56b28e@syzkaller.appspotmail.com Signed-off-by: Zqiang <qiang.zhang@windriver.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210317030915.2865-1-qiang.zhang@windriver.com
2021-03-19kernel: debug: Ordinary typo fixes in the file gdbstub.cBhaskar Chowdhury
s/overwitten/overwritten/ s/procesing/processing/ Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Link: https://lore.kernel.org/r/20210317104658.4053473-1-unixbhaskar@gmail.com Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2021-03-19kdb: Simplify kdb commands registrationSumit Garg
Simplify kdb commands registration via using linked list instead of static array for commands storage. Signed-off-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lore.kernel.org/r/20210224070827.408771-1-sumit.garg@linaro.org Reviewed-by: Douglas Anderson <dianders@chromium.org> [daniel.thompson@linaro.org: Removed a bunch of .cmd_minline = 0 initializers] Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2021-03-19kdb: Remove redundant function definitions/prototypesSumit Garg
Cleanup kdb code to get rid of unused function definitions/prototypes. Signed-off-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lore.kernel.org/r/20210224071653.409150-1-sumit.garg@linaro.org Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2021-03-19static_call: Fix static_call_update() sanity checkPeter Zijlstra
Sites that match init_section_contains() get marked as INIT. For built-in code init_sections contains both __init and __exit text. OTOH kernel_text_address() only explicitly includes __init text (and there are no __exit text markers). Match what jump_label already does and ignore the warning for INIT sites. Also see the excellent changelog for commit: 8f35eaa5f2de ("jump_label: Don't warn on __exit jump entries") Fixes: 9183c3f9ed710 ("static_call: Add inline static call infrastructure") Reported-by: Sumit Garg <sumit.garg@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lkml.kernel.org/r/20210318113610.739542434@infradead.org
2021-03-19static_call: Align static_call_is_init() patching conditionPeter Zijlstra
The intent is to avoid writing init code after init (because the text might have been freed). The code is needlessly different between jump_label and static_call and not obviously correct. The existing code relies on the fact that the module loader clears the init layout, such that within_module_init() always fails, while jump_label relies on the module state which is more obvious and matches the kernel logic. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lkml.kernel.org/r/20210318113610.636651340@infradead.org
2021-03-19static_call: Fix static_call_set_init()Peter Zijlstra
It turns out that static_call_set_init() does not preserve the other flags; IOW. it clears TAIL if it was set. Fixes: 9183c3f9ed710 ("static_call: Add inline static call infrastructure") Reported-by: Sumit Garg <sumit.garg@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lkml.kernel.org/r/20210318113610.519406371@infradead.org
2021-03-19locking/locktorture: Fix incorrect use of ww_acquire_ctx in ww_mutex testWaiman Long
The ww_acquire_ctx structure for ww_mutex needs to persist for a complete lock/unlock cycle. In the ww_mutex test in locktorture, however, both ww_acquire_init() and ww_acquire_fini() are called within the lock function only. This causes a lockdep splat of "WARNING: Nested lock was not taken" when lockdep is enabled in the kernel. To fix this problem, we need to move the ww_acquire_fini() after the ww_mutex_unlock() in torture_ww_mutex_unlock(). This is done by allocating a global array of ww_acquire_ctx structures. Each locking thread is associated with its own ww_acquire_ctx via the unique thread id it has so that both the lock and unlock functions can access the same ww_acquire_ctx structure. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210318172814.4400-6-longman@redhat.com
2021-03-19locking/locktorture: Pass thread id to lock/unlock functionsWaiman Long
To allow the lock and unlock functions in locktorture to access per-thread information, we need to pass some hint on how to locate those information. One way to do this is to pass in a unique thread id which can then be used to access a global array for thread specific information. Change the lock and unlock method to add a thread id parameter which can be determined by the offset of the lwsp/lrsp pointer from the global lwsa/lrsa array. There is no other functional change in this patch. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210318172814.4400-5-longman@redhat.com
2021-03-19locking/locktorture: Fix false positive circular locking splat in ww_mutex testWaiman Long
In order to avoid false positive circular locking lockdep splat when runnng the ww_mutex torture test, we need to make sure that the ww_mutexes have the same lock class as the acquire_ctx. This means the ww_mutexes must have the same lockdep key as the acquire_ctx. Unfortunately the current DEFINE_WW_MUTEX() macro fails to do that. As a result, we add an init method for the ww_mutex test to do explicit ww_mutex_init()'s of the ww_mutexes to avoid the false positive warning. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210318172814.4400-3-longman@redhat.com
2021-03-19Merge branch 'locking/urgent' into locking/core, to pick up dependent commitsIngo Molnar
We are applying further, lower-prio fixes on top of two ww_mutex fixes in locking/urgent. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-19swiotlb: remove swiotlb_nr_tblChristoph Hellwig
All callers just use it to check if swiotlb is active at all, for which they can just use is_swiotlb_active. In the longer run drivers need to stop using is_swiotlb_active as well, but let's do the simple step first. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2021-03-19swiotlb: dynamically allocate io_tlb_default_memChristoph Hellwig
Instead of allocating ->list and ->orig_addr separately just do one dynamic allocation for the actual io_tlb_mem structure. This simplifies a lot of the initialization code, and also allows to just check io_tlb_default_mem to see if swiotlb is in use. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2021-03-19swiotlb: move global variables into a new io_tlb_mem structureClaire Chang
Added a new struct, io_tlb_mem, as the IO TLB memory pool descriptor and moved relevant global variables into that struct. This will be useful later to allow for restricted DMA pool. Signed-off-by: Claire Chang <tientzu@chromium.org> [hch: rebased] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2021-03-18cpufreq: schedutil: Call sugov_update_next_freq() before check to ↵Yue Hu
fast_switch_enabled Note that sugov_update_next_freq() may return false, that means the caller sugov_fast_switch() will do nothing except fast switch check. Similarly, sugov_deferred_update() also has unnecessary operations of raw_spin_{lock,unlock} in sugov_update_single_freq() for that case. So, let's call sugov_update_next_freq() before the fast switch check to avoid unnecessary behaviors above. Accordingly, update interface definition to sugov_deferred_update() and remove sugov_fast_switch() since we will call cpufreq_driver_fast_switch() directly instead. Signed-off-by: Yue Hu <huyue2@yulong.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-03-18tracing: Add a verifier to check string pointers for trace eventsSteven Rostedt (VMware)
It is a common mistake for someone writing a trace event to save a pointer to a string in the TP_fast_assign() and then display that string pointer in the TP_printk() with %s. The problem is that those two events may happen a long time apart, where the source of the string may no longer exist. The proper way to handle displaying any string that is not guaranteed to be in the kernel core rodata section, is to copy it into the ring buffer via the __string(), __assign_str() and __get_str() helper macros. Add a check at run time while displaying the TP_printk() of events to make sure that every %s referenced is safe to dereference, and if it is not, trigger a warning and only show the address of the pointer, and the dereferenced string if it can be safely retrieved with a strncpy_from_kernel_nofault() call. In order to not have to copy the parsing of vsnprintf() formats, or even exporting its code, the verifier relies on vsnprintf() being able to modify the va_list that is passed to it, and it remains modified after it is called. This is the case for some architectures like x86_64, but other architectures like x86_32 pass the va_list to vsnprintf() as a value not a reference, and the verifier can not use it to parse the non string arguments. Thus, at boot up, it is checked if vsnprintf() modifies the passed in va_list or not, and a static branch will disable the verifier if it's not compatible. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-18tracing: Add check of trace event print fmts for dereferencing pointersSteven Rostedt (VMware)
Trace events record data into the ring buffer at the time of the event. The trace event has a printf logic to display the recorded data at a much later time when the user reads the trace file. This makes using dereferencing pointers unsafe if the dereferenced pointer points to the original source. The safe way to handle this is to create an array within the trace event and copy the source into the array. Then the dereference pointer may point to that array. As this is a easy mistake to make, a check is added to examine all trace event print fmts to make sure that they are safe to use. This only checks the various %p* dereferenced pointers like %pB, %pR, etc. It does not handle dereferencing of strings, as there are some use cases that are OK to dereference the source. That will be dealt with differently. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-18tracing: Add tracing_event_time_stamp() APISteven Rostedt (VMware)
Add a tracing_event_time_stamp() API that checks if the event passed in is not on the ring buffer but a pointer to the per CPU trace_buffered_event which does not have its time stamp set yet. If it is a pointer to the trace_buffered_event, then just return the current time stamp that the ring buffer would produce. Otherwise, return the time stamp from the event. Link: https://lkml.kernel.org/r/20210316164114.131996180@goodmis.org Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-18ring-buffer: Add verifier for using ring_buffer_event_time_stamp()Steven Rostedt (VMware)
The ring_buffer_event_time_stamp() must be only called by an event that has not been committed yet, and is on the buffer that is passed in. This was used to help debug converting the histogram logic over to using the new time stamp code, and was proven to be very useful. Add a verifier that can check that this is the case, and extra WARN_ONs to catch unexpected use cases. Link: https://lkml.kernel.org/r/20210316164113.987294354@goodmis.org Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-18tracing: Use a no_filter_buffering_ref to stop using the filter bufferSteven Rostedt (VMware)
Currently, the trace histograms relies on it using absolute time stamps to trigger the tracing to not use the temp buffer if filters are set. That's because the histograms need the full timestamp that is saved in the ring buffer. That is no longer the case, as the ring_buffer_event_time_stamp() can now return the time stamp for all events without all triggering a full absolute time stamp. Now that the absolute time stamp is an unrelated dependency to not using the filters. There's nothing about having absolute timestamps to keep from using the filter buffer. Instead, change the interface to explicitly state to disable filter buffering that the histogram logic can use. Link: https://lkml.kernel.org/r/20210316164113.847886563@goodmis.org Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-18ring-buffer: Allow ring_buffer_event_time_stamp() to return time stamp of ↵Steven Rostedt (VMware)
all events Currently, ring_buffer_event_time_stamp() only returns an accurate time stamp of the event if it has an absolute extended time stamp attached to it. To make it more robust, use the event_stamp() in case the event does not have an absolute value attached to it. This will allow ring_buffer_event_time_stamp() to be used in more cases than just histograms, and it will also allow histograms to not require including absolute values all the time. Link: https://lkml.kernel.org/r/20210316164113.704830885@goodmis.org Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-18tracing: Pass buffer of event to trigger operationsSteven Rostedt (VMware)
The ring_buffer_event_time_stamp() is going to be updated to extract the time stamp for the event without needing it to be set to have absolute values for all events. But to do so, it needs the buffer that the event is on as the buffer saves information for the event before it is committed to the buffer. If the trace buffer is disabled, a temporary buffer is used, and there's no access to this buffer from the current histogram triggers, even though it is passed to the trace event code. Pass the buffer that the event is on all the way down to the histogram triggers. Link: https://lkml.kernel.org/r/20210316164113.542448131@goodmis.org Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-18ring-buffer: Add a event_stamp to cpu_buffer for each level of nestingSteven Rostedt (VMware)
Add a place to save the current event time stamp for each level of nesting. This will be used to retrieve the time stamp of the current event before it is committed. Link: https://lkml.kernel.org/r/20210316164113.399089673@goodmis.org Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-18ring-buffer: Separate out internal use of ring_buffer_event_time_stamp()Steven Rostedt (VMware)
The exported use of ring_buffer_event_time_stamp() is going to become different than how it is used internally. Move the internal logic out into a static function called rb_event_time_stamp(), and have the internal callers call that instead. Link: https://lkml.kernel.org/r/20210316164113.257790481@goodmis.org Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-18Revert "PM: ACPI: reboot: Use S5 for reboot"Josef Bacik
This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095. This patch causes a panic when rebooting my Dell Poweredge r440. I do not have the full panic log as it's lost at that stage of the reboot and I do not have a serial console. Reverting this patch makes my system able to reboot again. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-03-18bpf, devmap: Move drop error path to devmap for XDP_REDIRECTLorenzo Bianconi
We want to change the current ndo_xdp_xmit drop semantics because it will allow us to implement better queue overflow handling. This is working towards the larger goal of a XDP TX queue-hook. Move XDP_REDIRECT error path handling from each XDP ethernet driver to devmap code. According to the new APIs, the driver running the ndo_xdp_xmit pointer, will break tx loop whenever the hw reports a tx error and it will just return to devmap caller the number of successfully transmitted frames. It will be devmap responsibility to free dropped frames. Move each XDP ndo_xdp_xmit capable driver to the new APIs: - veth - virtio-net - mvneta - mvpp2 - socionext - amazon ena - bnxt - freescale (dpaa2, dpaa) - xen-frontend - qede - ice - igb - ixgbe - i40e - mlx5 - ti (cpsw, cpsw-new) - tun - sfc Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Camelia Groza <camelia.groza@nxp.com> Acked-by: Edward Cree <ecree.xilinx@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Shay Agroskin <shayagr@amazon.com> Link: https://lore.kernel.org/bpf/ed670de24f951cfd77590decf0229a0ad7fd12f6.1615201152.git.lorenzo@kernel.org
2021-03-18time/debug: Remove dentry pointer for debugfsGreg Kroah-Hartman
There is no need to keep the dentry pointer around for the created debugfs file, as it is only needed when removing it from the system. When it is to be removed, ask debugfs itself for the pointer, to save on storage and make things a bit simpler. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210216155020.1012407-1-gregkh@linuxfoundation.org
2021-03-18bpf: Fix fexit trampoline.Alexei Starovoitov
The fexit/fmod_ret programs can be attached to kernel functions that can sleep. The synchronize_rcu_tasks() will not wait for such tasks to complete. In such case the trampoline image will be freed and when the task wakes up the return IP will point to freed memory causing the crash. Solve this by adding percpu_ref_get/put for the duration of trampoline and separate trampoline vs its image life times. The "half page" optimization has to be removed, since first_half->second_half->first_half transition cannot be guaranteed to complete in deterministic time. Every trampoline update becomes a new image. The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and call_rcu_tasks. Together they will wait for the original function and trampoline asm to complete. The trampoline is patched from nop to jmp to skip fexit progs. They are freed independently from the trampoline. The image with fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which will wait for both sleepable and non-sleepable progs to complete. Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Paul E. McKenney <paulmck@kernel.org> # for RCU Link: https://lore.kernel.org/bpf/20210316210007.38949-1-alexei.starovoitov@gmail.com
2021-03-17bpf: Add sanity check for upper ptr_limitPiotr Krysiuk
Given we know the max possible value of ptr_limit at the time of retrieving the latter, add basic assertions, so that the verifier can bail out if anything looks odd and reject the program. Nothing triggered this so far, but it also does not hurt to have these. Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-03-17irq: Simplify condition in irq_matrix_reserve()Juergen Gross
The if condition in irq_matrix_reserve() can be much simpler. While at it fix a typo in the comment. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210211070953.5914-1-jgross@suse.com
2021-03-17bpf: Simplify alu_limit masking for pointer arithmeticPiotr Krysiuk
Instead of having the mov32 with aux->alu_limit - 1 immediate, move this operation to retrieve_ptr_limit() instead to simplify the logic and to allow for subsequent sanity boundary checks inside retrieve_ptr_limit(). This avoids in future that at the time of the verifier masking rewrite we'd run into an underflow which would not sign extend due to the nature of mov32 instruction. Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-03-17bpf: Fix off-by-one for area size in creating mask to leftPiotr Krysiuk
retrieve_ptr_limit() computes the ptr_limit for registers with stack and map_value type. ptr_limit is the size of the memory area that is still valid / in-bounds from the point of the current position and direction of the operation (add / sub). This size will later be used for masking the operation such that attempting out-of-bounds access in the speculative domain is redirected to remain within the bounds of the current map value. When masking to the right the size is correct, however, when masking to the left, the size is off-by-one which would lead to an incorrect mask and thus incorrect arithmetic operation in the non-speculative domain. Piotr found that if the resulting alu_limit value is zero, then the BPF_MOV32_IMM() from the fixup_bpf_calls() rewrite will end up loading 0xffffffff into AX instead of sign-extending to the full 64 bit range, and as a result, this allows abuse for executing speculatively out-of- bounds loads against 4GB window of address space and thus extracting the contents of kernel memory via side-channel. Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-03-17bpf: Prohibit alu ops for pointer types not defining ptr_limitPiotr Krysiuk
The purpose of this patch is to streamline error propagation and in particular to propagate retrieve_ptr_limit() errors for pointer types that are not defining a ptr_limit such that register-based alu ops against these types can be rejected. The main rationale is that a gap has been identified by Piotr in the existing protection against speculatively out-of-bounds loads, for example, in case of ctx pointers, unprivileged programs can still perform pointer arithmetic. This can be abused to execute speculatively out-of-bounds loads without restrictions and thus extract contents of kernel memory. Fix this by rejecting unprivileged programs that attempt any pointer arithmetic on unprotected pointer types. The two affected ones are pointer to ctx as well as pointer to map. Field access to a modified ctx' pointer is rejected at a later point in time in the verifier, and 7c6967326267 ("bpf: Permit map_ptr arithmetic with opcode add and offset 0") only relevant for root-only use cases. Risk of unprivileged program breakage is considered very low. Fixes: 7c6967326267 ("bpf: Permit map_ptr arithmetic with opcode add and offset 0") Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation") Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-03-17tick/sched: Prevent false positive softirq pending warnings on RTThomas Gleixner
On RT a task which has soft interrupts disabled can block on a lock and schedule out to idle while soft interrupts are pending. This triggers the warning in the NOHZ idle code which complains about going idle with pending soft interrupts. But as the task is blocked soft interrupt processing is temporarily blocked as well which means that such a warning is a false positive. To prevent that check the per CPU state which indicates that a scheduled out task has soft interrupts disabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210309085727.527563866@linutronix.de
2021-03-17softirq: Make softirq control and processing RT awareThomas Gleixner
Provide a local lock based serialization for soft interrupts on RT which allows the local_bh_disabled() sections and servicing soft interrupts to be preemptible. Provide the necessary inline helpers which allow to reuse the bulk of the softirq processing code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210309085727.426370483@linutronix.de
2021-03-17softirq: Move various protections into inline helpersThomas Gleixner
To allow reuse of the bulk of softirq processing code for RT and to avoid #ifdeffery all over the place, split protections for various code sections out into inline helpers so the RT variant can just replace them in one go. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210309085727.310118772@linutronix.de
2021-03-17irqtime: Make accounting correct on RTThomas Gleixner
vtime_account_irq and irqtime_account_irq() base checks on preempt_count() which fails on RT because preempt_count() does not contain the softirq accounting which is seperate on RT. These checks do not need the full preempt count as they only operate on the hard and softirq sections. Use irq_count() instead which provides the correct value on both RT and non RT kernels. The compiler is clever enough to fold the masking for !RT: 99b: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax - 9a2: 25 ff ff ff 7f and $0x7fffffff,%eax + 9a2: 25 00 ff ff 00 and $0xffff00,%eax Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210309085727.153926793@linutronix.de
2021-03-17tasklets: Prevent tasklet_unlock_spin_wait() deadlock on RTThomas Gleixner
tasklet_unlock_spin_wait() spin waits for the TASKLET_STATE_SCHED bit in the tasklet state to be cleared. This works on !RT nicely because the corresponding execution can only happen on a different CPU. On RT softirq processing is preemptible, therefore a task preempting the softirq processing thread can spin forever. Prevent this by invoking local_bh_disable()/enable() inside the loop. In case that the softirq processing thread was preempted by the current task, current will block on the local lock which yields the CPU to the preempted softirq processing thread. If the tasklet is processed on a different CPU then the local_bh_disable()/enable() pair is just a waste of processor cycles. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210309084241.988908275@linutronix.de
2021-03-17tasklets: Replace spin wait in tasklet_kill()Peter Zijlstra
tasklet_kill() spin waits for TASKLET_STATE_SCHED to be cleared invoking yield() from inside the loop. yield() is an ill defined mechanism and the result might still be wasting CPU cycles in a tight loop which is especially painful in a guest when the CPU running the tasklet is scheduled out. tasklet_kill() is used in teardown paths and not performance critical at all. Replace the spin wait with wait_var_event(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210309084241.890532921@linutronix.de
2021-03-17tasklets: Replace spin wait in tasklet_unlock_wait()Peter Zijlstra
tasklet_unlock_wait() spin waits for TASKLET_STATE_RUN to be cleared. This is wasting CPU cycles in a tight loop which is especially painful in a guest when the CPU running the tasklet is scheduled out. tasklet_unlock_wait() is invoked from tasklet_kill() which is used in teardown paths and not performance critical at all. Replace the spin wait with wait_var_event(). There are no users of tasklet_unlock_wait() which are invoked from atomic contexts. The usage in tasklet_disable() has been replaced temporarily with the spin waiting variant until the atomic users are fixed up and will be converted to the sleep wait variant later. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210309084241.783936921@linutronix.de
2021-03-17rseq, ptrace: Add PTRACE_GET_RSEQ_CONFIGURATION requestPiotr Figiel
For userspace checkpoint and restore (C/R) a way of getting process state containing RSEQ configuration is needed. There are two ways this information is going to be used: - to re-enable RSEQ for threads which had it enabled before C/R - to detect if a thread was in a critical section during C/R Since C/R preserves TLS memory and addresses RSEQ ABI will be restored using the address registered before C/R. Detection whether the thread is in a critical section during C/R is needed to enforce behavior of RSEQ abort during C/R. Attaching with ptrace() before registers are dumped itself doesn't cause RSEQ abort. Restoring the instruction pointer within the critical section is problematic because rseq_cs may get cleared before the control is passed to the migrated application code leading to RSEQ invariants not being preserved. C/R code will use RSEQ ABI address to find the abort handler to which the instruction pointer needs to be set. To achieve above goals expose the RSEQ ABI address and the signature value with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION. This new ptrace request can also be used by debuggers so they are aware of stops within restartable sequences in progress. Signed-off-by: Piotr Figiel <figiel@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Michal Miroslaw <emmir@google.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20210226135156.1081606-1-figiel@google.com
2021-03-17quota: wire up quotactl_pathSascha Hauer
Wire up the quotactl_path syscall added in the previous patch. Link: https://lore.kernel.org/r/20210304123541.30749-3-s.hauer@pengutronix.de Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2021-03-17softirq: s/BUG/WARN_ONCE/ on tasklet SCHED state not setDirk Behme
Replace BUG() with WARN_ONCE() on wrong tasklet state, in order to: - increase the verbosity / aid in debugging - avoid fatal/unrecoverable state Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dirk Behme <dirk.behme@de.bosch.com> Signed-off-by: Eugeniu Rosca <erosca@de.adit-jv.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210317102012.32399-1-erosca@de.adit-jv.com