summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2017-06-08rcu: Add lockdep_assert_held() teeth to tree_plugin.hPaul E. McKenney
Comments can be helpful, but assertions carry more force. This commit therefore adds lockdep_assert_held() and RCU_LOCKDEP_WARN() calls to enforce lock-held and interrupt-disabled preconditions. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08rcu: Add lockdep_assert_held() teeth to tree.cPaul E. McKenney
Comments can be helpful, but assertions carry more force. This commit therefore adds lockdep_assert_held() and RCU_LOCKDEP_WARN() calls to enforce lock-held and interrupt-disabled preconditions. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08srcu: Print non-default exp_holdoff values at boot timePaul E. McKenney
This commit makes srcu_bootup_announce() check for non-default values of the auto-expedite holdoff time exp_holdoff and print a message if so. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08srcu: Make exp_holdoff module parameter be staticPaul E. McKenney
Because exp_holdoff is not used outside of srcutree.c, it can be static. This commit therefore makes this change. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08rcu: Update rcu_bootup_announce_oddness()Paul E. McKenney
This commit updates rcu_bootup_announce_oddness() to check additional Kconfig options and module/boot parameters. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08rcu: Print out rcupdate.c non-default boot-time settingsPaul E. McKenney
This commit adds a rcupdate_announce_bootup_oddness() function to print out non-default values of significant kernel boot parameter settings to aid in debugging. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08rcu: Add preemptibility checks in rcu_sched_qs() and rcu_bh_qs()Paul E. McKenney
This commit adds WARN_ON_ONCE() calls that trigger if either rcu_sched_qs() or rcu_bh_qs() are invoked with preemption enabled. In the immortal words of Peter Zijlstra: "these are much harder to ignore than comments". Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08rcuperf: Add writer_holdoff boot parameterPaul E. McKenney
This commit adds a writer_holdoff boot parameter to rcuperf, which is intended to be used to test Tree SRCU's auto-expediting. This boot parameter is in microseconds, and defaults to zero (that is, disabled). Set it to a bit larger than srcutree.exp_holdoff, keeping the nanosecond/microsecond conversion, to force Tree SRCU to auto-expedite more aggressively. This commit also adds documentation for this parameter, and fixes some alphabetization while in the neighborhood. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08rcuperf: Set more user-friendly defaultsPaul E. McKenney
Common-case use of rcuperf must set rcuperf.nreaders=0 and if not built as a module, rcuperf.shutdown. This commit therefore sets the default for rcuperf.nreaders to zero and sets the default for rcuperf.shutdown to zero if rcuperf is built as a module and to one otherwise. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08srcu: Shrink Tiny SRCU a bit morePaul E. McKenney
This commit rearranges Tiny SRCU's srcu_struct structure, substitutes u8 for bool, and shrinks counters down to short. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08srcu: Make Classic and Tree SRCU announce themselves at bootupPaul E. McKenney
Currently, the only way to tell whether a given kernel is running Classic, Tiny, or Tree SRCU is to look at the .config file, which can easily be lost or associated with the wrong kernel. This commit therefore has Classic and Tree SRCU identify themselves at boot time. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08rcuperf: Add test for dynamically initialized srcu_structPaul E. McKenney
This commit adds a perf_type of "srcud", which species that rcuperf test SRCU on a dynamically initialized srcu_struct. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08rcu: Make sync_rcu_preempt_exp_done() return boolPaul E. McKenney
The sync_rcu_preempt_exp_done() function returns a logical expression, but its return type is nevertheless int. This commit therefore changes the return type to bool. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08rcuperf: Add ability to performance-test call_rcu() and friendsPaul E. McKenney
This commit upgrades rcuperf so that it can do performance testing on asynchronous grace-period primitives such as call_srcu(). There is a new rcuperf.gp_async module parameter that specifies this new behavior, with the pre-existing rcuperf.gp_exp testing expedited grace periods such as synchronize_rcu_expedited, and with the default being to test synchronous non-expedited grace periods such as synchronize_rcu(). There is also a new rcuperf.gp_async_max module parameter that specifies the maximum number of outstanding callbacks per writer kthread, defaulting to 1,000. When this limit is exceeded, the writer thread invokes the appropriate flavor of rcu_barrier() to wait for callbacks to drain. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Removed the redundant initialization noted by Arnd Bergmann. ]
2017-06-08rcu: Remove obsolete reference to synchronize_kernel()Paul E. McKenney
The synchronize_kernel() primitive was removed in favor of synchronize_sched() more than a decade ago, and it seems likely that rather few kernel hackers are familiar with it. Its continued presence is therefore providing more confusion than enlightenment. This commit therefore removes the reference from the synchronize_sched() header comment, and adds the corresponding information to the synchronize_rcu(0 header comment. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08rcuperf: Defer expedited/normal check to end of testPaul E. McKenney
Current rcuperf startup checks to see if the user asked to measure only expedited grace periods, yet constrained all grace periods to be normal, or if the user asked to measure only normal grace periods, yet constrained all grace periods to be expedited. Useless tests of this sort are aborted. Unfortunately, making RCU work through the mid-boot dead zone [1] puts RCU into expedited-only mode during that zone. Which happens to also be the exact time that rcuperf carries out the aforementioned check. So if the user asks rcuperf to measure only normal grace periods (the default), rcuperf will now always complain and terminate the test. This commit therefore moves the checks to rcu_perf_cleanup(). This has the disadvantage of failing to abort useless tests, but avoids the need to create yet another kthread and the need to do fiddly checks involving the holdoff time. (Yes, another approach is to do the checks in a late-stage init function, but that would require some way to communicate badness to rcuperf's kthreads, and seems not worth the bother.) [1] https://lwn.net/Articles/716148/ Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08rcu: Complain if blocking in preemptible RCU read-side critical sectionPaul E. McKenney
Although preemptible RCU allows its read-side critical sections to be preempted, general blocking is forbidden. The reason for this is that excessive preemption times can be handled by CONFIG_RCU_BOOST=y, but a voluntarily blocked task doesn't care how high you boost its priority. Because preemptible RCU is a global mechanism, one ill-behaved reader hurts everyone. Hence the prohibition against general blocking in RCU-preempt read-side critical sections. Preemption yes, blocking no. This commit enforces this prohibition. There is a special exception for the -rt patchset (which they kindly volunteered to implement): It is OK to block (as opposed to merely being preempted) within an RCU-preempt read-side critical section, but only if the blocking is subject to priority inheritance. This exception permits CONFIG_RCU_BOOST=y to get -rt RCU readers out of trouble. Why doesn't this exception also apply to mainline's rt_mutex? Because of the possibility that someone does general blocking while holding an rt_mutex. Yes, the priority boosting will affect the rt_mutex, but it won't help with the task doing general blocking while holding that rt_mutex. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08srcu: Eliminate possibility of destructive counter overflowPaul E. McKenney
Earlier versions of Tree SRCU were subject to a counter overflow bug that could theoretically result in too-short grace periods. This commit eliminates this problem by adding an update-side memory barrier. The short explanation is that if the updater sums the unlock counts too late to see a given __srcu_read_unlock() increment, that CPU's next __srcu_read_lock() must see the new value of ->srcu_idx, thus incrementing the other bank of counters. This eliminates the possibility of destructive counter overflow as long as the srcu_read_lock() nesting level does not exceed floor(ULONG_MAX/NR_CPUS/2), which should be an eminently reasonable nesting limit, especially on 64-bit systems. Reported-by: Lance Roy <ldr709@gmail.com> Suggested-by: Lance Roy <ldr709@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08rcu: Prevent rcu_barrier() from starting needless grace periodsPaul E. McKenney
Currently rcu_barrier() uses call_rcu() to enqueue new callbacks on each CPU with a non-empty callback list. This works, but means that rcu_barrier() forces grace periods that are not otherwise needed. The key point is that rcu_barrier() never needs to wait for a grace period, but instead only for all pre-existing callbacks to be invoked. This means that rcu_barrier()'s new callbacks should be placed in the callback-list segment containing the last pre-existing callback. This commit makes this change using the new rcu_segcblist_entrain() function. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08srcu: Allow use of Classic SRCU from both process and interrupt contextPaolo Bonzini
Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting down a guest running iperf on a VFIO assigned device. This happens because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt context, while a worker thread does the same inside kvm_set_irq(). If the interrupt happens while the worker thread is executing __srcu_read_lock(), updates to the Classic SRCU ->lock_count[] field or the Tree SRCU ->srcu_lock_count[] field can be lost. The docs say you are not supposed to call srcu_read_lock() and srcu_read_unlock() from irq context, but KVM interrupt injection happens from (host) interrupt context and it would be nice if SRCU supported the use case. KVM is using SRCU here not really for the "sleepable" part, but rather due to its IPI-free fast detection of grace periods. It is therefore not desirable to switch back to RCU, which would effectively revert commit 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING", 2014-01-16). However, the docs are overly conservative. You can have an SRCU instance only has users in irq context, and you can mix process and irq context as long as process context users disable interrupts. In addition, __srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and Classic SRCU. For those two implementations, only srcu_read_lock() is unsafe. When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(), in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments. Therefore it kept __this_cpu_inc(), with preempt_disable/enable in the caller. Tree SRCU however only does one increment, so on most architectures it is more efficient for __srcu_read_lock() to use this_cpu_inc(), and any performance differences appear to be down in the noise. Cc: stable@vger.kernel.org Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING") Reported-by: Linu Cherian <linuc.decode@gmail.com> Suggested-by: Linu Cherian <linuc.decode@gmail.com> Cc: kvm@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08srcu: Allow use of Tiny/Tree SRCU from both process and interrupt contextPaolo Bonzini
Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting down a guest running iperf on a VFIO assigned device. This happens because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt context, while a worker thread does the same inside kvm_set_irq(). If the interrupt happens while the worker thread is executing __srcu_read_lock(), updates to the Classic SRCU ->lock_count[] field or the Tree SRCU ->srcu_lock_count[] field can be lost. The docs say you are not supposed to call srcu_read_lock() and srcu_read_unlock() from irq context, but KVM interrupt injection happens from (host) interrupt context and it would be nice if SRCU supported the use case. KVM is using SRCU here not really for the "sleepable" part, but rather due to its IPI-free fast detection of grace periods. It is therefore not desirable to switch back to RCU, which would effectively revert commit 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING", 2014-01-16). However, the docs are overly conservative. You can have an SRCU instance only has users in irq context, and you can mix process and irq context as long as process context users disable interrupts. In addition, __srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and Classic SRCU. For those two implementations, only srcu_read_lock() is unsafe. When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(), in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments. Therefore it kept __this_cpu_inc(), with preempt_disable/enable in the caller. Tree SRCU however only does one increment, so on most architectures it is more efficient for __srcu_read_lock() to use this_cpu_inc(), and any performance differences appear to be down in the noise. Unlike Classic and Tree SRCU, Tiny SRCU does increments and decrements on a single variable. Therefore, as Peter Zijlstra pointed out, Tiny SRCU's implementation already supports mixed-context use of srcu_read_lock() and srcu_read_unlock(), at least as long as uses of srcu_read_lock() and srcu_read_unlock() in each handler are nested and paired properly. In other words, it is still illegal to (say) invoke srcu_read_lock() in an interrupt handler and to invoke the matching srcu_read_unlock() in a softirq handler. Therefore, the only change required for Tiny SRCU is to its comments. Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING") Reported-by: Linu Cherian <linuc.decode@gmail.com> Suggested-by: Linu Cherian <linuc.decode@gmail.com> Cc: kvm@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08Revert "printk: fix double printing with earlycon"Petr Mladek
This reverts commit cf39bf58afdaabc0b86f141630fb3fd18190294e. The commit regression to users that define both console=ttyS1 and console=ttyS0 on the command line, see https://lkml.kernel.org/r/20170509082915.GA13236@bistromath.localdomain The kernel log messages always appeared only on one serial port. It is even documented in Documentation/admin-guide/serial-console.rst: "Note that you can only define one console per device type (serial, video)." The above mentioned commit changed the order in which the command line parameters are searched. As a result, the kernel log messages go to the last mentioned ttyS* instead of the first one. We long thought that using two console=ttyS* on the command line did not make sense. But then we realized that console= parameters were handled also by systemd, see http://0pointer.de/blog/projects/serial-console.html "By default systemd will instantiate one serial-getty@.service on the main kernel console, if it is not a virtual terminal." where "[4] If multiple kernel consoles are used simultaneously, the main console is the one listed first in /sys/class/tty/console/active, which is the last one listed on the kernel command line." This puts the original report into another light. The system is running in qemu. The first serial port is used to store the messages into a file. The second one is used to login to the system via a socket. It depends on systemd and the historic kernel behavior. By other words, systemd causes that it makes sense to define both console=ttyS1 console=ttyS0 on the command line. The kernel fix caused regression related to userspace (systemd) and need to be reverted. In addition, it went out that the fix helped only partially. The messages still were duplicated when the boot console was removed early by late_initcall(printk_late_init). Then the entire log was replayed when the same console was registered as a normal one. Link: 20170606160339.GC7604@pathway.suse.cz Cc: Aleksey Makarov <aleksey.makarov@linaro.org> Cc: Sabrina Dubroca <sd@queasysnail.net> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Jiri Slaby <jslaby@suse.com> Cc: Robin Murphy <robin.murphy@arm.com>, Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Nair, Jayachandran" <Jayachandran.Nair@cavium.com> Cc: linux-serial@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reported-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-06-08rt_mutex: Add lockdep annotationsPeter Zijlstra
Now that (PI) futexes have their own private RT-mutex interface and implementation we can easily add lockdep annotations to the existing RT-mutex interface. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/idle: Add deferrable vmstat_updater backAubrey Li
Deferrable vmstat_updater was missing in commit: c1de45ca831a ("sched/idle: Add support for tasks that inject idle") Add it back. Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aubrey Li <aubrey.li@intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1496803742-38274-1-git-send-email-aubrey.li@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/core: Omit building stop_sched_class when !SMPNicolas Pitre
The stop class is invoked through stop_machine only. This is dead code on UP builds. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170529210302.26868-3-nicolas.pitre@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/deadline: Use the revised wakeup rule for suspending constrained dl tasksDaniel Bristot de Oliveira
We have been facing some problems with self-suspending constrained deadline tasks. The main reason is that the original CBS was not designed for such sort of tasks. One problem reported by Xunlei Pang takes place when a task suspends, and then is awakened before the deadline, but so close to the deadline that its remaining runtime can cause the task to have an absolute density higher than allowed. In such situation, the original CBS assumes that the task is facing an early activation, and so it replenishes the task and set another deadline, one deadline in the future. This rule works fine for implicit deadline tasks. Moreover, it allows the system to adapt the period of a task in which the external event source suffered from a clock drift. However, this opens the window for bandwidth leakage for constrained deadline tasks. For instance, a task with the following parameters: runtime = 5 ms deadline = 7 ms [density] = 5 / 7 = 0.71 period = 1000 ms If the task runs for 1 ms, and then suspends for another 1ms, it will be awakened with the following parameters: remaining runtime = 4 laxity = 5 presenting a absolute density of 4 / 5 = 0.80. In this case, the original CBS would assume the task had an early wakeup. Then, CBS will reset the runtime, and the absolute deadline will be postponed by one relative deadline, allowing the task to run. The problem is that, if the task runs this pattern forever, it will keep receiving bandwidth, being able to run 1ms every 2ms. Following this behavior, the task would be able to run 500 ms in 1 sec. Thus running more than the 5 ms / 1 sec the admission control allowed it to run. Trying to address the self-suspending case, Luca Abeni, Giuseppe Lipari, and Juri Lelli [1] revisited the CBS in order to deal with self-suspending tasks. In the new approach, rather than replenishing/postponing the absolute deadline, the revised wakeup rule adjusts the remaining runtime, reducing it to fit into the allowed density. A revised version of the idea is: At a given time t, the maximum absolute density of a task cannot be higher than its relative density, that is: runtime / (deadline - t) <= dl_runtime / dl_deadline Knowing the laxity of a task (deadline - t), it is possible to move it to the other side of the equality, thus enabling to define max remaining runtime a task can use within the absolute deadline, without over-running the allowed density: runtime = (dl_runtime / dl_deadline) * (deadline - t) For instance, in our previous example, the task could still run: runtime = ( 5 / 7 ) * 5 runtime = 3.57 ms Without causing damage for other deadline tasks. It is note worthy that the laxity cannot be negative because that would cause a negative runtime. Thus, this patch depends on the patch: df8eac8cafce ("sched/deadline: Throttle a constrained deadline task activated after the deadline") Which throttles a constrained deadline task activated after the deadline. Finally, it is also possible to use the revised wakeup rule for all other tasks, but that would require some more discussions about pros and cons. Reported-by: Xunlei Pang <xpang@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> [peterz: replaced dl_is_constrained with dl_is_implicit] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/5c800ab3a74a168a84ee5f3f84d12a02e11383be.1495803804.git.bristot@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/deadline: Zero out positive runtime after throttling constrained tasksXunlei Pang
When a contrained task is throttled by dl_check_constrained_dl(), it may carry the remaining positive runtime, as a result when dl_task_timer() fires and calls replenish_dl_entity(), it will not be replenished correctly due to the positive dl_se->runtime. This patch assigns its runtime to 0 if positive after throttling. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: df8eac8cafce ("sched/deadline: Throttle a constrained deadline task activated after the deadline) Link: http://lkml.kernel.org/r/1494421417-27550-1-git-send-email-xlpang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/deadline: Reclaim bandwidth not used by dl tasksLuca Abeni
This commit introduces a per-runqueue "extra utilization" that can be reclaimed by deadline tasks. In this way, the maximum fraction of CPU time that can reclaimed by deadline tasks is fixed (and configurable) and does not depend on the total deadline utilization. The GRUB accounting rule is modified to add this "extra utilization" to the inactive utilization of the runqueue, and to avoid reclaiming more than a maximum fraction of the CPU time. Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-10-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/deadline: Base GRUB reclaiming on the inactive utilizationLuca Abeni
Instead of decreasing the runtime as "dq = -Uact dt" (eventually divided by the maximum utilization available for deadline tasks), decrease it as "dq = -max{u, (1 - Uinact)} dt", where u is the task utilization and Uinact is the "inactive utilization". In this way, the maximum fraction of CPU time that can be reclaimed is given by the total utilization of deadline tasks. This approach solves a fairness issue with "traditional" global GRUB reclaiming: using the traditional GRUB algorithm, if tasks are allocated to the various cores in a non-uniform way, the reclaiming mechanism allows some tasks to reclaim more time than others. This issue is visible starting 11 time-consuming tasks with runtime 10ms and period 30ms (total utilization 3.666) on a 4-cores system: some tasks will receive much more than the reserved runtime (thanks to the reclaiming mechanism), while other tasks will receive less than the reserved runtime. Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-9-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/deadline: Track the "total rq utilization" tooLuca Abeni
The total rq utilization is defined as the sum of the utilisations of tasks that are "assigned" to a runqueue, independently from their state (TASK_RUNNING or blocked) Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it> Signed-off-by: Claudio Scordino <claudio@evidence.eu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-8-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/deadline: Make GRUB a task's flagLuca Abeni
This patch introduces the SCHED_FLAG_RECLAIM flag to specify that a DL task is allowed to reclaim unused CPU time (using the GRUB algorithm). Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-7-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/deadline: Do not reclaim the whole CPU bandwidthLuca Abeni
Original GRUB tends to reclaim 100% of the CPU time... And this allows a CPU hog to starve non-deadline tasks. To address this issue, allow the scheduler to reclaim only a specified fraction of CPU time, stored in the new "bw_ratio" field of the dl runqueue structure. Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-6-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/deadline: Implement GRUB accountingLuca Abeni
According to the GRUB (Greedy Reclaimation of Unused Bandwidth) reclaiming algorithm, the runtime is not decreased as "dq = -dt", but as "dq = -Uact dt" (where Uact is the per-runqueue active utilization). Hence, this commit modifies the runtime accounting rule in update_curr_dl() to implement the GRUB rule. Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-5-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/deadline: Fix the update of the total -deadline utilizationLuca Abeni
Now that the inactive timer can be armed to fire at the 0-lag time, it is possible to use inactive_task_timer() to update the total -deadline utilization (dl_b->total_bw) at the correct time, fixing dl_overflow() and __setparam_dl(). Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-4-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/deadline: Improve the tracking of active utilizationLuca Abeni
This patch implements a more theoretically sound algorithm for tracking active utilization: instead of decreasing it when a task blocks, use a timer (the "inactive timer", named after the "Inactive" task state of the GRUB algorithm) to decrease the active utilization at the so called "0-lag time". Tested-by: Claudio Scordino <claudio@evidence.eu.com> Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-3-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/deadline: Track the active utilizationLuca Abeni
Active utilization is defined as the total utilization of active (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased when a task wakes up and is decreased when a task blocks. When a task is migrated from CPUi to CPUj, immediately subtract the task's utilization from CPUi and add it to CPUj. This mechanism is implemented by modifying the pull and push functions. Note: this is not fully correct from the theoretical point of view (the utilization should be removed from CPUi only at the 0 lag time), a more theoretically sound solution is presented in the next patches. Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Luca Abeni <luca.abeni@unitn.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@arm.com> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-2-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08sched/core: Implement new approach to scale select_idle_cpu()Peter Zijlstra
Hackbench recently suffered a bunch of pain, first by commit: 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive") and then by commit: c743f0a5c50f ("sched/fair, cpumask: Export for_each_cpu_wrap()") which fixed a bug in the initial for_each_cpu_wrap() implementation that made select_idle_cpu() even more expensive. The bug was that it would skip over CPUs when bits were consequtive in the bitmask. This however gave me an idea to fix select_idle_cpu(); where the old scheme was a cliff-edge throttle on idle scanning, this introduces a more gradual approach. Instead of stopping to scan entirely, we limit how many CPUs we scan. Initial benchmarks show that it mostly recovers hackbench while not hurting anything else, except Mason's schbench, but not as bad as the old thing. It also appears to recover the tbench high-end, which also suffered like hackbench. Tested-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: hpa@zytor.com Cc: kitsunyan <kitsunyan@inbox.ru> Cc: linux-kernel@vger.kernel.org Cc: lvenanci@redhat.com Cc: riel@redhat.com Cc: xiaolong.ye@intel.com Link: http://lkml.kernel.org/r/20170517105350.hk5m4h4jb6dfr65a@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08perf/core: Remove unused perf_cgroup_event_cgrp_time() functionMatthias Kaehlcke
The function was added by commit e5d1367f17ba ("perf: Add cgroup support") in 2011 and hasn't been used since then. Removing it fixes the following warning when building with Clang: kernel/events/core.c:696:19: error: unused function 'perf_cgroup_event_cgrp_time' [-Werror,-Wunused-function] Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Douglas Anderson <dianders@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170523215132.189049-1-mka@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08perf/core: Correct event creation with PERF_FORMAT_GROUPPeter Zijlstra
Andi was asking about PERF_FORMAT_GROUP vs inherited events, which led to the discovery of a bug from commit: 3dab77fb1bf8 ("perf: Rework/fix the whole read vs group stuff") - PERF_SAMPLE_GROUP = 1U << 4, + PERF_SAMPLE_READ = 1U << 4, - if (attr->inherit && (attr->sample_type & PERF_SAMPLE_GROUP)) + if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP)) is a clear fail :/ While this changes user visible behaviour; it was previously possible to create an inherited event with PERF_SAMPLE_READ; this is deemed acceptible because its results were always incorrect. Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vince@deater.net> Fixes: 3dab77fb1bf8 ("perf: Rework/fix the whole read vs group stuff") Link: http://lkml.kernel.org/r/20170530094512.dy2nljns2uq7qa3j@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-08perf/core: Drop kernel samples even though :u is specifiedJin Yao
When doing sampling, for example: perf record -e cycles:u ... On workloads that do a lot of kernel entry/exits we see kernel samples, even though :u is specified. This is due to skid existing. This might be a security issue because it can leak kernel addresses even though kernel sampling support is disabled. The patch drops the kernel samples if exclude_kernel is specified. For example, test on Haswell desktop: perf record -e cycles:u <mgen> perf report --stdio Before patch applied: 99.77% mgen mgen [.] buf_read 0.20% mgen mgen [.] rand_buf_init 0.01% mgen [kernel.vmlinux] [k] apic_timer_interrupt 0.00% mgen mgen [.] last_free_elem 0.00% mgen libc-2.23.so [.] __random_r 0.00% mgen libc-2.23.so [.] _int_malloc 0.00% mgen mgen [.] rand_array_init 0.00% mgen [kernel.vmlinux] [k] page_fault 0.00% mgen libc-2.23.so [.] __random 0.00% mgen libc-2.23.so [.] __strcasestr 0.00% mgen ld-2.23.so [.] strcmp 0.00% mgen ld-2.23.so [.] _dl_start 0.00% mgen libc-2.23.so [.] sched_setaffinity@@GLIBC_2.3.4 0.00% mgen ld-2.23.so [.] _start We can see kernel symbols apic_timer_interrupt and page_fault. After patch applied: 99.79% mgen mgen [.] buf_read 0.19% mgen mgen [.] rand_buf_init 0.00% mgen libc-2.23.so [.] __random_r 0.00% mgen mgen [.] rand_array_init 0.00% mgen mgen [.] last_free_elem 0.00% mgen libc-2.23.so [.] vfprintf 0.00% mgen libc-2.23.so [.] rand 0.00% mgen libc-2.23.so [.] __random 0.00% mgen libc-2.23.so [.] _int_malloc 0.00% mgen libc-2.23.so [.] _IO_doallocbuf 0.00% mgen ld-2.23.so [.] do_lookup_x 0.00% mgen ld-2.23.so [.] open_verify.constprop.7 0.00% mgen ld-2.23.so [.] _dl_important_hwcaps 0.00% mgen libc-2.23.so [.] sched_setaffinity@@GLIBC_2.3.4 0.00% mgen ld-2.23.so [.] _start There are only userspace symbols. Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Cc: jolsa@kernel.org Cc: kan.liang@intel.com Cc: mark.rutland@arm.com Cc: will.deacon@arm.com Cc: yao.jin@intel.com Link: http://lkml.kernel.org/r/1495706947-3744-1-git-send-email-yao.jin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-07ftrace/kprobes: selftests: Check kretprobe maxactive is supportedMasami Hiramatsu
Check the kretprobe maxactive is supported by kprobe_events interface. To ensure the kernel feature, this changes ftrace README to describe it. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
2017-06-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Just some simple overlapping changes in marvell PHY driver and the DSA core code. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle"Rafael J. Wysocki
Revert commit eed4d47efe95 (ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle) as it turned out to be premature and triggered a number of different issues on various systems. That includes, but is not limited to, premature suspend-to-RAM aborts on Dell XPS 13 (9343) reported by Dominik. The issue the commit in question attempted to address is real and will need to be taken care of going forward, but evidently more work is needed for this purpose. Reported-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-06bpf: cgroup skb progs cannot access ld_abs/indDaniel Borkmann
Commit fb9a307d11d6 ("bpf: Allow CGROUP_SKB eBPF program to access sk_buff") enabled programs of BPF_PROG_TYPE_CGROUP_SKB type to use ld_abs/ind instructions. However, at this point, we cannot use them, since offsets relative to SKF_LL_OFF will end up pointing skb_mac_header(skb) out of bounds since in the egress path it is not yet set at that point in time, but only after __dev_queue_xmit() did a general reset on the mac header. bpf_internal_load_pointer_neg_helper() will then end up reading data from a wrong offset. BPF_PROG_TYPE_CGROUP_SKB programs can use bpf_skb_load_bytes() already to access packet data, which is also more flexible than the insns carried over from cBPF. Fixes: fb9a307d11d6 ("bpf: Allow CGROUP_SKB eBPF program to access sk_buff") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Chenbo Feng <fengc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06bpf: Add BPF_OBJ_GET_INFO_BY_FDMartin KaFai Lau
A single BPF_OBJ_GET_INFO_BY_FD cmd is used to obtain the info for both bpf_prog and bpf_map. The kernel can figure out the fd is associated with a bpf_prog or bpf_map. The suggested struct bpf_prog_info and struct bpf_map_info are not meant to be a complete list and it is not the goal of this patch. New fields can be added in the future patch. The focus of this patch is to create the interface, BPF_OBJ_GET_INFO_BY_FD cmd for exposing the bpf_prog's and bpf_map's info. The obj's info, which will be extended (and get bigger) over time, is separated from the bpf_attr to avoid bloating the bpf_attr. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06bpf: Add BPF_MAP_GET_FD_BY_IDMartin KaFai Lau
Add BPF_MAP_GET_FD_BY_ID command to allow user to get a fd from a bpf_map's ID. bpf_map_inc_not_zero() is added and is called with map_idr_lock held. __bpf_map_put() is also added which has the 'bool do_idr_lock' param to decide if the map_idr_lock should be acquired when freeing the map->id. In the error path of bpf_map_inc_not_zero(), it may have to call __bpf_map_put(map, false) which does not need to take the map_idr_lock when freeing the map->id. It is currently limited to CAP_SYS_ADMIN which we can consider to lift it in followup patches. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06bpf: Add BPF_PROG_GET_FD_BY_IDMartin KaFai Lau
Add BPF_PROG_GET_FD_BY_ID command to allow user to get a fd from a bpf_prog's ID. bpf_prog_inc_not_zero() is added and is called with prog_idr_lock held. __bpf_prog_put() is also added which has the 'bool do_idr_lock' param to decide if the prog_idr_lock should be acquired when freeing the prog->id. In the error path of bpf_prog_inc_not_zero(), it may have to call __bpf_prog_put(map, false) which does not need to take the prog_idr_lock when freeing the prog->id. It is currently limited to CAP_SYS_ADMIN which we can consider to lift it in followup patches. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06bpf: Add BPF_(PROG|MAP)_GET_NEXT_ID commandMartin KaFai Lau
This patch adds BPF_PROG_GET_NEXT_ID and BPF_MAP_GET_NEXT_ID to allow userspace to iterate all bpf_prog IDs and bpf_map IDs. The API is trying to be consistent with the existing BPF_MAP_GET_NEXT_KEY. It is currently limited to CAP_SYS_ADMIN which we can consider to lift it in followup patches. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06bpf: Introduce bpf_map IDMartin KaFai Lau
This patch generates an unique ID for each created bpf_map. The approach is similar to the earlier patch for bpf_prog ID. It is worth to note that the bpf_map's ID and bpf_prog's ID are in two independent ID spaces and both have the same valid range: [1, INT_MAX). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>