summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2017-06-27block: move bounce declarations to block/blk.hChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27nvme: add support for streams and directivesJens Axboe
This adds support for Directives in NVMe, particular for the Streams directive. Support for Directives is a new feature in NVMe 1.3. It allows a user to pass in information about where to store the data, so that it the device can do so most effiently. If an application is managing and writing data with different life times, mixing differently retentioned data onto the same locations on flash can cause write amplification to grow. This, in turn, will reduce performance and life time of the device. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27blk-mq: expose write hints through debugfsJens Axboe
Useful to verify that things are working the way they should. Reading the file will return number of kb written with each write hint. Writing the file will reset the statistics. No care is taken to ensure that we don't race on updates. Drivers will write to q->write_hints[] if they handle a given write hint. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27block: add support for write hints in a bioJens Axboe
No functional changes in this patch, we just use up some holes in the bio and request structures to define a write hint that we psas down the stack. Ensure that we don't merge requests that have different life time hints assigned to them, and that we inherit the write hint when cloning a bio. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27fs: add fcntl() interface for setting/getting write life time hintsJens Axboe
Define a set of write life time hints: RWH_WRITE_LIFE_NOT_SET No hint information set RWH_WRITE_LIFE_NONE No hints about write life time RWH_WRITE_LIFE_SHORT Data written has a short life time RWH_WRITE_LIFE_MEDIUM Data written has a medium life time RWH_WRITE_LIFE_LONG Data written has a long life time RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time The intent is for these values to be relative to each other, no absolute meaning should be attached to these flag names. Add an fcntl interface for querying these flags, and also for setting them as well: F_GET_RW_HINT Returns the read/write hint set on the underlying inode. F_SET_RW_HINT Set one of the above write hints on the underlying inode. F_GET_FILE_RW_HINT Returns the read/write hint set on the file descriptor. F_SET_FILE_RW_HINT Set one of the above write hints on the file descriptor. The user passes in a 64-bit pointer to get/set these values, and the interface returns 0/-1 on success/error. Sample program testing/implementing basic setting/getting of write hints is below. Add support for storing the write life time hint in the inode flags and in struct file as well, and pass them to the kiocb flags. If both a file and its corresponding inode has a write hint, then we use the one in the file, if available. The file hint can be used for sync/direct IO, for buffered writeback only the inode hint is available. This is in preparation for utilizing these hints in the block layer, to guide on-media data placement. /* * writehint.c: get or set an inode write hint */ #include <stdio.h> #include <fcntl.h> #include <stdlib.h> #include <unistd.h> #include <stdbool.h> #include <inttypes.h> #ifndef F_GET_RW_HINT #define F_LINUX_SPECIFIC_BASE 1024 #define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11) #define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12) #endif static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE", "RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM", "RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" }; int main(int argc, char *argv[]) { uint64_t hint; int fd, ret; if (argc < 2) { fprintf(stderr, "%s: file <hint>\n", argv[0]); return 1; } fd = open(argv[1], O_RDONLY); if (fd < 0) { perror("open"); return 2; } if (argc > 2) { hint = atoi(argv[2]); ret = fcntl(fd, F_SET_RW_HINT, &hint); if (ret < 0) { perror("fcntl: F_SET_RW_HINT"); return 4; } } ret = fcntl(fd, F_GET_RW_HINT, &hint); if (ret < 0) { perror("fcntl: F_GET_RW_HINT"); return 3; } printf("%s: hint %s\n", argv[1], str[hint]); close(fd); return 0; } Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27tty: add function to convert device name to numberOkash Khawaja
The function converts strings like ttyS0 and ttyUSB0 to dev_t like (4, 64) and (188, 0). It does this by scanning tty_drivers list for corresponding device name and index. If the driver is not registered, this function returns -ENODEV. It also acquires tty_mutex. Signed-off-by: Okash Khawaja <okash.khawaja@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-27x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERFLen Brown
The goal of this change is to give users a uniform and meaningful result when they read /sys/...cpufreq/scaling_cur_freq on modern x86 hardware, as compared to what they get today. Modern x86 processors include the hardware needed to accurately calculate frequency over an interval -- APERF, MPERF, and the TSC. Here we provide an x86 routine to make this calculation on supported hardware, and use it in preference to any driver driver-specific cpufreq_driver.get() routine. MHz is computed like so: MHz = base_MHz * delta_APERF / delta_MPERF MHz is the average frequency of the busy processor over a measurement interval. The interval is defined to be the time between successive invocations of aperfmperf_khz_on_cpu(), which are expected to to happen on-demand when users read sysfs attribute cpufreq/scaling_cur_freq. As with previous methods of calculating MHz, idle time is excluded. base_MHz above is from TSC calibration global "cpu_khz". This x86 native method to calculate MHz returns a meaningful result no matter if P-states are controlled by hardware or firmware and/or if the Linux cpufreq sub-system is or is-not installed. When this routine is invoked more frequently, the measurement interval becomes shorter. However, the code limits re-computation to 10ms intervals so that average frequency remains meaningful. Discerning users are encouraged to take advantage of the turbostat(8) utility, which can gracefully handle concurrent measurement intervals of arbitrary length. Signed-off-by: Len Brown <len.brown@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-26Revert "ktime: Simplify ktime_compare implementation"Thomas Gleixner
Thierry bisected boot failures to this simplification commit. Reverts: 3f1d472055bb ("ktime: Simplify ktime_compare implementation") Reported-by: Thierry Reding <thierry.reding@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mariusz Skamra <mariuszx.skamra@intel.com>
2017-06-26Merge tag 'iio-for-4.13b' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-next Jonathan writes: Second set of IIO new device support, features and cleanups for the 4.13 cycle. A few reverts here. One was a general failure to notice a device was already supported by another driver. The second is due to a review comment pointing out that the original patch was a bad idea and would break existing systems. Reverts * bma180 - Revert addition of support for the BMA250E it is already supported by the bmc150-accel and better supported at that. Oops. * hi8435 - The fix for cleanup of the reset gpio stuff isn't a good way to go. It breaks systems where an inverting level convertor is used. The right fix is to make the original devicetree correct - even if it involves patching the devicetree in kernel. New Device Support * stm32-adc - STM32H7 support and bindings. Features * core - add a hardware triggered operating mode for systems in which the actual trigger is never seen by the kernel. This is typically only used when a device 'can' use other triggers, but if a particular magic one is enabled the interrupt is effectively handled in hardware and we never see it. * st-lsm6dsx - support active low interrupts. * stm32-adc - Make the core adc clock optional as not all hardware supported requires it. - Make the bus clock optional in the per instance driver as it may be shared by all instances of the ADC and is handled by the core. - Rework to have a data structure representing the device type specific elements. * stm32-trigger (and counter) - Use the INDIO_HARDWARE_TRIGGERED_MODE where appropriate. - Add an attribute to configure device modes for quadrature counting etc. Clean ups and minor fixes * IIO core. - use __sysfs_match_string() helper rather than open coding the same. * ad7791 - use sysfs_match_string() helper rather than open coding the same. * aspeed-adc - handle return value of clk_prepare_enable * cpcap - Fix default register values and ensure the battery thermistor is enabled correctly. - Fix the reported die temperature where we can - docs are lacking. - Remove the hung interrupt quirk as no longer happens due to fix in the mfd driver. * hi8435 - Remove &s from hi8435_info definition as unneeded and inconsistent. * hid-sensor-trgger - Add kconfig depends on IIO_BUFFER (fixes patch in previous series) * ina2xx - Make the use of iio_info_mask* elements consistent for all channels. This doesn't have any visible effect, but acts as clear documentation of which channels various resulting attributes apply to. * lpc32xx - handle the return value of clk_prepare_enable. * meson-saradc - NULL instead of 0 for pointer. * mma9551 - use NULL for GPIO connection ID to aid implementation fo ACPI support. Here the connection ID doesn't actually tell us anything and it is much easier to deal with the driver if it's not there. * mpu6050 - Fix lock issues through use of a local mux. - Replace sprintf with scnprintf as appropriate. - Check whoami against all known values. This allows for a small number of boards where we are really fishing for the part not being present at all. It is unfortunately common to have undescribed changes to use newer chips. We paper over this but just emitting a warning for those cases as long as we know about. * mxs-lradc - Fix some non static warnings. * rcar-adc - Part of making the naming for this part consistent across the kernel. * st_accel - drop some spi_device_id entries for variants with no SPI support * st_magn - drop some spi_device_id entries for variants with no SPI support. * sx9500 - Use devm_gpiod_get instead of indexed value with an index of 0 on all occasions. * twl4030 - Drop unused twl4030_get_madc_conversion as callers removed now throughout kernel. - Unexport twl4030_madc_conversion() as no used only within this driver. - Drop twl4030_madc_user_params as not used now. - Drop twl4030_madc_request.func_cb as not used now. - Fold the twl4030-madc.h header into the driver as no longer used anywhere else in the kernel. * xilinx - Handle the return value of clk_prepare_enable
2017-06-25Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "A few fixes for timekeeping and timers: - Plug a subtle race due to a missing READ_ONCE() in the timekeeping code where reloading of a pointer results in an inconsistent callback argument being supplied to the clocksource->read function. - Correct the CLOCK_MONOTONIC_RAW sub-nanosecond accounting in the time keeping core code, to prevent a possible discontuity. - Apply a similar fix to the arm64 vdso clock_gettime() implementation - Add missing includes to clocksource drivers, which relied on indirect includes which fails in certain configs. - Use the proper iomem pointer for read/iounmap in a probe function" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arm64/vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting time: Fix clock->read(clock) race around clocksource changes clocksource: Explicitly include linux/clocksource.h when needed clocksource/drivers/arm_arch_timer: Fix read and iounmap of incorrect variable
2017-06-25tty: define tty_open_by_driver when CONFIG_TTY is not definedOkash Khawaja
This patch adds definition of tty_open_by_driver when CONFIG_TTY is not defined. This was supposed to have been included in commit 12e84c71b7d4ee38d51377fd494ac748ee4e6912 ("tty: export tty_open_by_driver"). The patch follows convention for other such functions and returns NULL. Signed-off-by: Okash Khawaja <okash.khawaja@gmail.com> Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-24genirq/timings: Add infrastructure for estimating the next interrupt arrival ↵Daniel Lezcano
time An interrupt behaves with a burst of activity with periodic interval of time followed by one or two peaks of longer interval. As the time intervals are periodic, statistically speaking they follow a normal distribution and each interrupts can be tracked individually. Add a mechanism to compute the statistics on all interrupts, except the timers which are deterministic from a prediction point of view, as their expiry time is known. The goal is to extract the periodicity for each interrupt, with the last timestamp and sum them, so the next event can be predicted to a certain extent. Taking the earliest prediction gives the expected wakeup on the system (assuming a timer won't expire before). Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Hannes Reinecke <hare@suse.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Link: http://lkml.kernel.org/r/1498227072-5980-2-git-send-email-daniel.lezcano@linaro.org
2017-06-24genirq/timings: Add infrastructure to track the interrupt timingsDaniel Lezcano
The interrupt framework gives a lot of information about each interrupt. It does not keep track of when those interrupts occur though, which is a prerequisite for estimating the next interrupt arrival for power management purposes. Add a mechanism to record the timestamp for each interrupt occurrences in a per-CPU circular buffer to help with the prediction of the next occurrence using a statistical model. Each CPU can store up to IRQ_TIMINGS_SIZE events <irq, timestamp>, the current value of IRQ_TIMINGS_SIZE is 32. Each event is encoded into a single u64, where the high 48 bits are used for the timestamp and the low 16 bits are for the irq number. A static key is introduced so when the irq prediction is switched off at runtime, the overhead is near to zero. It results in most of the code in internals.h for inline reasons and a very few in the new file timings.c. The latter will contain more in the next patch which will provide the statistical model for the next event prediction. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Hannes Reinecke <hare@suse.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Link: http://lkml.kernel.org/r/1498227072-5980-1-git-send-email-daniel.lezcano@linaro.org
2017-06-24Merge branch 'linus' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24PM / OPP: Add dev_pm_opp_{set|put}_clkname()Viresh Kumar
In order to support OPP switching, OPP layer needs to get pointer to the clock for the device. Simple cases work fine without using the routines added by this patch (i.e. by passing connection-id as NULL), but for a device with multiple clocks available, the OPP core needs to know the exact name of the clk to use. Add a new set of APIs to get that done. Tested-by: Rajendra Nayak <rnayak@codeaurora.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-23slub: make sysfs file removal asynchronousTejun Heo
Commit bf5eb3de3847 ("slub: separate out sysfs_slab_release() from sysfs_slab_remove()") made slub sysfs file removals synchronous to kmem_cache shutdown. Unfortunately, this created a possible ABBA deadlock between slab_mutex and sysfs draining mechanism triggering the following lockdep warning. ====================================================== [ INFO: possible circular locking dependency detected ] 4.10.0-test+ #48 Not tainted ------------------------------------------------------- rmmod/1211 is trying to acquire lock: (s_active#120){++++.+}, at: [<ffffffff81308073>] kernfs_remove+0x23/0x40 but task is already holding lock: (slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (slab_mutex){+.+.+.}: lock_acquire+0xf6/0x1f0 __mutex_lock+0x75/0x950 mutex_lock_nested+0x1b/0x20 slab_attr_store+0x75/0xd0 sysfs_kf_write+0x45/0x60 kernfs_fop_write+0x13c/0x1c0 __vfs_write+0x28/0x120 vfs_write+0xc8/0x1e0 SyS_write+0x49/0xa0 entry_SYSCALL_64_fastpath+0x1f/0xc2 -> #0 (s_active#120){++++.+}: __lock_acquire+0x10ed/0x1260 lock_acquire+0xf6/0x1f0 __kernfs_remove+0x254/0x320 kernfs_remove+0x23/0x40 sysfs_remove_dir+0x51/0x80 kobject_del+0x18/0x50 __kmem_cache_shutdown+0x3e6/0x460 kmem_cache_destroy+0x1fb/0x2d0 kvm_exit+0x2d/0x80 [kvm] vmx_exit+0x19/0xa1b [kvm_intel] SyS_delete_module+0x198/0x1f0 entry_SYSCALL_64_fastpath+0x1f/0xc2 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(slab_mutex); lock(s_active#120); lock(slab_mutex); lock(s_active#120); *** DEADLOCK *** 2 locks held by rmmod/1211: #0: (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a7877>] get_online_cpus+0x37/0x80 #1: (slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0 stack backtrace: CPU: 3 PID: 1211 Comm: rmmod Not tainted 4.10.0-test+ #48 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 Call Trace: print_circular_bug+0x1be/0x210 __lock_acquire+0x10ed/0x1260 lock_acquire+0xf6/0x1f0 __kernfs_remove+0x254/0x320 kernfs_remove+0x23/0x40 sysfs_remove_dir+0x51/0x80 kobject_del+0x18/0x50 __kmem_cache_shutdown+0x3e6/0x460 kmem_cache_destroy+0x1fb/0x2d0 kvm_exit+0x2d/0x80 [kvm] vmx_exit+0x19/0xa1b [kvm_intel] SyS_delete_module+0x198/0x1f0 ? SyS_delete_module+0x5/0x1f0 entry_SYSCALL_64_fastpath+0x1f/0xc2 It'd be the cleanest to deal with the issue by removing sysfs files without holding slab_mutex before the rest of shutdown; however, given the current code structure, it is pretty difficult to do so. This patch punts sysfs file removal to a work item. Before commit bf5eb3de3847, the removal was punted to a RCU delayed work item which is executed after release. Now, we're punting to a different work item on shutdown which still maintains the goal removing the sysfs files earlier when destroying kmem_caches. Link: http://lkml.kernel.org/r/20170620204512.GI21326@htj.duckdns.org Fixes: bf5eb3de3847 ("slub: separate out sysfs_slab_release() from sysfs_slab_remove()") Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-23soc: actions: owl-sps: Factor out owl_sps_set_pg() for power-gatingAndreas Färber
Allow the SMP code to reuse PM domain code for CPU2/CPU3 wakeup. Signed-off-by: Andreas Färber <afaerber@suse.de>
2017-06-22Merge commit '8e8320c9315c' into for-4.13/blockJens Axboe
Pull in the fix for shared tags, as it conflicts with the pending changes in for-4.13/block. We already pulled in v4.12-rc5 to solve other conflicts or get fixes that went into 4.12, so not a lot of changes in this merge. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-22genirq/irqdomain: Remove auto-recursive hierarchy supportMarc Zyngier
It did seem like a good idea at the time, but it never really caught on, and auto-recursive domains remain unused 3 years after having been introduced. Oh well, time for a late spring cleanup. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-22genirq/irqdomain: Add irq_domain_update_bus_token helperMarc Zyngier
We can have irq domains that are identified by the same fwnode (because they are serviced by the same HW), and yet have different functionnality (because they serve different busses, for example). This is what we use the bus_token field. Since we don't use this field when generating the domain name, all the aliasing domains will get the same name, and the debugfs file creation fails. Also, bus_token is updated by individual drivers, and the core code is unaware of that update. In order to sort this mess, let's introduce a helper that takes care of updating bus_token, and regenerate the debugfs file. A separate patch will update all the individual users. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-22genirq: Introduce IRQD_SINGLE_TARGET flagThomas Gleixner
Many interrupt chips allow only a single CPU as interrupt target. The core code has no knowledge about that. That's unfortunate as it could avoid trying to readd a newly online CPU to the effective affinity mask. Add the status flag and the necessary accessors. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20170619235447.352343969@linutronix.de
2017-06-22genirq/cpuhotplug: Handle managed IRQs on CPU hotplugThomas Gleixner
If a CPU goes offline, interrupts affine to the CPU are moved away. If the outgoing CPU is the last CPU in the affinity mask the migration code breaks the affinity and sets it it all online cpus. This is a problem for affinity managed interrupts as CPU hotplug is often used for power management purposes. If the affinity is broken, the interrupt is not longer affine to the CPUs to which it was allocated. The affinity spreading allows to lay out multi queue devices in a way that they are assigned to a single CPU or a group of CPUs. If the last CPU goes offline, then the queue is not longer used, so the interrupt can be shutdown gracefully and parked until one of the assigned CPUs comes online again. Add a graceful shutdown mechanism into the irq affinity breaking code path, mark the irq as MANAGED_SHUTDOWN and leave the affinity mask unmodified. In the online path, scan the active interrupts for managed interrupts and if the interrupt is functional and the newly online CPU is part of the affinity mask, restart the interrupt if it is marked MANAGED_SHUTDOWN or if the interrupts is started up, try to add the CPU back to the effective affinity mask. Originally-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170619235447.273417334@linutronix.de
2017-06-22genirq: Handle managed irqs gracefully in irq_startup()Thomas Gleixner
Affinity managed interrupts should keep their assigned affinity accross CPU hotplug. To avoid magic hackery in device drivers, the core code shall manage them transparently and set these interrupts into a managed shutdown state when the last CPU of the assigned affinity mask goes offline. The interrupt will be restarted when one of the CPUs in the assigned affinity mask comes back online. Add the necessary logic to irq_startup(). If an interrupt is requested and started up, the code checks whether it is affinity managed and if so, it checks whether a CPU in the interrupts affinity mask is online. If not, it puts the interrupt into managed shutdown state. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20170619235447.189851170@linutronix.de
2017-06-22genirq: Introduce IRQD_MANAGED_SHUTDOWNThomas Gleixner
Affinity managed interrupts should keep their assigned affinity accross CPU hotplug. To avoid magic hackery in device drivers, the core code shall manage them transparently. This will set these interrupts into a managed shutdown state when the last CPU of the assigned affinity mask goes offline. The interrupt will be restarted when one of the CPUs in the assigned affinity mask comes back online. Introduce the necessary state flag and the accessor functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20170619235446.954523476@linutronix.de
2017-06-22genirq: Introduce effective affinity maskThomas Gleixner
There is currently no way to evaluate the effective affinity mask of a given interrupt. Many irq chips allow only a single target CPU or a subset of CPUs in the affinity mask. Updating the mask at the time of setting the affinity to the subset would be counterproductive because information for cpu hotplug about assigned interrupt affinities gets lost. On CPU hotplug it's also pointless to force migrate an interrupt, which is not targeted at the CPU effectively. But currently the information is not available. Provide a seperate mask to be updated by the irq_chip->irq_set_affinity() implementations. Implement the read only proc files so the user can see the effective mask as well w/o trying to deduce it from /proc/interrupts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20170619235446.247834245@linutronix.de
2017-06-22genirq: Move irq_fixup_move_pending() to coreThomas Gleixner
Now that x86 uses the generic code, the function declaration and inline stub can move to the core internal header. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20170619235445.928156166@linutronix.de
2017-06-22genirq/cpuhotplug: Add support for cleaning up move in progressThomas Gleixner
In order to move x86 to the generic hotplug migration code, add support for cleaning up move in progress bits. On architectures which have this x86 specific (mis)feature not enabled, this is optimized out by the compiler. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20170619235445.525817311@linutronix.de
2017-06-22genirq: Provide irq_fixup_move_pending()Thomas Gleixner
If an CPU goes offline, the interrupts are migrated away, but a eventually pending interrupt move, which has not yet been made effective is kept pending even if the outgoing CPU is the sole target of the pending affinity mask. What's worse is, that the pending affinity mask is discarded even if it would contain a valid subset of the online CPUs. Implement a helper function which allows to avoid these issues. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20170619235444.691345468@linutronix.de
2017-06-22genirq: Add missing comment for IRQD_STARTEDThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20170619235444.614913014@linutronix.de
2017-06-22genirq/debugfs: Add proper debugfs interfaceThomas Gleixner
Debugging (hierarchical) interupt domains is tedious as there is no information about the hierarchy and no information about states of interrupts in the various domain levels. Add a debugfs directory 'irq' and subdirectories 'domains' and 'irqs'. The domains directory contains the domain files. The content is information about the domain. If the domain is part of a hierarchy then the parent domains are printed as well. # ls /sys/kernel/debug/irq/domains/ default INTEL-IR-2 INTEL-IR-MSI-2 IO-APIC-IR-2 PCI-MSI DMAR-MSI INTEL-IR-3 INTEL-IR-MSI-3 IO-APIC-IR-3 unknown-1 INTEL-IR-0 INTEL-IR-MSI-0 IO-APIC-IR-0 IO-APIC-IR-4 VECTOR INTEL-IR-1 INTEL-IR-MSI-1 IO-APIC-IR-1 PCI-HT # cat /sys/kernel/debug/irq/domains/VECTOR name: VECTOR size: 0 mapped: 216 flags: 0x00000041 # cat /sys/kernel/debug/irq/domains/IO-APIC-IR-0 name: IO-APIC-IR-0 size: 24 mapped: 19 flags: 0x00000041 parent: INTEL-IR-3 name: INTEL-IR-3 size: 65536 mapped: 167 flags: 0x00000041 parent: VECTOR name: VECTOR size: 0 mapped: 216 flags: 0x00000041 Unfortunately there is no per cpu information about the VECTOR domain (yet). The irqs directory contains detailed information about mapped interrupts. # cat /sys/kernel/debug/irq/irqs/3 handler: handle_edge_irq status: 0x00004000 istate: 0x00000000 ddepth: 1 wdepth: 0 dstate: 0x01018000 IRQD_IRQ_DISABLED IRQD_SINGLE_TARGET IRQD_MOVE_PCNTXT node: 0 affinity: 0-143 effectiv: 0 pending: domain: IO-APIC-IR-0 hwirq: 0x3 chip: IR-IO-APIC flags: 0x10 IRQCHIP_SKIP_SET_WAKE parent: domain: INTEL-IR-3 hwirq: 0x20000 chip: INTEL-IR flags: 0x0 parent: domain: VECTOR hwirq: 0x3 chip: APIC flags: 0x0 This was developed to simplify the debugging of the managed affinity changes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20170619235444.537566163@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-22genirq/irqdomain: Add map counterThomas Gleixner
Add a map counter instead of counting radix tree entries for diagnosis. That also gives correct information for linear domains. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20170619235444.459397746@linutronix.de
2017-06-22genirq: Allow fwnode to carry name information onlyThomas Gleixner
In order to provide proper debug interface it's required to have domain names available when the domain is added. Non fwnode based architectures like x86 have no way to do so. It's not possible to use domain ops or host data for this as domain ops might be the same for several instances, but the names have to be unique. Extend the irqchip fwnode to allow transporting the domain name. If no node is supplied, create a 'unknown-N' placeholder. Warn if an invalid node is supplied and treat it like no node. This happens e.g. with i2 devices on x86 which hand in an ACPI type node which has no interface for retrieving the name. [ Folded a fix from Marc to make DT name parsing work ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20170619235443.588784933@linutronix.de
2017-06-22Merge branch 'uuid-types'Rafael J. Wysocki
Merge branch 'uuid-types' from git://git.infradead.org/users/hch/uuid.git to satisfy dependencies.
2017-06-22sched/loadavg: Generalize "_idle" naming to "_nohz"Frederic Weisbecker
The loadavg naming code still assumes that nohz == idle whereas its code is actually handling well both nohz idle and nohz full. So lets fix the naming according to what the code actually does, to unconfuse the reader. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1497838322-10913-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-22Merge branch 'linus' into x86/mm, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-22Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-21Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "This contains a set of fixes for xen-blkback by way of Konrad, and a performance regression fix for blk-mq for shared tags. The latter could account for as much as a 50x reduction in performance, with the test case from the user with 500 name spaces. A more realistic setup on my end with 32 drives showed a 3.5x drop. The fix has been thoroughly tested before being committed" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: fix performance regression with shared tags xen-blkback: don't leak stack data via response ring xen/blkback: don't use xen_blkif_get() in xen-blkback kthread xen/blkback: don't free be structure too early xen/blkback: fix disconnect while I/Os in flight
2017-06-21blk-mq: Make it safe to quiesce and unquiesce from an interrupt handlerBart Van Assche
Since blk_mq_quiesce_queue_nowait() can be called from interrupt context, make this safe. Since this function is not in the hot path, uninline it. Fixes: commit f4560ffe8cec ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21blk-mq: fix performance regression with shared tagsJens Axboe
If we have shared tags enabled, then every IO completion will trigger a full loop of every queue belonging to a tag set, and every hardware queue for each of those queues, even if nothing needs to be done. This causes a massive performance regression if you have a lot of shared devices. Instead of doing this huge full scan on every IO, add an atomic counter to the main queue that tracks how many hardware queues have been marked as needing a restart. With that, we can avoid looking for restartable queues, if we don't have to. Max reports that this restores performance. Before this patch, 4K IOPS was limited to 22-23K IOPS. With the patch, we are running at 950-970K IOPS. Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared") Reported-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Tested-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21irq/generic-chip: Provide devm_irq_setup_generic_chip()Bartosz Golaszewski
Provide a resource managed variant of irq_setup_generic_chip(). Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: linux-doc@vger.kernel.org Cc: Jonathan Corbet <corbet@lwn.net> Link: http://lkml.kernel.org/r/1496246820-13250-6-git-send-email-brgl@bgdev.pl
2017-06-21irq/generic-chip: Provide devm_irq_alloc_generic_chip()Bartosz Golaszewski
Provide a resource managed variant of irq_alloc_generic_chip(). Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: linux-doc@vger.kernel.org Cc: Jonathan Corbet <corbet@lwn.net> Link: http://lkml.kernel.org/r/1496246820-13250-5-git-send-email-brgl@bgdev.pl
2017-06-21irq/generic-chip: Provide irq_destroy_generic_chip()Bartosz Golaszewski
Most users of irq_alloc_generic_chip() call irq_setup_generic_chip() too. To simplify the cleanup provide a function that both removes a generic chip and frees its memory. Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: linux-doc@vger.kernel.org Cc: Jonathan Corbet <corbet@lwn.net> Link: http://lkml.kernel.org/r/1496246820-13250-3-git-send-email-brgl@bgdev.pl
2017-06-21irq/generic-chip: Provide irq_free_generic_chip()Bartosz Golaszewski
Currently there's no way for users of irq_alloc_generic_chip() to free the allocated memory other than calling kfree() manually on the returned pointer. This may lead to errors if the internals of irq_alloc_generic_chip() ever change. Provide a routine to free the generic chip. Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: linux-doc@vger.kernel.org Cc: Jonathan Corbet <corbet@lwn.net> Link: http://lkml.kernel.org/r/1496246820-13250-2-git-send-email-brgl@bgdev.pl
2017-06-21Merge branch 'fortglx/4.13/time' of ↵Thomas Gleixner
https://git.linaro.org/people/john.stultz/linux into timers/core Merge time(keeping) updates from John Stultz: "Just a small set of changes, the biggest changes being the MONOTONIC_RAW handling cleanup, and a new kselftest from Miroslav. Also a a clear warning deprecating CONFIG_GENERIC_TIME_VSYSCALL_OLD, which affects ppc and ia64."
2017-06-21Merge branch 'timers/urgent' into timers/coreThomas Gleixner
Pick up dependent changes.
2017-06-20time: Clean up CLOCK_MONOTONIC_RAW time handlingJohn Stultz
Now that we fixed the sub-ns handling for CLOCK_MONOTONIC_RAW, remove the duplicitive tk->raw_time.tv_nsec, which can be stored in tk->tkr_raw.xtime_nsec (similarly to how its handled for monotonic time). Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Daniel Mentz <danielmentz@google.com> Tested-by: Daniel Mentz <danielmentz@google.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-06-20block: Add a comment above queue_lockdep_assert_held()Bart Van Assche
Add a comment above the queue_lockdep_assert_held() macro that explains the purpose of the q->queue_lock test. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20block: Introduce request_queue.initialize_rq_fn()Bart Van Assche
Several block drivers need to initialize the driver-private request data after having called blk_get_request() and before .prep_rq_fn() is called, e.g. when submitting a REQ_OP_SCSI_* request. Avoid that that initialization code has to be repeated after every blk_get_request() call by adding new callback functions to struct request_queue and to struct blk_mq_ops. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20block: Make request operation type argument declarations consistentBart Van Assche
Instead of declaring the second argument of blk_*_get_request() as int and passing it to functions that expect an unsigned int, declare that second argument as unsigned int. Also because of consistency, rename that second argument from 'rw' into 'op'. This patch does not change any functionality. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20blk-mq: Reduce blk_mq_hw_ctx sizeBart Van Assche
Since the srcu structure is rather large (184 bytes on an x86-64 system with kernel debugging disabled), only allocate it if needed. Reported-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>