linux.git - Linus' kernel tree

Age	Commit message	Author
2015-08-20	pmem, dax: have direct_access use __pmem annotation	Ross Zwisler
	Update the annotation for the kaddr pointer returned by direct_access() so that it is a __pmem pointer. This is consistent with the PMEM driver and with how this direct_access() pointer is used in the DAX code. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-20	pmem: add copy_from_iter_pmem() and clear_pmem()	Ross Zwisler
	Add support for two new PMEM APIs, copy_from_iter_pmem() and clear_pmem(). copy_from_iter_pmem() is used to copy data from an iterator into a PMEM buffer. clear_pmem() zeros a PMEM memory range. Both of these new APIs must be explicitly ordered using a wmb_pmem() function call and are implemented in such a way that the wmb_pmem() will make the stores to PMEM durable. Because both APIs are unordered they can be called as needed without introducing any unwanted memory barriers. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-20	pmem: remove layer when calling arch_has_wmb_pmem()	Ross Zwisler
	Prior to this change arch_has_wmb_pmem() was only called by arch_has_pmem_api(). Both arch_has_wmb_pmem() and arch_has_pmem_api() checked to make sure that CONFIG_ARCH_HAS_PMEM_API was enabled. Instead, remove the old arch_has_wmb_pmem() wrapper to be rid of one extra layer of indirection and the redundant CONFIG_ARCH_HAS_PMEM_API check. Rename __arch_has_wmb_pmem() to arch_has_wmb_pmem() since we no longer have a wrapper, and just have arch_has_pmem_api() call the architecture specific arch_has_wmb_pmem() directly. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-20	pmem, x86: move x86 PMEM API to new pmem.h header	Ross Zwisler
	Move the x86 PMEM API implementation out of asm/cacheflush.h and into its own header asm/pmem.h. This will allow members of the PMEM API to be more easily identified on this and other architectures. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-20	PCI: Add pci_scan_root_bus_msi()	Lorenzo Pieralisi
	Add a pci_scan_root_bus_msi() interface so an arch can specify the MSI controller up front. This removes the need for a pcibios callback to set the MSI controller later. This is not exported because I'd like to replace the variety of "scan root bus" interfaces with a single, more extensible interface that can handle the MSI controller, domain, pci_ops, resources, etc. I hope this interface is temporary. [bhelgaas: changelog, split into separate patch] Suggested-by: Russell King <linux@arm.linux.org.uk> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Jingoo Han <jingoohan1@gmail.com>
2015-08-20	kbuild: Fix .text.unlikely placement	Andi Kleen
	When building a kernel with .text.unlikely text the unlikely text for each translation unit was put next to the main .text code in the final vmlinux. The problem is that the linker doesn't allow more specific submatches of a section name in a different linker script statement after the main match. So we need to move them all into one line. With that change .text.unlikely is at the end of everything again. I also moved .text.hot into the same statement though, even though that's not strictly needed. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Michal Marek <mmarek@suse.com>
2015-08-20	xen/PMU: Intercept PMU-related MSR and APIC accesses	Boris Ostrovsky
	Provide interfaces for recognizing accesses to PMU-related MSRs and LVTPC APIC and process these accesses in Xen PMU code. (The interrupt handler performs XENPMU_flush right away in the beginning since no PMU emulation is available. It will be added with a later patch). Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen/PMU: Initialization code for Xen PMU	Boris Ostrovsky
	Map shared data structure that will hold CPU registers, VPMU context, V/PCPU IDs of the CPU interrupted by PMU interrupt. Hypervisor fills this information in its handler and passes it to the guest for further processing. Set up PMU VIRQ. Now that perf infrastructure will assume that PMU is available on a PV guest we need to be careful and make sure that accesses via RDPMC instruction don't cause fatal traps by the hypervisor. Provide a nop RDPMC handler. For the same reason avoid issuing a warning on a write to APIC's LVTPC. Both of these will be made functional in later patches. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen/PMU: Sysfs interface for setting Xen PMU mode	Boris Ostrovsky
	Set Xen's PMU mode via /sys/hypervisor/pmu/pmu_mode. Add XENPMU hypercall. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: xensyms support	Boris Ostrovsky
	Export Xen symbols to dom0 via /proc/xen/xensyms (similar to /proc/kallsyms). Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	mm: provide early_memremap_ro to establish read-only mapping	Juergen Gross
	During early boot as Xen pv domain the kernel needs to map some page tables supplied by the hypervisor read only. This is needed to be able to relocate some data structures conflicting with the physical memory map especially on systems with huge RAM (above 512GB). Provide the function early_memremap_ro() to provide this read only mapping. Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: sync with xen headers	Juergen Gross
	Use the newest headers from the xen tree to get some new structure layouts. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen/events: Support event channel rebind on ARM	Julien Grall
	Currently, the event channel rebind code is gated on the presence of the vector callback. The virtual interrupt controller on ARM has the concept of per-CPU interrupts (PPI), which allows us to support per-VCPU event channels. Therefore there is no need for the vector callback on ARM. Xen is already using a free PPI to notify the guest VCPU of an event. Furthermore, the Xen initialization code in Linux (see arch/arm/xen/enlighten.c) correctly requests a per-CPU IRQ. Introduce a new helper, xen_support_evtchn_rebind, to let the architecture decide whether rebinding an event channel is supported. It always returns true on ARM and keeps the same behavior on x86. This also allows us to drop the usage of xen_have_vector_callback entirely in the ARM code. Signed-off-by: Julien Grall <julien.grall@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	Merge branch 'perf/urgent' into perf/core, to pick up fixes before adding more changes	Ingo Molnar
	Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-20	fbdev: fix cea_modes array size	Tomi Valkeinen
	CEA defines 64 modes, indexed from 1 to 64. modedb has a cea_modes array, which contains 64 entries. However, the code uses the CEA indices directly, i.e. the first mode is at cea_modes[1]. This means the array is one too short. This does not cause references to uninitialized memory as the code in fbmon only allows indexes up to 63, and cea_modes does not contain an entry for mode 64, so it could not be used in any case. However, the code contains a check 'if (idx > ARRAY_SIZE(cea_modes)', and while that check is a no-op as at that point idx cannot be > 63, it upsets static checkers. Fix this by increasing the cea_modes array size to 65, and change the code to allow mode 64. Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
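The off-by-one described in this commit can be sketched in isolation. This is a minimal user-space model and not the fbdev code: `mode_table`, `CEA_MODE_MAX`, `struct mode`, and `lookup_mode()` are hypothetical names, and the table entries are placeholder data. It illustrates why a table indexed directly by codes 1 to 64 needs 65 slots, and why the bounds check compares with `>=`.

```c
#include <stddef.h>

#define CEA_MODE_MAX 64                 /* codes run from 1 to 64 */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct mode { int xres, yres; };        /* hypothetical, reduced mode record */

/* Indexed directly by code, so slot 0 is unused and 65 slots are needed. */
static const struct mode mode_table[CEA_MODE_MAX + 1] = {
	[1]  = { 640, 480 },            /* placeholder data */
	[4]  = { 1280, 720 },
	[64] = { 1920, 1080 },
};

/* Returns the entry for a code, or NULL if the code is out of range or unset. */
static const struct mode *lookup_mode(size_t idx)
{
	if (idx == 0 || idx >= ARRAY_SIZE(mode_table))
		return NULL;
	if (mode_table[idx].xres == 0)
		return NULL;
	return &mode_table[idx];
}
```

With only 64 slots, `mode_table[64]` would be one past the end of the array, which is the situation the commit removes.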
2015-08-20	dmaengine: Stricter legacy checking in dma_request_slave_channel_compat()	Geert Uytterhoeven
	dma_request_slave_channel_compat() is meant for drivers that support both DT and legacy platform device based probing: if DT channel DMA setup fails, it will fall back to platform data based DMA channel setup, using hardcoded DMA channel IDs and a filter function. However, if the DTS doesn't provide a "dmas" property for the device, the fallback is also used. If the legacy filter function is not hardcoded in the DMA slave driver, but comes from platform data, it will be NULL. Then dma_request_slave_channel_compat() will succeed incorrectly, and return a DMA channel, as a NULL legacy filter function actually means "all channels are OK", not "do not match". Later, when trying to use that DMA channel, it will fail with: rcar-dmac e6700000.dma-controller: rcar_dmac_prep_slave_sg: bad parameter: len=1, id=-22 To fix this, ensure that both the filter function and the DMA channel ID are not NULL before using the legacy fallback. Note that some DMA slave drivers can handle this failure, and will fall back to PIO. See also commit 056f6c87028544de ("dmaengine: shdma: Make dummy shdma_chan_filter() always return false"), which fixed the same issue for the case where shdma_chan_filter() is hardcoded in a DMA slave driver. Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
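The failure mode and the stricter check can be modelled in user space. The sketch below is not the dmaengine code: `struct chan`, `pool`, `legacy_request()`, `dt_request()`, `request_compat()`, and `match_id()` are all hypothetical stand-ins. It only shows why a NULL filter must not reach a lookup that treats NULL as "match everything".

```c
#include <stdbool.h>
#include <stddef.h>

struct chan { int id; };                        /* hypothetical channel record */
typedef bool (*filter_fn)(struct chan *c, void *param);

static struct chan pool[2] = { { 0 }, { 1 } };  /* hypothetical channel pool */

/* Legacy lookup: a NULL filter means "every channel matches". */
static struct chan *legacy_request(filter_fn fn, void *param)
{
	for (size_t i = 0; i < 2; i++)
		if (!fn || fn(&pool[i], param))
			return &pool[i];
	return NULL;
}

/* Stand-in for a DT lookup that fails (for example, no "dmas" property). */
static struct chan *dt_request(void)
{
	return NULL;
}

/* Stricter fallback: take the legacy path only with a filter and a parameter. */
static struct chan *request_compat(filter_fn fn, void *param)
{
	struct chan *c = dt_request();

	if (c)
		return c;
	if (!fn || !param)
		return NULL;
	return legacy_request(fn, param);
}

/* Example filter: match on a channel ID passed through param. */
static bool match_id(struct chan *c, void *param)
{
	return c->id == *(int *)param;
}
```

Without the `!fn || !param` guard, `request_compat(NULL, NULL)` would hand back the first channel in the pool, which mirrors the incorrect success the commit message describes.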
2015-08-19	vrf: vrf_master_ifindex_rcu is not always called with rcu read lock	Nikolay Aleksandrov
	While running net-next I hit this:
	[ 634.073119] ===============================
	[ 634.073150] [ INFO: suspicious RCU usage. ]
	[ 634.073182] 4.2.0-rc6+ #45 Not tainted
	[ 634.073213] -------------------------------
	[ 634.073244] include/net/vrf.h:38 suspicious rcu_dereference_check() usage!
	[ 634.073274] other info that might help us debug this:
	[ 634.073307] rcu_scheduler_active = 1, debug_locks = 1
	[ 634.073338] 2 locks held by swapper/0/0:
	[ 634.073369] #0: (((&n->timer))){+.-...}, at: [<ffffffff8112bc35>] call_timer_fn+0x5/0x480
	[ 634.073412] #1: (slock-AF_INET){+.-...}, at: [<ffffffff8174f0f5>] icmp_send+0x155/0x5f0
	[ 634.073450] stack backtrace:
	[ 634.073483] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6+ #45
	[ 634.073514] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
	[ 634.073545] 0000000000000000 0593ba8242d9ace4 ffff88002fc03b48 ffffffff81803f1b
	[ 634.073612] 0000000000000000 ffffffff81e12500 ffff88002fc03b78 ffffffff811003c5
	[ 634.073642] 0000000000000000 ffff88002ec4e600 ffffffff81f00f80 ffff88002fc03cf0
	[ 634.073669] Call Trace:
	[ 634.073694] <IRQ> [<ffffffff81803f1b>] dump_stack+0x4c/0x65
	[ 634.073728] [<ffffffff811003c5>] lockdep_rcu_suspicious+0xc5/0x100
	[ 634.073763] [<ffffffff8174eb56>] icmp_route_lookup+0x176/0x5c0
	[ 634.073793] [<ffffffff8174f2fb>] ? icmp_send+0x35b/0x5f0
	[ 634.073818] [<ffffffff8174f274>] ? icmp_send+0x2d4/0x5f0
	[ 634.073844] [<ffffffff8174f3ce>] icmp_send+0x42e/0x5f0
	[ 634.073873] [<ffffffff8170b662>] ipv4_link_failure+0x22/0xa0
	[ 634.073899] [<ffffffff8174bdda>] arp_error_report+0x3a/0x80
	[ 634.073926] [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
	[ 634.073952] [<ffffffff816d396e>] neigh_invalidate+0x8e/0x110
	[ 634.073984] [<ffffffff816d62ae>] neigh_timer_handler+0x1ae/0x290
	[ 634.074013] [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
	[ 634.074013] [<ffffffff8112bce3>] call_timer_fn+0xb3/0x480
	[ 634.074013] [<ffffffff8112bc35>] ? call_timer_fn+0x5/0x480
	[ 634.074013] [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0
	[ 634.074013] [<ffffffff8112c2bc>] run_timer_softirq+0x20c/0x430
	[ 634.074013] [<ffffffff810af50e>] __do_softirq+0xde/0x630
	[ 634.074013] [<ffffffff810afc97>] irq_exit+0x117/0x120
	[ 634.074013] [<ffffffff81810976>] smp_apic_timer_interrupt+0x46/0x60
	[ 634.074013] [<ffffffff8180e950>] apic_timer_interrupt+0x70/0x80
	[ 634.074013] <EOI> [<ffffffff8106b9d6>] ? native_safe_halt+0x6/0x10
	[ 634.074013] [<ffffffff81101d8d>] ? trace_hardirqs_on+0xd/0x10
	[ 634.074013] [<ffffffff81027d43>] default_idle+0x23/0x200
	[ 634.074013] [<ffffffff8102852f>] arch_cpu_idle+0xf/0x20
	[ 634.074013] [<ffffffff810f89ba>] default_idle_call+0x2a/0x40
	[ 634.074013] [<ffffffff810f8dcc>] cpu_startup_entry+0x39c/0x4c0
	[ 634.074013] [<ffffffff817f9cad>] rest_init+0x13d/0x150
	[ 634.074013] [<ffffffff81f69038>] start_kernel+0x4a8/0x4c9
	[ 634.074013] [<ffffffff81f68120>] ? early_idt_handler_array+0x120/0x120
	[ 634.074013] [<ffffffff81f68339>] x86_64_start_reservations+0x2a/0x2c
	[ 634.074013] [<ffffffff81f68485>] x86_64_start_kernel+0x14a/0x16d
	It would seem vrf_master_ifindex_rcu() can be called without RCU held in other contexts as well so introduce a new helper which acquires rcu and returns the ifindex. Also add curly braces around both the "if" and "else" parts as per the style guide. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-19	NFSv4: Enable delegated opens even when reboot recovery is pending	Trond Myklebust
	Unlike the previous attempt, this takes into account the fact that we may be calling it from the recovery thread itself. Detect this by looking at what kind of open we're doing, and checking the state of the NFS_DELEGATION_NEED_RECLAIM if it turns out we're doing a reboot reclaim-type open. Cc: Olga Kornievskaia <aglo@umich.edu> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-19	lwtunnel: Fix the sparse warnings in fib_encap_match	Ying Xue
	When CONFIG_LWTUNNEL config is not enabled, the lwtstate_free() is not declared in lwtunnel.h at all. However, even in this case, the function is still referenced in fib_semantics.c so that there appears the following sparse warnings: net/ipv4/fib_semantics.c:553:17: error: undefined identifier 'lwtstate_free' CC net/ipv4/fib_semantics.o net/ipv4/fib_semantics.c: In function ‘fib_encap_match’: net/ipv4/fib_semantics.c:553:3: error: implicit declaration of function ‘lwtstate_free’ [-Werror=implicit-function-declaration] cc1: some warnings being treated as errors make[1]: * [net/ipv4/fib_semantics.o] Error 1 make: * [net/ipv4/fib_semantics.o] Error 2 To eliminate the error, we define an empty function for lwtstate_free() in lwtunnel.h when CONFIG_LWTUNNEL is disabled. Fixes: df383e6240ef ("lwtunnel: fix memory leak") Cc: Jiri Benc <jbenc@redhat.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
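The "empty function when the option is disabled" pattern looks roughly like this. It is a user-space sketch with hypothetical names (`struct lwt_state`, `lwt_state_free()`, `compare_and_release()`), not the contents of lwtunnel.h. The configuration macro is left undefined to model the disabled case.

```c
#include <stdlib.h>

/* #define CONFIG_LWTUNNEL 1 */   /* left undefined to model the disabled case */

struct lwt_state { int type; };   /* hypothetical, reduced state record */

#ifdef CONFIG_LWTUNNEL
/* Feature enabled: the real implementation releases the state. */
static inline void lwt_state_free(struct lwt_state *s)
{
	free(s);
}
#else
/* Feature disabled: an empty stub keeps callers compiling. */
static inline void lwt_state_free(struct lwt_state *s)
{
	(void)s;
}
#endif

/* Caller that references the function regardless of configuration. */
static int compare_and_release(struct lwt_state *built, int wanted_type)
{
	int match = built ? built->type == wanted_type : 0;

	lwt_state_free(built);
	return match;
}
```

Because the stub has the same signature as the real function, the caller needs no `#ifdef` of its own, and the implicit-declaration error quoted above cannot occur.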
2015-08-20	drm/edid: add function to help find SADs	Russell King
	Add a function to find the start of the SADs in the ELD. This complements the helper to retrieve the SAD count. [airlied: this fixes a build problem with the alsa eld helper which required this]. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Dave Airlie <airlied@redhat.com>
2015-08-20	genirq: Introduce irq_chip_set_type_parent() helper	Grygorii Strashko
	This helper is required for irq chips which do not implement an irq_set_type callback and need to call down the irq domain hierarchy for the actual trigger type change. This helper is required to fix further wreckage caused by the conversion of TI OMAP to hierarchical irq domains and therefore tagged for stable. [ tglx: Massaged changelog ] Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: <linux@arm.linux.org.uk> Cc: <nsekhar@ti.com> Cc: <jason@lakedaemon.net> Cc: <balbi@ti.com> Cc: <linux-arm-kernel@lists.infradead.org> Cc: <tony@atomide.com> Cc: <marc.zyngier@arm.com> Cc: stable@vger.kernel.org # 4.1 Link: http://lkml.kernel.org/r/1439554830-19502-3-git-send-email-grygorii.strashko@ti.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
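Delegating a trigger type change to the parent level of a hierarchy can be sketched as follows. This is a reduced user-space model: the structures borrow the kernel's names but keep only the fields needed here, the `type` field and `root_set_type()` are hypothetical, and `set_type_parent()` is an illustration of the idea rather than the kernel helper itself.

```c
#include <errno.h>
#include <stddef.h>

struct irq_data;

struct irq_chip {
	int (*irq_set_type)(struct irq_data *d, unsigned int type);
};

struct irq_data {
	const struct irq_chip *chip;
	struct irq_data *parent_data;   /* next level of the hierarchy */
	unsigned int type;              /* hypothetical: records the programmed type */
};

/* Delegate the trigger type change to the parent level. */
static int set_type_parent(struct irq_data *d, unsigned int type)
{
	d = d->parent_data;
	if (d && d->chip->irq_set_type)
		return d->chip->irq_set_type(d, type);
	return -ENOSYS;
}

/* Hypothetical root controller that actually programs the trigger type. */
static int root_set_type(struct irq_data *d, unsigned int type)
{
	d->type = type;
	return 0;
}

static const struct irq_chip root_chip = { .irq_set_type = root_set_type };

/* A stacked chip with no type handling of its own points at the helper. */
static const struct irq_chip child_chip = { .irq_set_type = set_type_parent };
```

A stacked chip that leaves the callback unset gives the core nothing to call, so wiring the helper in lets the request reach the level that can act on it.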
2015-08-19	block: Replace SG_GAPS with new queue limits mask	Keith Busch
	The SG_GAPS queue flag caused checks for bio vector alignment against PAGE_SIZE, but the device may have different constraints. This patch adds a queue limits so a driver with such constraints can set to allow requests that would have been unnecessarily split. The new gaps check takes the request_queue as a parameter to simplify the logic around invoking this function. This new limit makes the queue flag redundant, so removing it and all usage. Device-mappers will inherit the correct settings through blk_stack_limits(). Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
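One way to picture the change: instead of a yes/no flag that implies PAGE_SIZE alignment, the queue carries a boundary mask and the gap check tests offsets against it. The sketch below is a user-space illustration under that assumption. `struct limits`, its `virt_boundary_mask` field, `struct seg`, and `gap_to_prev()` are names chosen for this example and should not be read as the block layer's definitions.

```c
#include <stdbool.h>

struct limits { unsigned long virt_boundary_mask; };  /* 0 means no constraint */
struct seg { unsigned int offset, len; };

/*
 * True if appending a segment that starts at next_offset after prev would
 * leave a gap: prev must end on the boundary and next must start on it.
 */
static bool gap_to_prev(const struct limits *l, const struct seg *prev,
			unsigned int next_offset)
{
	if (!l->virt_boundary_mask)
		return false;
	return ((prev->offset + prev->len) & l->virt_boundary_mask) ||
	       (next_offset & l->virt_boundary_mask);
}
```

A device with a 4 KiB constraint would set the mask to `4096 - 1`. A device with no constraint leaves it at zero, and no segment pair is ever reported as a gap.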
2015-08-19	netfilter: bridge: fix IPv6 packets not being bridged with CONFIG_IPV6=n	Bernhard Thaler
	230ac490f7fba introduced a dependency to CONFIG_IPV6 which breaks bridging of IPv6 packets on a bridge with CONFIG_IPV6=n. Sysctl entry /proc/sys/net/bridge/bridge-nf-call-ip6tables defaults to 1, for this reason packets are handled by br_nf_pre_routing_ipv6(). When compiled with CONFIG_IPV6=n this function returns NF_DROP but should return NF_ACCEPT to let packets through. Change CONFIG_IPV6=n br_nf_pre_routing_ipv6() return value to NF_ACCEPT. Tested with a simple bridge with two interfaces and IPv6 packets trying to pass from host on left side to host on right side of the bridge. Fixes: 230ac490f7fba ("netfilter: bridge: split ipv6 code into separated file") Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-19	netfilter: nf_tables: Use 32 bit addressing register from nft_type_to_reg()	Pablo Neira Ayuso
	nft_type_to_reg() needs to return the register in the new 32 bit addressing, otherwise we hit EINVAL when using mappings. Fixes: 49499c3 ("netfilter: nf_tables: switch registers to 32 bit addressing") Reported-by: Andreas Schultz <aschultz@tpip.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-19	Merge tag 'asoc-v4.2-disable-topology' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus	Takashi Iwai
	ASoC: Disable topology support for v4.2 The topology code merged in the v4.2 merge window introduced a new ABI which was believed to be suitable for use, but subsequent additional work by the developers of this feature has revealed some problems that need to be addressed. In order to allow this to be done without having to support the initial ABI, add Kconfig to disable the build and also add some #error statements to the UAPI header so users can't use them.
2015-08-19	Revert "[media] rc: rc-ir-raw: Add scancode encoder callback"	David Härdeman
	This reverts commit 9869da5bacc5c9b865a183bd36c04be76cdd325d. The current code is not mature enough, the API should allow a single protocol to be specified. Also, the current code contains heuristics that will depend on module load order. Signed-off-by: David Härdeman <david@hardeman.nu> Acked-by: Antti Seppälä <a.seppala@gmail.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
2015-08-19	Revert "[media] rc: rc-core: Add support for encode_wakeup drivers"	David Härdeman
	This reverts commit 0d830b2d1295fee82546d57185da5a6604f11ae2. The current code is not mature enough, the API should allow a single protocol to be specified. Also, the current code contains heuristics that will depend on module load order. Signed-off-by: David Härdeman <david@hardeman.nu> Acked-by: Antti Seppälä <a.seppala@gmail.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
2015-08-19	Revert "[media] rc: nuvoton-cir: Add support for writing wakeup samples via sysfs filter callback"	David Härdeman
	This reverts commit da7ee60b03bd66bb10974d7444aa444de6391312. The current code is not mature enough, the API should allow a single protocol to be specified. Also, the current code contains heuristics that will depend on module load order. Signed-off-by: David Härdeman <david@hardeman.nu> Acked-by: Antti Seppälä <a.seppala@gmail.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
2015-08-18	vrf: drop unused num_slaves member	Nikolay Aleksandrov
	slave_queue has a num_slaves member which is unused, drop it. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18	lwtunnel: fix memory leak	Jiri Benc
	The built lwtunnel_state struct has to be freed after comparison. Fixes: 571e722676fe3 ("ipv4: support for fib route lwtunnel encap attributes") Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18	blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy	Tejun Heo
	cgroup is trying to make interface consistent across different controllers. For weight based resource control, the knob should have the range [1, 10000] and default to 100. This patch updates cfq-iosched so that the weight range conforms. The internal calculations have enough range and the widening of the weight range shouldn't cause any problem. * blkcg_policy->cpd_bind_fn() is added. If present, this is invoked when blkcg is attached to a hierarchy. * cfq_cpd_init() is updated to use the new default value on the unified hierarchy. * cfq_cpd_bind() callback is implemented to clear per-blkg configs and apply the default config matching the hierarchy type. * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue() is moved into !CONFIG_CFQ_GROUP_IOSCHED block. cfq_cpd_bind() is now responsible for initializing the initial weights when blkcg is enabled. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
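The range and default stated in this commit can be expressed as a small validation routine. This is a user-space sketch: `struct group_cfg`, `cfg_bind()`, and `cfg_set_weight()` are hypothetical names, and rejecting out-of-range values with `-ERANGE` instead of clamping them is a choice made for this example only.

```c
#include <errno.h>

/* Range stated in the commit message: [1, 10000], default 100. */
#define WEIGHT_MIN 1
#define WEIGHT_DFL 100
#define WEIGHT_MAX 10000

struct group_cfg { unsigned int weight; };   /* hypothetical per-group config */

/* Bind-time reset: drop any earlier setting and apply the default. */
static void cfg_bind(struct group_cfg *c)
{
	c->weight = WEIGHT_DFL;
}

/* Accept only values inside the unified range. */
static int cfg_set_weight(struct group_cfg *c, long val)
{
	if (val < WEIGHT_MIN || val > WEIGHT_MAX)
		return -ERANGE;
	c->weight = (unsigned int)val;
	return 0;
}
```

`cfg_bind()` plays the role the message assigns to the bind callback: earlier per-group settings are cleared and the default for the hierarchy type is applied.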
2015-08-18	blkcg: implement interface for the unified hierarchy	Tejun Heo
	blkcg interface grew to be the biggest of all controllers and unfortunately the most inconsistent too. The interface files are inconsistent with a number of close duplicates. Some files have recursive variants while others don't. There's a distinction between normal and leaf weights which isn't intuitive, and there are a lot of stat knobs which don't make much sense outside of debugging and expose too many implementation details to userland. In the unified hierarchy, everything is always hierarchical and internal nodes can't have tasks, which removes the two structural issues twisting the current interface. The interface has to be updated significantly anyway and this is a good chance to revamp it as a whole. This patch implements blkcg interface for the unified hierarchy. * (from a previous patch) blkcg is identified by "io" instead of "blkio" on the unified hierarchy. Given that the whole interface is updated anyway, the rename shouldn't carry noticeable conversion overhead. * The original interface, which consisted of 27 files, is replaced with the following three files. blkio.stat : per-blkcg stats blkio.weight : per-cgroup and per-cgroup-queue weight settings blkio.max : per-cgroup-queue bps and iops max limits Documentation/cgroups/unified-hierarchy.txt updated accordingly. v2: blkcg_policy->dfl_cftypes wasn't removed on blkcg_policy_unregister() corrupting the cftypes list. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18	blkcg: misc preparations for unified hierarchy interface	Tejun Heo
	* Export blkg_dev_name() * Drop unnecessary @cft from __cfq_set_weight(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18	blkcg: move body parsing from blkg_conf_prep() to its callers	Tejun Heo
	Currently, blkg_conf_prep() expects input to be of the following form MAJ:MIN NUM and reads the NUM part into blkg_conf_ctx->v. This is quite restrictive and gets in the way in implementing blkcg interface for the unified hierarchy. This patch updates blkg_conf_prep() so that it expects MAJ:MIN BODY_STR where BODY_STR is an arbitrary string. blkg_conf_ctx->v is replaced with ->body which is a char pointer pointing to the start of BODY_STR. Parsing of the body is moved to blkg_conf_prep()'s callers. To allow using, for example, strsep() on blkg_conf_ctx->body, it is a non-const pointer and to accommodate that const is dropped from @input too. This doesn't cause any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
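The split between prefix parsing and body parsing can be sketched in user space. `struct conf_ctx`, `conf_prep()`, and `set_limit()` are hypothetical names for this illustration and do not reproduce the kernel functions. The prep step consumes only `MAJ:MIN` and hands the caller a writable pointer to the rest.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

struct conf_ctx {
	unsigned int major, minor;
	char *body;                 /* start of BODY_STR inside the input buffer */
};

/* Parse the "MAJ:MIN" prefix and leave the rest for the caller to interpret. */
static int conf_prep(char *input, struct conf_ctx *ctx)
{
	int len = 0;

	if (sscanf(input, "%u:%u%n", &ctx->major, &ctx->minor, &len) != 2)
		return -1;
	if (!isspace((unsigned char)input[len]))
		return -1;
	ctx->body = input + len;
	while (isspace((unsigned char)*ctx->body))
		ctx->body++;
	return 0;
}

/* Caller-side parsing: this caller expects the body to be a single number. */
static int set_limit(char *input, struct conf_ctx *ctx, unsigned long *limit)
{
	char *end;

	if (conf_prep(input, ctx))
		return -1;
	*limit = strtoul(ctx->body, &end, 10);
	return (end != ctx->body && *end == '\0') ? 0 : -1;
}
```

A different caller could accept a body such as `rbps=100` and tokenize it in place, which is why the body pointer is non-const in this sketch.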
2015-08-18	blkcg: mark existing cftypes as legacy	Tejun Heo
	blkcg is about to grow interface for the unified hierarchy. Add legacy to existing cftypes. * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes * blk-cgroup.c:blkcg_files -> blkcg_legacy_files * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files * blk-throttle.c:throtl_files -> throtl_legacy_files Pure renames. No functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18	blkcg: rename subsystem name from blkio to io	Tejun Heo
	blkio interface has become messy over time and is currently the largest. In addition to the inconsistent naming scheme, it has multiple stat files which report more or less the same thing, a number of debug stat files which expose internal details which shouldn't have been part of the public interface in the first place, recursive and non-recursive stats and leaf and non-leaf knobs. Both recursive vs. non-recursive and leaf vs. non-leaf distinctions don't make any sense on the unified hierarchy as only leaf cgroups can contain processes. cgroups is going through a major interface revision with the unified hierarchy involving significant fundamental usage changes and given that a significant portion of the interface doesn't make sense anymore, it's a good time to reorganize the interface. As the first step, this patch renames the external visible subsystem name from "blkio" to "io". This is more concise, matches the other two major subsystem names, "cpu" and "memory", and better suited as blkcg will be involved in anything writeback related too whether an actual block device is involved or not. As the subsystem legacy_name is set to "blkio", the only userland visible change outside the unified hierarchy is that blkcg is reported as "io" instead of "blkio" in the subsystem initialized message during boot. On the unified hierarchy, blkcg now appears as "io". Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: cgroups@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18	blkcg: move io_service_bytes and io_serviced stats into blkcg_gq	Tejun Heo
	Currently, both cfq-iosched and blk-throttle keep track of io_service_bytes and io_serviced stats. While keeping track of them separately may be useful during development, it doesn't make much sense otherwise. Also, blk-throttle was counting bio's as IOs while cfq-iosched request's, which is more confusing than informative. This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq), removes the counterparts from cfq-iosched and blk-throttle and let them print from the common blkg counters. The common counters are incremented during bio issue in blkcg_bio_issue_check(). The outputs are still filtered by whether the policy has blkg_policy_data on a given blkg, so cfq's output won't show up if it has never been used for a given blkg. The only times when the outputs would differ significantly are when policies are attached on the fly or elevators are switched back and forth. Those are quite exceptional operations and I don't think they warrant keeping separate counters. v3: Update blkio-controller.txt accordingly. v2: Account IOs during bio issues instead of request completions so that bio-based drivers can be handled the same way. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18	blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq	Tejun Heo
	Currently, blkg_[rw]stat_recursive_sum() assume that the target counter is located in pd (blkg_policy_data); however, some counters are planned to be moved to blkg (blkcg_gq). This patch updates blkg_[rw]stat_recursive_sum() to take blkg and blkg_policy pointers instead of pd. If policy is NULL, it indexes into blkg. If non-NULL, into the blkg's pd of the policy. The existing usages are updated to maintain the current behaviors. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18	blkcg: make blkcg_[rw]stat per-cpu	Tejun Heo
	blkcg_[rw]stat are used as stat counters for blkcg policies. It isn't per-cpu by itself and blk-throttle makes it per-cpu by wrapping around it. This patch makes blkcg_[rw]stat per-cpu and drops the ad-hoc per-cpu wrapping in blk-throttle. * blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct percpu_counter. This makes syncp unnecessary as remote accesses are handled by percpu_counter itself. * blkg_[rw]stat_init() can now fail due to percpu allocation failure and thus are updated to return int. * percpu_counters need explicit freeing. blkg_[rw]stat_exit() added. * As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading and summing results are stored in ->aux_cnt[] instead. * Custom per-cpu stat implementation in blk-throttle is removed. This makes all blkcg stat counters per-cpu without complicating policy implementations. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18	blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it	Tejun Heo
	cgroup stats are local to each cgroup and don't propagate to ancestors by default. When recursive stats are necessary, the sum is calculated over all the descendants. This initially was for backward compatibility to support both group-local and recursive stats but this mode of operation makes general sense as stat update is much hotter than reporting those stats. This however ends up losing recursive stats when a child is removed. To work around this, cfq-iosched adds its stats to its parent cfq_group->dead_stats which is summed up together when calculating recursive stats. It's planned that the core stats will be moved to blkcg_gq, so we want to move the mechanism for keeping track of the stats of dead children from cfq to blkcg core. This patch adds blkg_[rw]stat->aux_cnt which are atomic64_t's keeping track of auxiliary counts which are excluded when reading local counts but included for recursive. blkg_[rw]stat_merge() which were used by cfq to implement dead_stats are replaced by blkg_[rw]stat_add_aux(), and cfq now forwards stats of a dead cgroup to the aux counts of parent->stats instead of separate ->dead_stats. This will also help making blkg_[rw]stats per-cpu. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
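The local versus recursive distinction and the forwarding of a dead child's counts can be modelled with plain integers. This sketch is single-threaded and uses hypothetical names (`struct stat_node`, `stat_read()`, `stat_recursive_sum()`, `stat_remove_child()`). The commit describes atomic64_t counters, which are not modelled here, and the sketch assumes a removed child has no live children of its own.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 4

struct stat_node {
	uint64_t cnt;       /* this group's own count */
	uint64_t aux_cnt;   /* counts inherited from removed descendants */
	struct stat_node *child[MAX_CHILDREN];
};

/* Local read: auxiliary counts are excluded. */
static uint64_t stat_read(const struct stat_node *n)
{
	return n->cnt;
}

/* Recursive read: own count, auxiliary count, and every live descendant. */
static uint64_t stat_recursive_sum(const struct stat_node *n)
{
	uint64_t sum = n->cnt + n->aux_cnt;

	for (int i = 0; i < MAX_CHILDREN; i++)
		if (n->child[i])
			sum += stat_recursive_sum(n->child[i]);
	return sum;
}

/* On removal, forward what the child accumulated to the parent's aux count. */
static void stat_remove_child(struct stat_node *parent, int slot)
{
	struct stat_node *c = parent->child[slot];

	if (!c)
		return;
	parent->aux_cnt += c->cnt + c->aux_cnt;
	parent->child[slot] = NULL;
}
```

The property worth checking is that the recursive sum is the same before and after a child is removed, while the parent's local count is unaffected.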
2015-08-18	blkcg: consolidate blkg creation in blkcg_bio_issue_check()	Tejun Heo
	blkg (blkcg_gq) currently is created by blkcg policies invoking blkg_lookup_create() which ends up repeating about the same code in different policies. Theoretically, this can avoid the overhead of looking and/or creating blkg's if blkcg is enabled but no policy is in use; however, the cost of blkg lookup / creation is very low especially if only the root blkcg is in use which is highly likely if no blkcg policy is in active use - it boils down to a single very predictable conditional and surrounding RCU protection. This patch consolidates blkg creation to a new function blkcg_bio_issue_check() which is called during bio issue from generic_make_request_checks(). blkcg_bio_issue_check() is now the only function which tries to create missing blkg's. The subsequent policy and request_list operations just perform blkg_lookup() and if missing falls back to the root. * blk_get_rl() no longer tries to create blkg. It uses blkg_lookup() instead of blkg_lookup_create(). * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu read locked and blkg already looked up. Both throtl_lookup_tg() and throtl_lookup_create_tg() are dropped. * cfq is similarly updated. cfq_lookup_create_cfqg() is replaced with cfq_lookup_cfqg() which uses blkg_lookup(). This consolidates blkg handling and avoids unnecessary blkg creation retries under memory pressure. In addition, this provides a common bio entry point into blkcg where things like common accounting can be performed. v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and !CONFIG_BLK_DEV_THROTTLING. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
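The "create in one place, look up everywhere else" arrangement can be sketched with a small table. All names here (`struct group`, `groups`, `root_group`, `group_lookup()`, `issue_check()`, `policy_get_group()`) are hypothetical, and the fixed-size array stands in for whatever lookup structure the real code uses. Locking and RCU are not modelled.

```c
#include <stddef.h>

#define MAX_GROUPS 4

struct group { int id; int in_use; };

static struct group root_group = { 0, 1 };
static struct group groups[MAX_GROUPS];     /* hypothetical per-queue table */

/* Pure lookup: never allocates. */
static struct group *group_lookup(int id)
{
	for (int i = 0; i < MAX_GROUPS; i++)
		if (groups[i].in_use && groups[i].id == id)
			return &groups[i];
	return NULL;
}

/* The single place that may create a missing group (issue-time check). */
static struct group *issue_check(int id)
{
	struct group *g = group_lookup(id);

	for (int i = 0; !g && i < MAX_GROUPS; i++) {
		if (!groups[i].in_use) {
			groups[i].id = id;
			groups[i].in_use = 1;
			g = &groups[i];
		}
	}
	return g ? g : &root_group;
}

/* Later stages only look up, and fall back to the root group on a miss. */
static struct group *policy_get_group(int id)
{
	struct group *g = group_lookup(id);

	return g ? g : &root_group;
}
```

If creation fails at issue time (the table is full in this model), every later stage still gets a usable group without retrying the allocation.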
2015-08-18	blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()	Tejun Heo
	Currently, both throttle and cfq policies implement their own root blkg (blkcg_gq) lookup fast path. This patch moves root blkg optimization from throtl_lookup_tg() to __blkg_lookup(). cfq-iosched currently doesn't use blkg_lookup() but will be converted and drop the optimization too. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18	blkcg: inline [__]blkg_lookup()	Tejun Heo
	blkg_lookup() checks whether the target queue is bypassing and, if not, calls __blkg_lookup() which first checks the lookup hint and then performs radix tree walk. The operations up to hint checking are trivial and there are many users of this function. This patch inlines blkg_lookup() and the fast path part of __blkg_lookup(). The radix tree lookup and hint update are now in blkg_lookup_slowpath(). This will help consolidate blkg handling by making it easier to move the root blkcg short-circuit into the inlined lookup fast path. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
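	The fast path / slow path split can be pictured with the following userspace sketch. It is an illustration under assumptions, not the kernel implementation: the struct and function names are hypothetical, a linear array scan stands in for the radix tree walk, and slow_lookups exists only so the effect of the hint can be observed.

```c
#include <assert.h>
#include <stddef.h>

struct queue_model { int bypassing; };

/* One group per (cgroup, queue) pair; only the queue link is modeled. */
struct grp { const struct queue_model *q; };

struct cgroup_model {
	struct grp *hint;            /* last group found for this cgroup */
	struct grp *slots[8];        /* stand-in for the radix tree */
	size_t nr;
	unsigned long slow_lookups;  /* instrumentation for the example */
};

/* Out-of-line part: full search plus hint update. */
struct grp *lookup_slowpath(struct cgroup_model *cg,
			    const struct queue_model *q)
{
	cg->slow_lookups++;
	for (size_t i = 0; i < cg->nr; i++) {
		if (cg->slots[i]->q == q) {
			cg->hint = cg->slots[i];
			return cg->slots[i];
		}
	}
	return NULL;
}

/* Inlined part: bypass check and hint check are a few cheap branches. */
static inline struct grp *lookup(struct cgroup_model *cg,
				 const struct queue_model *q)
{
	if (q->bypassing)
		return NULL;
	if (cg->hint && cg->hint->q == q)
		return cg->hint;
	return lookup_slowpath(cg, q);
}
```

	Repeated lookups of the same queue hit the hint and never reach lookup_slowpath(), which is the case the inlining is meant to make cheap.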
2015-08-18	blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods	Tejun Heo
	Each active policy has a cpd (blkcg_policy_data) on each blkcg. The cpd's were allocated by blkcg core and each policy could request to allocate extra space at the end by setting blkcg_policy->cpd_size larger than the size of cpd. This is a bit unusual but blkg (blkcg_gq) policy data used to be handled this way too so it made sense to be consistent; however, blkg policy data switched to alloc/free callbacks. This patch makes similar changes to cpd handling. blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size. As cpd allocation is now done from policy side, it can simply allocate a larger area which embeds cpd at the beginning. As ->cpd_alloc_fn() may be able to perform all necessary initializations, this patch makes ->cpd_init_fn() optional. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18	blkcg: minor updates around blkcg_policy_data	Tejun Heo
	* Rename blkcg->pd[] to blkcg->cpd[] so that cpd is consistently used for blkcg_policy_data. * Make blkcg_policy->cpd_init_fn() take blkcg_policy_data instead of blkcg. This makes it consistent with blkg_policy_data methods and to-be-added cpd alloc/free methods. * blkcg_policy_data->blkcg and cpd_to_blkcg() added so that cpd_init_fn() can determine the associated blkcg from blkcg_policy_data. v2: blkcg_policy_data->blkcg initializations were missing. Added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18	blkcg: make blkcg_policy methods take a pointer to blkcg_policy_data	Tejun Heo
	The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd (blkg_policy_data) while the older ones use blkg (blkcg_gq). Using blkg doesn't make sense for ->pd_alloc_fn(), pd can always be mapped to blkg after allocation, and these are policy-specific methods, so it makes sense to converge on pd. This patch makes all methods deal with pd instead of blkg. Most conversions are trivial. In blk-cgroup.c, a couple of method invocation sites now test whether pd exists instead of policy state for consistency. This shouldn't cause any behavioral differences. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18	blk-throttle: clean up blkg_policy_data alloc/init/exit/free methods	Tejun Heo
	With the recent addition of alloc and free methods, things became messier. This patch reorganizes them as follows. * ->pd_alloc_fn() Responsible for allocation and static initializations - the ones which can be done independent of where the pd might be attached. * ->pd_init_fn() Initializations which require the knowledge of where the pd is attached. * ->pd_free_fn() The counterpart of pd_alloc_fn(). Static de-init and freeing. This leaves ->pd_exit_fn() without any users. Removed. While at it, collapse a one-liner function throtl_pd_exit(), which has only one user, into its user. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-18	blkcg: replace blkcg_policy->pd_size with ->pd_alloc/free_fn() methods	Tejun Heo
	A blkg (blkcg_gq) represents the relationship between a cgroup and request_queue. Each active policy has a pd (blkg_policy_data) on each blkg. The pd's were allocated by blkcg core and each policy could request to allocate extra space at the end by setting blkcg_policy->pd_size larger than the size of pd. This is a bit unusual but was done this way mostly to simplify error handling and all the existing use cases could be handled this way; however, this is becoming too restrictive now that percpu memory can be allocated without blocking. This introduces two new mandatory blkcg_policy methods - pd_alloc_fn() and pd_free_fn() - which are used to allocate and release pd for a given policy. As pd allocation is now done from policy side, it can simply allocate a larger area which embeds pd at the beginning. This change makes ->pd_size pointless. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
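	The "larger area which embeds pd at the beginning" pattern looks roughly like the userspace sketch below. The names and signatures are simplified assumptions for illustration (the real callbacks take additional arguments), calloc()/free() stand in for kernel allocators, and pd_to_container() is a hand-rolled equivalent of container_of().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Generic per-policy data as seen by the core (hypothetical model). */
struct policy_data { int plid; };

/* Callbacks supplied by each policy instead of a pd_size field. */
struct policy_ops {
	struct policy_data *(*pd_alloc_fn)(void);
	void (*pd_free_fn)(struct policy_data *pd);
};

/* Policy-private structure embedding the generic part first. */
struct throttle_data {
	struct policy_data pd;
	unsigned long bps_limit;
};

/* Map a pointer to an embedded member back to its container. */
#define pd_to_container(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct policy_data *throttle_pd_alloc(void)
{
	struct throttle_data *td = calloc(1, sizeof(*td));

	if (!td)
		return NULL;
	td->bps_limit = (unsigned long)-1;  /* "no limit" default */
	return &td->pd;                     /* core only sees the pd */
}

void throttle_pd_free(struct policy_data *pd)
{
	free(pd_to_container(pd, struct throttle_data, pd));
}

const struct policy_ops throttle_ops = {
	.pd_alloc_fn = throttle_pd_alloc,
	.pd_free_fn  = throttle_pd_free,
};
```

	Because the policy owns the allocation, it is free to allocate additional resources alongside the structure, which a size field alone could not express.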
2015-08-18	blkcg: restructure blkg_policy_data allocation in blkcg_activate_policy()	Tejun Heo
	When a policy gets activated, it needs to allocate and install its policy data on all existing blkg's (blkcg_gq's). Because blkg iteration is protected by a spinlock, it currently counts the total number of blkg's in the system, allocates the matching number of policy data on a list and installs them during a single iteration. This can be simplified by using speculative GFP_NOWAIT allocations while iterating and falling back to a preallocated policy data on failure. If the preallocated one has already been consumed, it releases the lock, preallocates with GFP_KERNEL and then restarts the iteration. This can be a bit more expensive than before but policy activation is a very cold path and shouldn't matter. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
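	The allocate-speculatively, fall-back-to-prealloc, restart-on-exhaustion loop can be modeled in userspace as below. This is a sketch under assumptions rather than the kernel code: alloc_nowait() and alloc_blocking() are hypothetical stand-ins for GFP_NOWAIT and GFP_KERNEL allocations, the locked flag models the spinlock, and nowait_budget is an invented knob that makes the non-blocking allocations fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct pdata { int unused; };
struct group { struct pdata *pd; };

struct ctx {
	bool locked;        /* models the lock held during iteration */
	int nowait_budget;  /* non-blocking allocations that will succeed */
	int restarts;       /* instrumentation for the example */
};

/* Non-blocking allocation: may fail, safe under the lock. */
struct pdata *alloc_nowait(struct ctx *c)
{
	if (c->nowait_budget <= 0)
		return NULL;
	c->nowait_budget--;
	return calloc(1, sizeof(struct pdata));
}

/* Blocking allocation: must not be called with the lock held. */
struct pdata *alloc_blocking(struct ctx *c)
{
	assert(!c->locked);
	return calloc(1, sizeof(struct pdata));
}

int activate(struct ctx *c, struct group *groups, size_t n)
{
	struct pdata *prealloc = alloc_blocking(c);

	if (!prealloc)
		return -1;
restart:
	c->locked = true;
	for (size_t i = 0; i < n; i++) {
		struct pdata *pd;

		if (groups[i].pd)
			continue;               /* installed on an earlier pass */
		pd = alloc_nowait(c);
		if (!pd) {                      /* consume the spare instead */
			pd = prealloc;
			prealloc = NULL;
		}
		if (!pd) {                      /* spare already used: refill */
			c->locked = false;
			prealloc = alloc_blocking(c);
			if (!prealloc)
				return -1;
			c->restarts++;
			goto restart;
		}
		groups[i].pd = pd;
	}
	c->locked = false;
	free(prealloc);                         /* unused spare, may be NULL */
	return 0;
}
```

	Each restart skips groups that already have data installed, so the extra passes only cost iteration time, which is acceptable on a cold path.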
2015-08-18	blkcg: remove unnecessary request_list->blkg NULL test in blk_put_rl()	Tejun Heo
	Since ec13b1d6f0a0 ("blkcg: always create the blkcg_gq for the root blkcg"), a request_list always has its blkg associated. Drop unnecessary rl->blkg NULL test from blk_put_rl(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>