2017-04-19perf tools: Move sane ctype stuff from util.h to sane_ctype.hArnaldo Carvalho de Melo
More stuff that came from git, out of the hodge-podge that is util.h Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-e3lana4gctz3ub4hn4y29hkw@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-04-19perf tools: Ditch unused PATH_SEP, STRIP_EXTENSIONArnaldo Carvalho de Melo
These only make sense for Windows, where git is supported. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-lzxlhmqrizk72d0zcsreggy8@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-04-19perf tools: Replace STR() calls with __stringify()Arnaldo Carvalho de Melo
Both do the same thing; the latter is the one we get from linux/stringify.h, i.e. we now use the same function name/practice as the kernel sources. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-w2sxa5o4bfx7fjrd5mu4zmke@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
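For reference, a minimal sketch of what linux/stringify.h provides: the two-level expansion is what makes __stringify() expand macro arguments before stringifying them. The PERF_EXAMPLE macro below is purely illustrative.

#define __stringify_1(x...)  #x
#define __stringify(x...)    __stringify_1(x)

#define PERF_EXAMPLE 42                 /* hypothetical macro, for illustration only */
/* __stringify(PERF_EXAMPLE) expands to "42"; the old STR() produced the same result */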
2017-04-19perf tools: Remove PRI[xu] macros from perf.hArnaldo Carvalho de Melo
We get them from inttypes.h. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-qla4e4mwbf1oewafp1ee2etd@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-04-19perf tools: Include missing inttypes.h headerArnaldo Carvalho de Melo
Needed to use the PRI[xu](32,64) formatting macros. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-wkbho8kaw24q67dd11q0j39f@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
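A minimal user-space sketch of the pattern these two changes rely on: the PRI* macros come straight from inttypes.h, so perf.h needs no private definitions. The values below are placeholders.

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t ip = 0xffffffff81000000ULL;   /* placeholder sample address */
    uint32_t pid = 1234;                   /* placeholder pid */

    printf("ip: %#" PRIx64 " pid: %" PRIu32 "\n", ip, pid);
    return 0;
}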
2017-04-19perf tools: Remove unused macros from util.hArnaldo Carvalho de Melo
TYPEOF(), for instance, was only used by MSB(), which wasn't used at all; besides, typeof() is already used in many places and should be the preferred way. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-golox8oa2w1oq28snki14z6s@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
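As an illustration of why plain typeof() suffices, here is the familiar kernel-style type-safe min() macro, representative of how typeof() is already used across the tree (a sketch for context, not a change to util.h):

#define min(x, y) ({                                      \
    typeof(x) _min1 = (x);                                \
    typeof(y) _min2 = (y);                                \
    (void)(&_min1 == &_min2); /* warn if types differ */  \
    _min1 < _min2 ? _min1 : _min2; })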
2017-04-19tools include: Drop ARRAY_SIZE() definition from linux/hashtable.hArnaldo Carvalho de Melo
As tools/include/linux/kernel.h has it now, with the goodies present in the kernel.h counterpart, i.e. checking that the parameter is an array at build time. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-v0b41ivu6z6dyugbq9ffa9ez@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-04-19objtool: Drop ARRAY_SIZE() definition, tools/include/linux/kernel.h has it nowArnaldo Carvalho de Melo
And with the goodies present in the kernel.h counterpart, i.e. checking that the parameter is an array at build time. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-roiwxwgwgld4kygn65if60wa@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-04-19perf tools: Add include <linux/kernel.h> where ARRAY_SIZE() is usedArnaldo Carvalho de Melo
To pave the way for further cleanups where linux/kernel.h may stop being included in some header. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-qqxan6tfsl6qx3l0v3nwgjvk@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-04-19tools include: Move ARRAY_SIZE() to linux/kernel.hArnaldo Carvalho de Melo
To match the kernel, then look for places redefining it to make it use this version, which checks that its parameter is an array at build time. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-txlcf1im83bcbj6kh0wxmyy8@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-04-19tools include: Adopt __same_type() and __must_be_array() from the kernelArnaldo Carvalho de Melo
Will be used to adopt the more stringent version of ARRAY_SIZE(), the one in the kernel sources. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-d85dpvay1hoqscpezlntyd8x@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
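A sketch of the adopted helpers and the stricter ARRAY_SIZE() built on top of them; BUILD_BUG_ON_ZERO() comes from the new linux/bug.h introduced in the entry below. These mirror the kernel definitions but are shown here only for illustration.

#define BUILD_BUG_ON_ZERO(e)  (sizeof(struct { int:-!!(e); }))

#define __same_type(a, b)     __builtin_types_compatible_p(typeof(a), typeof(b))

/* &(a)[0] is a pointer, which has a different type from an array, so this
 * evaluates to 0 for real arrays and breaks the build when given a pointer */
#define __must_be_array(a)    BUILD_BUG_ON_ZERO(__same_type((a), &(a)[0]))

#define ARRAY_SIZE(arr)       (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))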
2017-04-19tools include: Introduce linux/bug.h, from the kernel sourcesArnaldo Carvalho de Melo
With just what we will need in the upcoming changesets, the BUILD_BUG_ON_ZERO() definition. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-lw8zg7x6ttwcvqhp90mwe3vo@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-04-19perf tools: Remove FLEX_ARRAY definitionArnaldo Carvalho de Melo
We have relied on symbol->name[0] since the beginning of tools/perf/, never having received any complaint about it, and all the containers build perf just fine, so remove this git codebase remnant. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-jsjpgojut8e22o2gtz83augk@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
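For illustration, the pattern FLEX_ARRAY used to paper over, written the way tools/perf/ has always relied on it. The struct and function names below are simplified stand-ins, not the actual perf code.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct symbol_example {
    uint64_t start;
    uint64_t end;
    char     name[0];   /* name bytes are allocated right behind the struct */
};

static struct symbol_example *symbol_example_new(uint64_t start, uint64_t end,
                                                 const char *name)
{
    size_t namelen = strlen(name) + 1;
    struct symbol_example *sym = calloc(1, sizeof(*sym) + namelen);

    if (!sym)
        return NULL;
    sym->start = start;
    sym->end   = end;
    memcpy(sym->name, name, namelen);
    return sym;
}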
2017-04-19perf unwind arm64: Add missing errno.h headerArnaldo Carvalho de Melo
Since it uses EINVAL unconditionally, it needs to also unconditionally include errno.h. Detected when recent changes made errno.h not be included by chance when tools/perf/arch/arm64/util/unwind-libunwind.c gets included by tools/perf/util/libunwind/arm64.c. Putting this changeset just before that change so that we don't lose bisectability on arm64. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jean Pihet <jean.pihet@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Fixes: 8ab596afb97b ("perf tools ARM64: Wire up perf_regs and unwind support") Link: http://lkml.kernel.org/n/tip-60zjev2o1locp5ivod38epa2@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-04-19acpi/arm64: Add SBSA Generic Watchdog support in GTDT driverFu Wei
This driver adds support for parsing the SBSA Generic Watchdog timer in GTDT: it parses all the info in the SBSA Generic Watchdog Structure in GTDT and creates a platform device with that information. This allows the operating system to obtain device data from the resources of the platform device. The platform device, named "sbsa-gwdt", can be used by the ARM SBSA Generic Watchdog driver. Signed-off-by: Fu Wei <fu.wei@linaro.org> Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org> Tested-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
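A sketch of the idea, not the driver's exact code: data parsed from the GTDT SBSA Generic Watchdog Structure is turned into a platform device that the "sbsa-gwdt" watchdog driver can bind to. The resource values and the init function below are placeholders.

#include <linux/platform_device.h>
#include <linux/ioport.h>
#include <linux/sizes.h>
#include <linux/err.h>
#include <linux/init.h>
#include <linux/kernel.h>

static struct resource sbsa_gwdt_res[] = {
    DEFINE_RES_MEM(0x0, SZ_4K),   /* refresh frame base, from GTDT */
    DEFINE_RES_MEM(0x0, SZ_4K),   /* control frame base, from GTDT */
    DEFINE_RES_IRQ(0),            /* watchdog signal GSIV, from GTDT */
};

static int __init sbsa_gwdt_sketch_init(void)
{
    struct platform_device *pdev;

    pdev = platform_device_register_simple("sbsa-gwdt", 0, sbsa_gwdt_res,
                                           ARRAY_SIZE(sbsa_gwdt_res));
    return PTR_ERR_OR_ZERO(pdev);
}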
2017-04-19clocksource: arm_arch_timer: add GTDT support for memory-mapped timerFu Wei
This patch adds memory-mapped timer register support, using the information provided by the new ACPI GTDT driver. Signed-off-by: Fu Wei <fu.wei@linaro.org> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> [Mark: verify CNTFRQ, only register the first frame] Signed-off-by: Mark Rutland <mark.rutland@arm.com>
2017-04-19acpi/arm64: Add memory-mapped timer support in GTDT driverFu Wei
On platforms booting with ACPI, architected memory-mapped timers' configuration data is provided by firmware through the ACPI GTDT static table. The clocksource architected timer kernel driver requires a firmware interface to collect timer configuration and configure its driver. This infrastructure is present for device tree systems, but it is missing on systems booting with ACPI. Implement the kernel infrastructure required to parse the static ACPI GTDT table so that the architected timer clocksource driver can make use of it on systems booting with ACPI, thereby enabling the corresponding timer configuration. Signed-off-by: Fu Wei <fu.wei@linaro.org> Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org> Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> [Mark: restructure error handling] Signed-off-by: Mark Rutland <mark.rutland@arm.com>
2017-04-19netfilter: tcp: Use TCP_MAX_WSCALE instead of literal 14Gao Feng
The window scale may be enlarged from 14 to 15 according to the IETF draft https://tools.ietf.org/html/draft-nishida-tcpm-maxwin-03. Use the TCP_MAX_WSCALE macro so this can be supported easily, in sync with the TCP stack, in the future. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
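A minimal sketch of the point of the change, using a user-space stand-in (in the kernel the constant lives in include/net/tcp.h): comparisons against the window-scale limit use the named macro rather than a bare 14, so a future bump to 15 only touches one place.

#include <stdint.h>

#define TCP_MAX_WSCALE 14   /* stand-in for the definition in net/tcp.h */

static inline uint8_t clamp_wscale(uint8_t wscale)
{
    /* cap the shift count advertised in the TCP window scale option */
    return wscale > TCP_MAX_WSCALE ? TCP_MAX_WSCALE : wscale;
}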
2017-04-19netfilter: ipvs: fix incorrect conflict resolutionFlorian Westphal
The commit ab8bc7ed864b9c4f1fcb00a22bbe4e0f66ce8003 ("netfilter: remove nf_ct_is_untracked") changed the line if (ct && !nf_ct_is_untracked(ct) && nfct_nat(ct)) { to if (ct && nfct_nat(ct)) { meanwhile, the commit 41390895e50bc4f28abe384c6b35ac27464a20ec ("netfilter: ipvs: don't check for presence of nat extension") from ipvs-next had changed the same line to if (ct && !nf_ct_is_untracked(ct) && (ct->status & IPS_NAT_MASK)) { When ipvs-next got merged into nf-next, the merge resolution took the first version, dropping the conversion of nfct_nat(). While this doesn't cause a problem at the moment, it will once we stop adding the nat extension by default. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19netfilter: ecache: reduce struct size from 32 to 24 bytesFlorian Westphal
Only "cache" needs to use ulong (its used with set_bit()), missed can use u16. Also add build-time assertion to ensure event bits fit. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19netfilter: allow early drop of assured conntracksFlorian Westphal
If insertion of a new conntrack fails because the table is full, the kernel searches the next buckets of the hash slot where the new connection was supposed to be inserted, looking for an entry that hasn't seen traffic in the reply direction (non-assured); if it finds one, that entry is dropped and the new connection entry is allocated. Allow the conntrack gc worker to also remove *assured* conntracks if resources are low. Do this by querying the l4 tracker; e.g. tcp connections are now dropped if they are no longer established (e.g. in finwait). This could be refined further, e.g. by adding a 'soft' established timeout (i.e., a timeout that is only used once we get close to resource exhaustion). Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19netfilter: conntrack: use u8 for extension sizes againFlorian Westphal
Commit 223b02d923ecd7c84cf9780bb3686f455d279279 ("netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len") had to increase the size of the extension offsets because the total size of the extensions had grown to a point where u8 overflowed. Three years later we've managed to slim the extensions down a bit, and we no longer need u16. Furthermore, we can now add a compile-time assertion for this problem. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19netfilter: remove last traces of variable-sized extensionsFlorian Westphal
Get rid of the (now unused) nf_ct_ext_add_length define and also rename the function to plain nf_ct_ext_add(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19netfilter: helpers: remove data_len usage for inkernel helpersFlorian Westphal
No need to track this for in-kernel helpers anymore, as the NF_CT_HELPER_BUILD_BUG_ON checks do this now. All in-kernel helpers know what kind of structure they store in helper->data. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19netfilter: nfnetlink_cthelper: reject too large userspace allocation requestsFlorian Westphal
Userspace should not abuse the kernel to store large amounts of data; reject requests larger than the private area can accommodate. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19netfilter: helper: add build-time asserts for helper data sizeFlorian Westphal
Add a 32-byte scratch area in the helper struct instead of relying on variable-sized helpers, plus compile-time asserts to let us know if 32 bytes aren't enough anymore. Not having variable-sized helpers will later allow us to add a BUILD_BUG_ON for the total size of conntrack extensions -- the helper extension is the only one that doesn't have a fixed size. The (useless!) NF_CT_HELPER_BUILD_BUG_ON(0); calls are added so that, in case someone adds a new helper and copy-pastes from one that doesn't store private data, there is at least some indication that this macro should be used. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
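A sketch of the scheme (the exact upstream macro body is an assumption here): helpers keep their private state in a fixed 32-byte area and assert at build time that their struct still fits.

#include <linux/bug.h>

#define NF_CT_HELPER_DATA_LEN 32   /* size of the scratch area, per this patch */

#define NF_CT_HELPER_BUILD_BUG_ON(structsize) \
    BUILD_BUG_ON((structsize) > NF_CT_HELPER_DATA_LEN)

/* in a helper that stores private data:
 *     NF_CT_HELPER_BUILD_BUG_ON(sizeof(struct my_helper_priv));
 * in one that does not (the "useless" form mentioned above):
 *     NF_CT_HELPER_BUILD_BUG_ON(0);
 */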
2017-04-19netfilter: conntrack: move helper struct to nf_conntrack_helper.hFlorian Westphal
Its definition is not needed in nf_conntrack.h. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19netfilter: nft_ct: allow to set ctnetlink event types of a connectionFlorian Westphal
By default the kernel emits all ctnetlink events for a connection. This allows selecting the types of events to generate. This can be used, e.g., to only send DESTROY events but no NEW/UPDATE ones, and will work even if sysctl net.netfilter.nf_conntrack_events is set to 0. This was already possible via iptables' CT target, but the nft version has the advantage that it can also be used with already-established conntracks. The added nf_ct_is_template() check isn't a bug fix, as we only support mark and labels (and unlike ecache the conntrack core doesn't copy those). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19clocksource: arm_arch_timer: simplify ACPI support code.Fu Wei
This patch updates the arm_arch_timer driver to use the functions provided by the new ACPI GTDT driver. This way, arm_arch_timer.c can be simplified, and all the ACPI GTDT knowledge is separated from this timer driver. Signed-off-by: Fu Wei <fu.wei@linaro.org> Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org> Tested-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
2017-04-19acpi/arm64: Add GTDT table parse driverFu Wei
This patch adds support for parsing the arch timer info in GTDT, and provides kernel APIs to parse all the PPIs and always-on info in GTDT and export them. With this driver, we can simplify the arm_arch_timer driver and separate the ACPI GTDT knowledge from it. Signed-off-by: Fu Wei <fu.wei@linaro.org> Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
2017-04-19clocksource: arm_arch_timer: split MMIO timer probing.Fu Wei
Currently the code to probe MMIO architected timers mixes DT parsing with actual poking of hardware. This makes the code harder than necessary to understand, and makes it difficult to add support for probing via ACPI. This patch splits the DT parsing from HW probing. The DT parsing now lives in arch_timer_mem_of_init(), which fills in an arch_timer_mem structure that it hands to probing functions that can be reused for ACPI support. Since the rate detection logic will be slightly different when using ACPI, the probing is performed as a number of steps. This results in more code for the moment, and some arguably redundant work, but simplifies matters considerably when ACPI support is added. Signed-off-by: Fu Wei <fu.wei@linaro.org> [Mark: refactor the probing split] Signed-off-by: Mark Rutland <mark.rutland@arm.com>
2017-04-19clocksource: arm_arch_timer: add structs to describe MMIO timerFu Wei
In preparation for ACPI GTDT support, this patch adds structs to describe the MMIO timers independent of the firmware interface. Subsequent patches will use these to split the FW/HW probing logic, so that the HW probing logic can be shared by ACPI and DT. Signed-off-by: Fu Wei <fu.wei@linaro.org> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
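A sketch of the firmware-independent MMIO timer descriptions this patch introduces; the field names and the frame limit below approximate the new structs and should be read as assumptions, not the exact upstream definitions.

#include <linux/types.h>

#define ARCH_TIMER_MEM_MAX_FRAMES 8

struct arch_timer_mem_frame {
    bool        valid;
    phys_addr_t cntbase;    /* CNTBaseN frame base address */
    size_t      size;
    int         phys_irq;
    int         virt_irq;
};

struct arch_timer_mem {
    phys_addr_t cntctlbase; /* CNTCTLBase control frame */
    size_t      size;
    struct arch_timer_mem_frame frame[ARCH_TIMER_MEM_MAX_FRAMES];
};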
2017-04-19clocksource: arm_arch_timer: move arch_timer_needs_of_probing into DT init callFu Wei
To cleanly split code paths specific to ACPI or DT at a higher level, this patch removes arch_timer_init(), folding the relevant parts of its logic into existing callers. This paves the way for further rework, and saves a few lines. Signed-off-by: Fu Wei <fu.wei@linaro.org> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> [Mark: reword commit message] Signed-off-by: Mark Rutland <mark.rutland@arm.com>
2017-04-19clocksource: arm_arch_timer: refactor arch_timer_needs_probingFu Wei
When booting with DT, it's possible for timer nodes to be probed in any order. Some common initialisation needs to occur after all nodes have been probed, and arch_timer_common_init() has code to detect when this has happened. This logic is DT-specific, and it would be best to factor it out of the common code that will be shared with ACPI. This patch folds this into the existing arch_timer_needs_probing(), which is renamed to arch_timer_needs_of_probing(), and no longer takes any arguments. This is only called when using DT, and not when using ACPI, which will have a deterministic probe order. Signed-off-by: Fu Wei <fu.wei@linaro.org> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> [Mark: reword commit message] Signed-off-by: Mark Rutland <mark.rutland@arm.com>
2017-04-19clocksource: arm_arch_timer: split dt-only rate handlingFu Wei
For historical reasons, rate detection when probing via DT is somewhat convoluted. We tried to package this up in arch_timer_detect_rate(), but the addition of ACPI made it worse, and it gets in the way of stringent rate checking when ACPI is used. This patch makes arch_timer_detect_rate() specific to DT, ripping out the ACPI logic. In preparation for rework of the MMIO timer probing, the reading of the relevant CNTFRQ register is factored out to callers. The function is then renamed to arch_timer_of_configure_rate(), which better represents its new place in the world. Comments are added in the DT and ACPI probe paths to explain this. Signed-off-by: Fu Wei <fu.wei@linaro.org> [Mark: reword commit message] Signed-off-by: Mark Rutland <mark.rutland@arm.com>
2017-04-19block: remove the osdblk driverChristoph Hellwig
This was just a proof of concept user for the SCSI OSD library, and never had any real users. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Boaz Harrosh <ooo@electrozaur.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block: Make writeback throttling defaults consistent for SQ devicesJan Kara
When CFQ is used as an elevator, it disables writeback throttling because they don't play well together. Later when a different elevator is chosen for the device, writeback throttling doesn't get enabled again as it should. Make sure CFQ enables writeback throttling (if it should be enabled by default) when we switch from it to another IO scheduler. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: split bfq-iosched.c into multiple source filesPaolo Valente
The BFQ I/O scheduler features an optimal fair-queuing (proportional-share) scheduling algorithm, enriched with several mechanisms to boost throughput and reduce latency for interactive and real-time applications. This makes BFQ a large and complex piece of code. This commit addresses this issue by splitting BFQ into three main, independent components, and by moving each component into a separate source file: 1. Main algorithm: handles the interaction with the kernel, and decides which requests to dispatch; it uses the following two further components to achieve its goals. 2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm): computes the schedule, using weights and budgets provided by the above component. 3. cgroups support: handles group operations (creation, destruction, move, ...). Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: remove all get and put of I/O contextsPaolo Valente
When a bfq queue is set in service and when it is merged, a reference to the I/O context associated with the queue is taken. This reference is then released when the queue is deselected from service or split. More precisely, the release of the reference is postponed to when the scheduler lock is released, to avoid nesting between the scheduler and the I/O-context lock. In fact, such nesting would lead to deadlocks, because of other code paths that take the same locks in the opposite order. This postponing of I/O-context releases does complicate code. This commit addresses this issue by modifying the involved operations so that they no longer need to take the above I/O-context references. It then also removes all gets and releases of these references. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: handle bursts of queue activationsArianna Avanzini
Many popular I/O-intensive services or applications spawn or reactivate many parallel threads/processes during short time intervals. Examples are systemd during boot or git grep. These services or applications benefit mostly from a high throughput: the quicker the I/O generated by their processes is cumulatively served, the sooner the target job of these services or applications gets completed. As a consequence, it is almost always counterproductive to weight-raise any of the queues associated with the processes of these services or applications: in most cases it would just lower the throughput, mainly because weight-raising also implies device idling. To address this issue, an I/O scheduler needs, first, to detect which queues are associated with these services or applications. In this respect, from the I/O-scheduler standpoint, these services or applications cause bursts of activations, i.e., activations of different queues occurring shortly after each other. However, a shorter burst of activations may also be caused by the start of an application that does not consist of a lot of parallel I/O-bound threads (see the comments on the function bfq_handle_burst for details). In view of these facts, this commit introduces: 1) a heuristic to detect (only) bursts of queue activations caused by services or applications consisting of many parallel I/O-bound threads; 2) the prevention of device idling and weight-raising for the queues belonging to these bursts. Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: boost the throughput with random I/O on NCQ-capable HDDsPaolo Valente
This patch is basically the counterpart, for NCQ-capable rotational devices, of the previous patch. Exactly as the previous patch does on flash-based devices and for any workload, this patch disables device idling on rotational devices, but only for random I/O. In fact, only with these queues disabling idling boosts the throughput on NCQ-capable rotational devices. To not break service guarantees, idling is disabled for NCQ-enabled rotational devices only when the same symmetry conditions considered in the previous patches hold. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: boost the throughput on NCQ-capable flash-based devicesPaolo Valente
This patch boosts the throughput on NCQ-capable flash-based devices, while still preserving latency guarantees for interactive and soft real-time applications. The throughput is boosted by just not idling the device when the in-service queue remains empty, even if the queue is sync and has a non-null idle window. This helps to keep the drive's internal queue full, which is necessary to achieve maximum performance. This solution to boost the throughput is a port of commits a68bbdd and f7d7b7a for CFQ. As already highlighted in a previous patch, allowing the device to prefetch and internally reorder requests trivially causes loss of control on the request service order, and hence on service guarantees. Fortunately, as discussed in detail in the comments on the function bfq_bfqq_may_idle(), if every process has to receive the same fraction of the throughput, then the service order enforced by the internal scheduler of a flash-based device is relatively close to that enforced by BFQ. In particular, it is close enough to let service guarantees be substantially preserved. Things change in an asymmetric scenario, i.e., if not every process has to receive the same fraction of the throughput. In this case, to guarantee the desired throughput distribution, the device must be prevented from prefetching requests. This is exactly what this patch does in asymmetric scenarios. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: reduce idling only in symmetric scenariosArianna Avanzini
A seeky queue (i.e., a queue containing random requests) is assigned a very small device-idling slice, for throughput issues. Unfortunately, given the process associated with a seeky queue, this behavior causes the following problem: if the process, say P, performs sync I/O and has a higher weight than some other processes doing I/O and associated with non-seeky queues, then BFQ may fail to guarantee to P its reserved share of the throughput. The reason is that idling is key for providing service guarantees to processes doing sync I/O [1]. This commit addresses this issue by allowing the device-idling slice to be reduced for a seeky queue only if the scenario happens to be symmetric, i.e., if all the queues are to receive the same share of the throughput. [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O Scheduler", Proceedings of the First Workshop on Mobile System Technologies (MST-2015), May 2015. http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Riccardo Pizzetti <riccardo.pizzetti@gmail.com> Signed-off-by: Samuele Zecchini <samuele.zecchini92@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: add Early Queue Merge (EQM)Arianna Avanzini
A set of processes may happen to perform interleaved reads, i.e., read requests whose union would give rise to a sequential read pattern. There are two typical cases: first, processes reading fixed-size chunks of data at a fixed distance from each other; second, processes reading variable-size chunks at variable distances. The latter case occurs for example with QEMU, which splits the I/O generated by a guest into multiple chunks, and lets these chunks be served by a pool of I/O threads, iteratively assigning the next chunk of I/O to the first available thread. CFQ denotes as 'cooperating' a set of processes that are doing interleaved I/O, and when it detects cooperating processes, it merges their queues to obtain a sequential I/O pattern from the union of their I/O requests, and hence boost the throughput. Unfortunately, in the following frequent case, the mechanism implemented in CFQ for detecting cooperating processes and merging their queues is not responsive enough to handle also the fluctuating I/O pattern of the second type of processes. Suppose that one process of the second type issues a request close to the next request to serve of another process of the same type. At that time the two processes would be considered as cooperating. But, if the request issued by the first process is to be merged with some other already-queued request, then, from the moment at which this request arrives, to the moment when CFQ controls whether the two processes are cooperating, the two processes are likely to be already doing I/O in distant zones of the disk surface or device memory. CFQ uses however preemption to get a sequential read pattern out of the read requests performed by the second type of processes too. As a consequence, CFQ uses two different mechanisms to achieve the same goal: boosting the throughput with interleaved I/O. This patch introduces Early Queue Merge (EQM), a unified mechanism to get a sequential read pattern with both types of processes. The main idea is to immediately check whether a newly-arrived request lets some pair of processes become cooperating, both in the case of actual request insertion and, to be responsive with the second type of processes, in the case of request merge. Both types of processes are then handled by just merging their queues. Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: reduce latency during request-pool saturationPaolo Valente
This patch introduces a heuristic that reduces latency when the I/O-request pool is saturated. This goal is achieved by disabling device idling, for non-weight-raised queues, when there are weight-raised queues with pending or in-flight requests. In fact, as explained in more detail in the comment on the function bfq_bfqq_may_idle(), this reduces the rate at which processes associated with non-weight-raised queues grab requests from the pool, thereby increasing the probability that processes associated with weight-raised queues get a request immediately (or at least soon) when they need one. Along the same line, if there are weight-raised queues, then this patch halves the service rate of async (write) requests for non-weight-raised queues. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: preserve a low latency also with NCQ-capable drivesPaolo Valente
I/O schedulers typically allow NCQ-capable drives to prefetch I/O requests, as NCQ boosts the throughput exactly by prefetching and internally reordering requests. Unfortunately, as discussed in detail and shown experimentally in [1], this may cause fairness and latency guarantees to be violated. The main problem is that the internal scheduler of an NCQ-capable drive may postpone the service of some unlucky (prefetched) requests as long as it deems serving other requests more appropriate to boost the throughput. This patch addresses this issue by not disabling device idling for weight-raised queues, even if the device supports NCQ. This allows BFQ to start serving a new queue, and therefore allows the drive to prefetch new requests, only after the idling timeout expires. At that time, all the outstanding requests of the expired queue have been most certainly served. [1] P. Valente and M. Andreolini, "Improving Application Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR '12), June 2012. Slightly extended version: http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite- results.pdf Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: reduce I/O latency for soft real-time applicationsPaolo Valente
To guarantee a low latency also to the I/O requests issued by soft real-time applications, this patch introduces a further heuristic, which weight-raises (in the sense explained in the previous patch) also the queues associated to applications deemed as soft real-time. To be deemed as soft real-time, an application must meet two requirements. First, the application must not require an average bandwidth higher than the approximate bandwidth required to playback or record a compressed high-definition video. Second, the request pattern of the application must be isochronous, i.e., after issuing a request or a batch of requests, the application must stop issuing new requests until all its pending requests have been completed. After that, the application may issue a new batch, and so on. As for the second requirement, it is critical to require also that, after all the pending requests of the application have been completed, an adequate minimum amount of time elapses before the application starts issuing new requests. This prevents also greedy (i.e., I/O-bound) applications from being incorrectly deemed, occasionally, as soft real-time. In fact, if *any amount of time* is fine, then even a greedy application may, paradoxically, meet both the above requirements, if: (1) the application performs random I/O and/or the device is slow, and (2) the CPU load is high. The reason is the following. First, if condition (1) is true, then, during the service of the application, the throughput may be low enough to let the application meet the bandwidth requirement. Second, if condition (2) is true as well, then the application may occasionally behave in an apparently isochronous way, because it may simply stop issuing requests while the CPUs are busy serving other processes. To address this issue, the heuristic leverages the simple fact that greedy applications issue *all* their requests as quickly as they can, whereas soft real-time applications spend some time processing data after each batch of requests is completed. In particular, the heuristic works as follows. First, according to the above isochrony requirement, the heuristic checks whether an application may be soft real-time, thereby giving to the application the opportunity to be deemed as such, only when both the following two conditions happen to hold: 1) the queue associated with the application has expired and is empty, 2) there is no outstanding request of the application. Suppose that both conditions hold at time, say, t_c and that the application issues its next request at time, say, t_i. At time t_c the heuristic computes the next time instant, called soft_rt_next_start in the code, such that, only if t_i >= soft_rt_next_start, then both the next conditions will hold when the application issues its next request: 1) the application will meet the above bandwidth requirement, 2) a given minimum time interval, say Delta, will have elapsed from time t_c (so as to filter out greedy application). The current value of Delta is a little bit higher than the value that we have found, experimentally, to be adequate on a real, general-purpose machine. In particular we had to increase Delta to make the filter quite precise also in slower, embedded systems, and in KVM/QEMU virtual machines (details in the comments on the code). If the application actually issues its next request after time soft_rt_next_start, then its associated queue will be weight-raised for a relatively short time interval. 
If, during this time interval, the application proves again to meet the bandwidth and isochrony requirements, then the end of the weight-raising period for the queue is moved forward, and so on. Note that an application whose associated queue never happens to be empty when it expires will never have the opportunity to be deemed as soft real-time. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: improve responsivenessPaolo Valente
This patch introduces a simple heuristic to load applications quickly, and to perform the I/O requested by interactive applications just as quickly. To this purpose, both a newly-created queue and a queue associated with an interactive application (we explain in a moment how BFQ decides whether the associated application is interactive), receive the following two special treatments: 1) The weight of the queue is raised. 2) The queue unconditionally enjoys device idling when it empties; in fact, if the requests of a queue are sync, then performing device idling for the queue is a necessary condition to guarantee that the queue receives a fraction of the throughput proportional to its weight (see [1] for details). For brevity, we call just weight-raising the combination of these two preferential treatments. For a newly-created queue, weight-raising starts immediately and lasts for a time interval that: 1) depends on the device speed and type (rotational or non-rotational), and 2) is equal to the time needed to load (start up) a large-size application on that device, with cold caches and with no additional workload. Finally, as for guaranteeing a fast execution to interactive, I/O-related tasks (such as opening a file), consider that any interactive application blocks and waits for user input both after starting up and after executing some task. After a while, the user may trigger new operations, after which the application stops again, and so on. Accordingly, the low-latency heuristic weight-raises again a queue in case it becomes backlogged after being idle for a sufficiently long (configurable) time. The weight-raising then lasts for the same time as for a just-created queue. According to our experiments, the combination of this low-latency heuristic and of the improvements described in the previous patch allows BFQ to guarantee a high application responsiveness. [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O Scheduler", Proceedings of the First Workshop on Mobile System Technologies (MST-2015), May 2015. http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: add more fairness with writes and slow processesPaolo Valente
This patch deals with two sources of unfairness, which can also cause high latencies and throughput loss. The first source is related to write requests. Write requests tend to starve read requests, basically because, on one side, writes are slower than reads, whereas, on the other side, storage devices confuse schedulers by deceptively signaling the completion of write requests immediately after receiving them. This patch addresses this issue by just throttling writes. In particular, after a write request is dispatched for a queue, the budget of the queue is decremented by the number of sectors to write, multiplied by an (over)charge coefficient. The value of the coefficient is the result of our tuning with different devices. The second source of unfairness has to do with slowness detection: when the in-service queue is expired, BFQ also controls whether the queue has been "too slow", i.e., has consumed its last-assigned budget at such a low rate that it would have been impossible to consume all of this budget within the maximum time slice T_max (Subsec. 3.5 in [1]). In this case, the queue is always (over)charged the whole budget, to reduce its utilization of the device. Both this overcharge and the slowness-detection criterion may cause unfairness. First, always charging a full budget to a slow queue is too coarse. It is much more accurate, and this patch lets BFQ do so, to charge an amount of service 'equivalent' to the amount of time during which the queue has been in service. As explained in more detail in the comments on the code, this enables BFQ to provide time fairness among slow queues. Secondly, because of ZBR, a queue may be deemed as slow when its associated process is performing I/O on the slowest zones of a disk. However, unless the process is truly too slow, not reducing the disk utilization of the queue is more profitable in terms of disk throughput than the opposite. A similar problem is caused by logical block mapping on non-rotational devices. For this reason, this patch lets a queue be charged time, and not budget, only if the queue has consumed less than 2/3 of its assigned budget. As an additional, important benefit, this tolerance allows BFQ to preserve enough elasticity to still perform bandwidth, and not time, distribution with little unlucky or quasi-sequential processes. Finally, for the same reasons as above, this patch makes slowness detection itself much less harsh: a queue is deemed slow only if it has consumed its budget at less than half of the peak rate. [1] P. Valente and M. Andreolini, "Improving Application Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR '12), June 2012. Slightly extended version: http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite- results.pdf Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: modify the peak-rate estimatorPaolo Valente
Unless the maximum budget B_max that BFQ can assign to a queue is set explicitly by the user, BFQ automatically updates B_max. In particular, BFQ dynamically sets B_max to the number of sectors that can be read, at the current estimated peak rate, during the maximum time, T_max, allowed before a budget timeout occurs. In formulas, if we denote as R_est the estimated peak rate, then B_max = T_max ∗ R_est. Hence, the higher R_est is with respect to the actual device peak rate, the higher the probability that processes incur budget timeouts unjustly is. Besides, a too high value of B_max unnecessarily increases the deviation from an ideal, smooth service. Unfortunately, it is not trivial to estimate the peak rate correctly: because of the presence of sw and hw queues between the scheduler and the device components that finally serve I/O requests, it is hard to say exactly when a given dispatched request is served inside the device, and for how long. As a consequence, it is hard to know precisely at what rate a given set of requests is actually served by the device. On the opposite end, the dispatch time of any request is trivially available, and, from this piece of information, the "dispatch rate" of requests can be immediately computed. So, the idea in the next function is to use what is known, namely request dispatch times (plus, when useful, request completion times), to estimate what is unknown, namely in-device request service rate. The main issue is that, because of the above facts, the rate at which a certain set of requests is dispatched over a certain time interval can vary greatly with respect to the rate at which the same requests are then served. But, since the size of any intermediate queue is limited, and the service scheme is lossless (no request is silently dropped), the following obvious convergence property holds: the number of requests dispatched MUST become closer and closer to the number of requests completed as the observation interval grows. This is the key property used in this new version of the peak-rate estimator. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
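A toy user-space sketch of the convergence idea only (this is not BFQ's actual estimator; all names are illustrative): a dispatch-rate sample is folded into the peak-rate estimate only once completions have caught up with dispatches over the observation window, so the dispatch rate approximates the unknown in-device service rate, and B_max then follows as T_max * R_est.

#include <stdint.h>

struct rate_sample {
    uint64_t sectors;      /* sectors dispatched in the window */
    uint64_t dispatched;   /* requests dispatched in the window */
    uint64_t completed;    /* requests completed in the window */
    uint64_t duration_us;  /* window length in microseconds */
};

static uint64_t update_peak_rate(uint64_t rate_est, const struct rate_sample *s)
{
    /* queues between scheduler and device are bounded and lossless, so
     * dispatched and completed must converge as the window grows; skip
     * windows where they have not converged yet */
    if (s->duration_us == 0 || s->completed + 4 < s->dispatched)
        return rate_est;

    uint64_t rate = s->sectors * 1000000ULL / s->duration_us;  /* sectors/s */

    /* simple exponential moving average of the estimate */
    return rate_est ? (3 * rate_est + rate) / 4 : rate;
}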