linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2018-03-18	bcache: fix using of loop variable in memory shrink	Tang Junhui
	In bch_mca_scan(), There are some confusion and logical error in the use of loop variables. In this patch, we clarify them as: 1) nr: the number of btree nodes needs to scan, which will decrease after we scan a btree node, and should not be less than 0; 2) i: the number of btree nodes have scanned, includes both btree_cache_freeable and btree_cache, which should not be bigger than btree_cache_used; 3) freed: the number of btree nodes have freed. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18	bcache: fix error return value in memory shrink	Tang Junhui
	In bch_mca_scan(), the return value should not be the number of freed btree nodes, but the number of pages of freed btree nodes. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18	bcache: fix incorrect sysfs output value of strip size	Tang Junhui
	Stripe size is shown as zero when no strip in back end device: [root@ceph132 ~]# cat /sys/block/sdd/bcache/stripe_size 0.0k Actually it should be 1T Bytes (1 << 31 sectors), but in sysfs interface, stripe_size was changed from sectors to bytes, and move 9 bits left, so the 32 bits variable overflows. This patch change the variable to a 64 bits type before moving bits. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18	bcache: fix inaccurate io state for detached bcache devices	Tang Junhui
	When we run IO in a detached device, and run iostat to shows IO status, normally it will show like bellow (Omitted some fields): Device: ... avgrq-sz avgqu-sz await r_await w_await svctm %util sdd ... 15.89 0.53 1.82 0.20 2.23 1.81 52.30 bcache0 ... 15.89 115.42 0.00 0.00 0.00 2.40 69.60 but after IO stopped, there are still very big avgqu-sz and %util values as bellow: Device: ... avgrq-sz avgqu-sz await r_await w_await svctm %util bcache0 ... 0 5326.32 0.00 0.00 0.00 0.00 100.10 The reason for this issue is that, only generic_start_io_acct() called and no generic_end_io_acct() called for detached device in cached_dev_make_request(). See the code: //start generic_start_io_acct() generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0); if (cached_dev_get(dc)) { //will callback generic_end_io_acct() } else { //will not call generic_end_io_acct() } This patch calls generic_end_io_acct() in the end of IO for detached devices, so we can show IO state correctly. (Modified to use GFP_NOIO in kzalloc() by Coly Li) Changelog: v2: fix typo. v1: the initial version. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18	bcache: add stop_when_cache_set_failed option to backing device	Coly Li
	When there are too many I/O errors on cache device, current bcache code will retire the whole cache set, and detach all bcache devices. But the detached bcache devices are not stopped, which is problematic when bcache is in writeback mode. If the retired cache set has dirty data of backing devices, continue writing to bcache device will write to backing device directly. If the LBA of write request has a dirty version cached on cache device, next time when the cache device is re-registered and backing device re-attached to it again, the stale dirty data on cache device will be written to backing device, and overwrite latest directly written data. This situation causes a quite data corruption. But we cannot simply stop all attached bcache devices when the cache set is broken or disconnected. For example, use bcache to accelerate performance of an email service. In such workload, if cache device is broken but no dirty data lost, keep the bcache device alive and permit email service continue to access user data might be a better solution for the cache device failure. Nix <nix@esperi.org.uk> points out the issue and provides the above example to explain why it might be necessary to not stop bcache device for broken cache device. Pavel Goran <via-bcache@pvgoran.name> provides a brilliant suggestion to provide "always" and "auto" options to per-cached device sysfs file stop_when_cache_set_failed. If cache set is retiring and the backing device has no dirty data on cache, it should be safe to keep the bcache device alive. In this case, if stop_when_cache_set_failed is set to "auto", the device failure handling code will not stop this bcache device and permit application to access the backing device with a unattached bcache device. Changelog: [mlyle: edited to not break string constants across lines] v3: fix typos pointed out by Nix. v2: change option values of stop_when_cache_set_failed from 1/0 to "auto"/"always". v1: initial version, stop_when_cache_set_failed can be 0 (not stop) or 1 (always stop). Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Cc: Nix <nix@esperi.org.uk> Cc: Pavel Goran <via-bcache@pvgoran.name> Cc: Junhui Tang <tang.junhui@zte.com.cn> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18	bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags	Coly Li
	When too many I/Os failed on cache device, bch_cache_set_error() is called in the error handling code path to retire whole problematic cache set. If new I/O requests continue to come and take refcount dc->count, the cache set won't be retired immediately, this is a problem. Further more, there are several kernel thread and self-armed kernel work may still running after bch_cache_set_error() is called. It needs to wait quite a while for them to stop, or they won't stop at all. They also prevent the cache set from being retired. The solution in this patch is, to add per cache set flag to disable I/O request on this cache and all attached backing devices. Then new coming I/O requests can be rejected in *_make_request() before taking refcount, kernel threads and self-armed kernel worker can stop very fast when flags bit CACHE_SET_IO_DISABLE is set. Because bcache also do internal I/Os for writeback, garbage collection, bucket allocation, journaling, this kind of I/O should be disabled after bch_cache_set_error() is called. So closure_bio_submit() is modified to check whether CACHE_SET_IO_DISABLE is set on cache_set->flags. If set, closure_bio_submit() will set bio->bi_status to BLK_STS_IOERR and return, generic_make_request() won't be called. A sysfs interface is also added to set or clear CACHE_SET_IO_DISABLE bit from cache_set->flags, to disable or enable cache set I/O for debugging. It is helpful to trigger more corner case issues for failed cache device. Changelog v4, add wait_for_kthread_stop(), and call it before exits writeback and gc kernel threads. v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is bit index. remove "bcache: " prefix when printing out kernel message. v2, more changes by previous review, - Use CACHE_SET_IO_DISABLE of cache_set->flags, suggested by Junhui. - Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop, this is reported and inspired from origal patch of Pavel Vazharov. v1, initial version. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Junhui Tang <tang.junhui@zte.com.cn> Cc: Michael Lyle <mlyle@lyle.org> Cc: Pavel Vazharov <freakpv@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18	bcache: stop dc->writeback_rate_update properly	Coly Li
	struct delayed_work writeback_rate_update in struct cache_dev is a delayed worker to call function update_writeback_rate() in period (the interval is defined by dc->writeback_rate_update_seconds). When a metadate I/O error happens on cache device, bcache error handling routine bch_cache_set_error() will call bch_cache_set_unregister() to retire whole cache set. On the unregister code path, this delayed work is stopped by calling cancel_delayed_work_sync(&dc->writeback_rate_update). dc->writeback_rate_update is a special delayed work from others in bcache. In its routine update_writeback_rate(), this delayed work is re-armed itself. That means when cancel_delayed_work_sync() returns, this delayed work can still be executed after several seconds defined by dc->writeback_rate_update_seconds. The problem is, after cancel_delayed_work_sync() returns, the cache set unregister code path will continue and release memory of struct cache set. Then the delayed work is scheduled to run, __update_writeback_rate() will reference the already released cache_set memory, and trigger a NULL pointer deference fault. This patch introduces two more bcache device flags, - BCACHE_DEV_WB_RUNNING bit set: bcache device is in writeback mode and running, it is OK for dc->writeback_rate_update to re-arm itself. bit clear:bcache device is trying to stop dc->writeback_rate_update, this delayed work should not re-arm itself and quit. - BCACHE_DEV_RATE_DW_RUNNING bit set: routine update_writeback_rate() is executing. bit clear: routine update_writeback_rate() quits. This patch also adds a function cancel_writeback_rate_update_dwork() to wait for dc->writeback_rate_update quits before cancel it by calling cancel_delayed_work_sync(). In order to avoid a deadlock by unexpected quit dc->writeback_rate_update, after time_out seconds this function will give up and continue to call cancel_delayed_work_sync(). And here I explain how this patch stops self re-armed delayed work properly with the above stuffs. update_writeback_rate() sets BCACHE_DEV_RATE_DW_RUNNING at its beginning and clears BCACHE_DEV_RATE_DW_RUNNING at its end. Before calling cancel_writeback_rate_update_dwork() clear flag BCACHE_DEV_WB_RUNNING. Before calling cancel_delayed_work_sync() wait utill flag BCACHE_DEV_RATE_DW_RUNNING is clear. So when calling cancel_delayed_work_sync(), dc->writeback_rate_update must be already re- armed, or quite by seeing BCACHE_DEV_WB_RUNNING cleared. In both cases delayed work routine update_writeback_rate() won't be executed after cancel_delayed_work_sync() returns. Inside update_writeback_rate() before calling schedule_delayed_work(), flag BCACHE_DEV_WB_RUNNING is checked before. If this flag is cleared, it means someone is about to stop the delayed work. Because flag BCACHE_DEV_RATE_DW_RUNNING is set already and cancel_delayed_work_sync() has to wait for this flag to be cleared, we don't need to worry about race condition here. If update_writeback_rate() is scheduled to run after checking BCACHE_DEV_RATE_DW_RUNNING and before calling cancel_delayed_work_sync() in cancel_writeback_rate_update_dwork(), it is also safe. Because at this moment BCACHE_DEV_WB_RUNNING is cleared with memory barrier. As I mentioned previously, update_writeback_rate() will see BCACHE_DEV_WB_RUNNING is clear and quit immediately. Because there are more dependences inside update_writeback_rate() to struct cache_set memory, dc->writeback_rate_update is not a simple self re-arm delayed work. After trying many different methods (e.g. hold dc->count, or use locks), this is the only way I can find which works to properly stop dc->writeback_rate_update delayed work. Changelog: v3: change values of BCACHE_DEV_WB_RUNNING and BCACHE_DEV_RATE_DW_RUNNING to bit index, for test_bit(). v2: Try to fix the race issue which is pointed out by Junhui. v1: The initial version for review Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Junhui Tang <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Michael Lyle <mlyle@lyle.org> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18	bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set	Coly Li
	In patch "bcache: fix cached_dev->count usage for bch_cache_set_error()", cached_dev_get() is called when creating dc->writeback_thread, and cached_dev_put() is called when exiting dc->writeback_thread. This modification works well unless people detach the bcache device manually by 'echo 1 > /sys/block/bcache<N>/bcache/detach' Because this sysfs interface only calls bch_cached_dev_detach() which wakes up dc->writeback_thread but does not stop it. The reason is, before patch "bcache: fix cached_dev->count usage for bch_cache_set_error()", inside bch_writeback_thread(), if cache is not dirty after writeback, cached_dev_put() will be called here. And in cached_dev_make_request() when a new write request makes cache from clean to dirty, cached_dev_get() will be called there. Since we don't operate dc->count in these locations, refcount d->count cannot be dropped after cache becomes clean, and cached_dev_detach_finish() won't be called to detach bcache device. This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is set inside bch_writeback_thread(). If this bit is set and cache is clean (no existing writeback_keys), break the while-loop, call cached_dev_put() and quit the writeback thread. Please note if cache is still dirty, even BCACHE_DEV_DETACHING is set the writeback thread should continue to perform writeback, this is the original design of manually detach. It is safe to do the following check without locking, let me explain why, + if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) && + (!atomic_read(&dc->has_dirty) \|\| !dc->writeback_running)) { If the kenrel thread does not sleep and continue to run due to conditions are not updated in time on the running CPU core, it just consumes more CPU cycles and has no hurt. This should-sleep-but-run is safe here. We just focus on the should-run-but-sleep condition, which means the writeback thread goes to sleep in mistake while it should continue to run. 1, First of all, no matter the writeback thread is hung or not, kthread_stop() from cached_dev_detach_finish() will wake up it and terminate by making kthread_should_stop() return true. And in normal run time, bit on index BCACHE_DEV_DETACHING is always cleared, the condition !test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) is always true and can be ignored as constant value. 2, If one of the following conditions is true, the writeback thread should go to sleep, "!atomic_read(&dc->has_dirty)" or "!dc->writeback_running)" each of them independently controls the writeback thread should sleep or not, let's analyse them one by one. 2.1 condition "!atomic_read(&dc->has_dirty)" If dc->has_dirty is set from 0 to 1 on another CPU core, bcache will call bch_writeback_queue() immediately or call bch_writeback_add() which indirectly calls bch_writeback_queue() too. In bch_writeback_queue(), wake_up_process(dc->writeback_thread) is called. It sets writeback thread's task state to TASK_RUNNING and following an implicit memory barrier, then tries to wake up the writeback thread. In writeback thread, its task state is set to TASK_INTERRUPTIBLE before doing the condition check. If other CPU core sets the TASK_RUNNING state after writeback thread setting TASK_INTERRUPTIBLE, the writeback thread will be scheduled to run very soon because its state is not TASK_INTERRUPTIBLE. If other CPU core sets the TASK_RUNNING state before writeback thread setting TASK_INTERRUPTIBLE, the implict memory barrier of wake_up_process() will make sure modification of dc->has_dirty on other CPU core is updated and observed on the CPU core of writeback thread. Therefore the condition check will correctly be false, and continue writeback code without sleeping. 2.2 condition "!dc->writeback_running)" dc->writeback_running can be changed via sysfs file, every time it is modified, a following bch_writeback_queue() is alwasy called. So the change is always observed on the CPU core of writeback thread. If dc->writeback_running is changed from 0 to 1 on other CPU core, this condition check will observe the modification and allow writeback thread to continue to run without sleeping. Now we can see, even without a locking protection, multiple conditions check is safe here, no deadlock or process hang up will happen. I compose a separte patch because that patch "bcache: fix cached_dev->count usage for bch_cache_set_error()" already gets a "Reviewed-by:" from Hannes Reinecke. Also this fix is not trivial and good for a separate patch. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Huijun Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18	bcache: fix cached_dev->count usage for bch_cache_set_error()	Coly Li
	When bcache metadata I/O fails, bcache will call bch_cache_set_error() to retire the whole cache set. The expected behavior to retire a cache set is to unregister the cache set, and unregister all backing device attached to this cache set, then remove sysfs entries of the cache set and all attached backing devices, finally release memory of structs cache_set, cache, cached_dev and bcache_device. In my testing when journal I/O failure triggered by disconnected cache device, sometimes the cache set cannot be retired, and its sysfs entry /sys/fs/bcache/<uuid> still exits and the backing device also references it. This is not expected behavior. When metadata I/O failes, the call senquence to retire whole cache set is, bch_cache_set_error() bch_cache_set_unregister() bch_cache_set_stop() __cache_set_unregister() <- called as callback by calling clousre_queue(&c->caching) cache_set_flush() <- called as a callback when refcount of cache_set->caching is 0 cache_set_free() <- called as a callback when refcount of catch_set->cl is 0 bch_cache_set_release() <- called as a callback when refcount of catch_set->kobj is 0 I find if kernel thread bch_writeback_thread() quits while-loop when kthread_should_stop() is true and searched_full_index is false, clousre callback cache_set_flush() set by continue_at() will never be called. The result is, bcache fails to retire whole cache set. cache_set_flush() will be called when refcount of closure c->caching is 0, and in function bcache_device_detach() refcount of closure c->caching is released to 0 by clousre_put(). In metadata error code path, function bcache_device_detach() is called by cached_dev_detach_finish(). This is a callback routine being called when cached_dev->count is 0. This refcount is decreased by cached_dev_put(). The above dependence indicates, cache_set_flush() will be called when refcount of cache_set->cl is 0, and refcount of cache_set->cl to be 0 when refcount of cache_dev->count is 0. The reason why sometimes cache_dev->count is not 0 (when metadata I/O fails and bch_cache_set_error() called) is, in bch_writeback_thread(), refcount of cache_dev is not decreased properly. In bch_writeback_thread(), cached_dev_put() is called only when searched_full_index is true and cached_dev->writeback_keys is empty, a.k.a there is no dirty data on cache. In most of run time it is correct, but when bch_writeback_thread() quits the while-loop while cache is still dirty, current code forget to call cached_dev_put() before this kernel thread exits. This is why sometimes cache_set_flush() is not executed and cache set fails to be retired. The reason to call cached_dev_put() in bch_writeback_rate() is, when the cache device changes from clean to dirty, cached_dev_get() is called, to make sure during writeback operatiions both backing and cache devices won't be released. Adding following code in bch_writeback_thread() does not work, static int bch_writeback_thread(void *arg) } + if (atomic_read(&dc->has_dirty)) + cached_dev_put() + return 0; } because writeback kernel thread can be waken up and start via sysfs entry: echo 1 > /sys/block/bcache<N>/bcache/writeback_running It is difficult to check whether backing device is dirty without race and extra lock. So the above modification will introduce potential refcount underflow in some conditions. The correct fix is, to take cached dev refcount when creating the kernel thread, and put it before the kernel thread exits. Then bcache does not need to take a cached dev refcount when cache turns from clean to dirty, or to put a cached dev refcount when cache turns from ditry to clean. The writeback kernel thread is alwasy safe to reference data structure from cache set, cache and cached device (because a refcount of cache device is taken for it already), and no matter the kernel thread is stopped by I/O errors or system reboot, cached_dev->count can always be used correctly. The patch is simple, but understanding how it works is quite complicated. Changelog: v2: set dc->writeback_thread to NULL in this patch, as suggested by Hannes. v1: initial version for review. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Michael Lyle <mlyle@lyle.org> Cc: Junhui Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-19	soc: mediatek: update power domain data of MT2712	weiyi.lu@mediatek.com
	1. split MFG power domain into MFG/MFG_SC1/MFG_SC2/MFG_SC3 according to MT2712 ECO design change 2. add subdomain support for MT2712 Signed-off-by: Weiyi Lu <weiyi.lu@mediatek.com> Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>
2018-03-18	spi: jcore: disable ref_clk after getting its rate	Alexey Khoroshilov
	The driver does not disable ref_clk on remove. According to the comment, the only reason to enable the clock is to get its rate. So, it should be safe to disable clk just after that. By the way, clk_prepare_enable() looks to be more appropriate than clk_enable() here. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: Mark Brown <broonie@kernel.org>
2018-03-19	gpio: ks8695: Include the right header	Linus Walleij
	This driver is a pure GPIO driver and should only include <linux/gpio/driver.h>. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-03-19	gpio: kempld: Include the right header	Linus Walleij
	This driver is a pure GPIO driver and should only include <linux/gpio/driver.h>. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-03-19	gpio: janz-ttl: Use BIT() macro	Linus Walleij
	This makes the code more readable by using the BIT() macro. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-03-19	gpio: janz-ttl: Include the right header	Linus Walleij
	This driver is a pure GPIO driver and should only include <linux/gpio/driver.h>. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-03-19	gpio: it87: Include the right header	Linus Walleij
	This driver is a pure GPIO driver and should only include <linux/gpio/driver.h>. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-03-19	gpio: ich: Use BIT() macro	Linus Walleij
	Using BIT() makes (1 << foo) constructions easier to read, and also account for common mistakes where bit 31 is not working because of numbers being interpreted as negative unless specified as unsigned. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-03-19	gpio: ich: Include the right header	Linus Walleij
	This driver is a pure GPIO driver and should only include <linux/gpio/driver.h>. Refrain from using GPIOF_* flags in the driver, just use 1/0 to return direction. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-03-19	gpio: htc-gpio: Include the right header	Linus Walleij
	This driver is a pure GPIO driver and should only include <linux/gpio/driver.h>. Drop the include of <linux/gpio.h> from the platform data header as well, it serves no purpose. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-03-19	gpio: grgpio: Include the right header	Linus Walleij
	This driver is a pure GPIO driver and should only include <linux/gpio/driver.h>. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-03-19	gpio: ftgpio010: Drop of_gpio.h include	Linus Walleij
	This driver does not make use of the functions in <linux/of_gpio.h> so drop this include. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-03-19	gpio: em: Use the right include	Linus Walleij
	The Emma Mobile (EM) GPIO driver uses the too generic include <linux/gpio.h>. It is a driver so it should just use <linux/gpio/driver.h>. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-03-19	gpio: ge: Drop of_gpio.h include	Linus Walleij
	This driver does not make use of the functions in <linux/of_gpio.h> so drop this include. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-03-19	gpio: dln2: Include proper header	Linus Walleij
	This driver has no business including <linux/gpio.h>, it is a driver so include <linux/gpio/driver.h>. GPIOF_DIR_IN/GPIOF_DIR_OUT are for consumers and should not be used in drivers to use just 1/0 instead. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2018-03-19	cpufreq: powernv: Don't validate the frequency table twice	Viresh Kumar
	The cpufreq core is already validating the CPU frequency table after calling the ->init() callback of the cpufreq drivers and the drivers don't need to do the same anymore. Though they need to set the policy->freq_table field directly from the ->init() callback now. Stop validating the frequency table from powernv driver. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-19	powercap: RAPL: Add support for Cannon Lake	Joe Konno
	RAPL MSRs and handling for Cannon Lake are similar to Sky Lake and Kaby Lake. Signed-off-by: Joe Konno <joe.konno@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPI / PM: Reduce LPI constraints logging noise	Rafael J. Wysocki
	If a device referred to by ACPI LPI constrains (coming from function 1 of the Low Power S0 Idle _DSM interface) is not power-manageable via ACPI (no _PS0 method and no power resources), the code generating diagnostic information for the LPI constraints will print a message about that to the kernel log on every system suspend-resume cycle (possibly for multiple times). That is not very useful and noisy, so modify that code to disregard the LPI list entries corresponding to the devices that are not power- manageable after printing that information for them once. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
2018-03-18	ACPI / Kconfig: Update ACPI_PROCFS_POWER help text	Randy Dunlap
	Fix grammar and punctuation (end sentences with a period) in the Kconfig help text for ACPI_PROCFS_POWER. I was looking at this since it appears to be going away (again, some day) and I have a working script that uses this info to tell me battery usage. I can update the script to use /sys/class/power_supply (in theory) but the contents (with units) should be documented in Documentation/ABI/ before /proc/acpi/battery/ is removed (IMO). Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPI / OSI: Add OEM _OSI strings to disable NVidia RTD3	Alex Hung
	A number of Dell systems require an OEM _OSI string "Linux-Dell-Video" as a BIOS workaround to disable RTD3 which causes systems hangs when NVidia graphics cards are installed. The affected Dell systems are with system IDs: 0818, 0819, 0820, 0850, 0851, 086F, 0870, 0885 and 0886. The form of the OEM _OSI strings is defined by each OEMs and is discussed in Documentation/acpi/osi.txt. Signed-off-by: Alex Hung <alex.hung@canonical.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	net: dsa: mv88e6xxx: Add MDIO interrupts for internal PHYs	Andrew Lunn
	When registering an MDIO bus, it is possible to pass an array of interrupts, one per address on the bus. phylib will then associate the interrupt to the PHY device, if no other interrupt is provided. Some of the global2 interrupts are PHY interrupts. Place them into the MDIO bus structure. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18	net: dsa: mv88e6xxx: Add number of internal PHYs	Andrew Lunn
	Add to the info structure the number of internal PHYs, if they generate interrupts. Some of the older generations of switches have internal PHYs, but no interrupt registers. In this case, set the count to zero. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18	net: dsa: mv88e6xxx: Add missing g1 IRQ numbers	Andrew Lunn
	With the recent change to polling for interrupts, it is important that the number of global 1 interrupts is listed. Without it, the driver requests an interrupt domain for zero interrupts, which returns EINVAL, and the probe fails. Add two missing entries. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18	net: dsa: mv88e6xxx: Fix missing register lock in serdes_get_stats	Florian Fainelli
	We can hit the register lock not held assertion with the following path: [ 34.170631] mv88e6085 0.1:00: Switch registers lock not held! [ 34.176510] CPU: 0 PID: 950 Comm: ethtool Not tainted 4.16.0-rc4 #143 [ 34.182985] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree) [ 34.189519] Backtrace: [ 34.192033] [<8010c4b4>] (dump_backtrace) from [<8010c788>] (show_stack+0x20/0x24) [ 34.199680] r6:9f5dc010 r5:00000011 r4:9f5dc010 r3:00000000 [ 34.205434] [<8010c768>] (show_stack) from [<80679d38>] (dump_stack+0x24/0x28) [ 34.212719] [<80679d14>] (dump_stack) from [<804844a8>] (mv88e6xxx_read+0x70/0x7c) [ 34.220376] [<80484438>] (mv88e6xxx_read) from [<804870dc>] (mv88e6xxx_port_get_cmode+0x34/0x4c) [ 34.229257] r5:a09cd128 r4:9ee31d07 [ 34.232880] [<804870a8>] (mv88e6xxx_port_get_cmode) from [<80487e6c>] (mv88e6352_port_has_serdes+0x24/0x64) [ 34.242690] r4:9f5dc010 [ 34.245309] [<80487e48>] (mv88e6352_port_has_serdes) from [<804880b8>] (mv88e6352_serdes_get_stats+0x28/0x12c) [ 34.255389] r4:00000001 [ 34.257973] [<80488090>] (mv88e6352_serdes_get_stats) from [<804811e8>] (mv88e6xxx_get_ethtool_stats+0xb0/0xc0) [ 34.268156] r10:00000000 r9:00000000 r8:00000000 r7:a09cd020 r6:00000001 r5:9f5dc01c [ 34.276052] r4:9f5dc010 [ 34.278631] [<80481138>] (mv88e6xxx_get_ethtool_stats) from [<8064f740>] (dsa_slave_get_ethtool_stats+0xbc/0xc4) mv88e6xxx_get_ethtool_stats() calls mv88e6xxx_get_stats() which calls both chip->info->ops->stats_get_stats(), which holds the register lock, and chip->info->ops->serdes_get_stats() which does not. Have chip->info->ops->serdes_get_stats() be running with the register lock held to avoid such assertions. Fixes: 436fe17d273b ("net: dsa: mv88e6xxx: Allow the SERDES interfaces to have statistics") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18	net: fec: Fix unbalanced PM runtime calls	Florian Fainelli
	When unbinding/removing the driver, we will run into the following warnings: [ 259.655198] fec 400d1000.ethernet: 400d1000.ethernet supply phy not found, using dummy regulator [ 259.665065] fec 400d1000.ethernet: Unbalanced pm_runtime_enable! [ 259.672770] fec 400d1000.ethernet (unnamed net_device) (uninitialized): Invalid MAC address: 00:00:00:00:00:00 [ 259.683062] fec 400d1000.ethernet (unnamed net_device) (uninitialized): Using random MAC address: f2:3e:93:b7:29:c1 [ 259.696239] libphy: fec_enet_mii_bus: probed Avoid these warnings by balancing the runtime PM calls during fec_drv_remove(). Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18	Merge branch 'irq-urgent-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "Three fixes for irq chip drivers: - Make sure the allocations in the GIC-V3 ITS driver are large enough to accomodate the interrupt space - Fix a misplaced __iomem annotation which causes a splat of 26 sparse warnings - Remove an unused function in the IMX GPCV2 driver which causes build warnings" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/irq-imx-gpcv2: Remove unused function irqchip/gic-v3-its: Ensure nr_ites >= nr_lpis irqchip/gic-v3-its: Fix misplaced __iomem annotations
2018-03-18	Merge branch 'efi-urgent-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI fix from Thomas Gleixner: "A single fix to prevent partially initialized pointers in mixed mode (64bit kernel on 32bit UEFI)" * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi/libstub/tpm: Initialize pointer variables to zero for mixed mode
2018-03-18	ACPICA: Cleanup/simplify module-level code support	Bob Moore
	This prepares the code for eventual removal of the original style of deferred execution of the MLC. Signed-off-by: Bob Moore <robert.moore@intel.com> Signed-off-by: Erik Schmauss <erik.schmauss@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPICA: Events: add a return on failure from acpi_hw_register_read	Erik Schmauss
	This ensures that acpi_ev_fixed_event_detect() does not use fixed_status and and fixed_enable as uninitialized variables. Signed-off-by: Erik Schmauss <erik.schmauss@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPICA: adding SPDX headers	Erik Schmauss
	Signed-off-by: Erik Schmauss <erik.schmauss@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPICA: Rename a global for clarity, no functional change	Bob Moore
	Was acpi_gbl_parse_table_as_term_list, changed to: acpi_gbl_execute_tables_as_methods. Signed-off-by: Bob Moore <robert.moore@intel.com> Signed-off-by: Erik Schmauss <erik.schmauss@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPICA: macros: fix ACPI_ERROR_NAMESPACE macro	Erik Schmauss
	Fixing the ACPI_ERROR_NAMESPACE macros created an "unused variable" compile error when ACPI_NO_ERROR_MESSAGES was defined. This commit also fixes the above compilation errors by surrounding variables meant for debugging inside a new ACPI_ERROR_ONLY macro. Signed-off-by: Erik Schmauss <erik.schmauss@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPICA: Change a compile-time option to a runtime option	Bob Moore
	Changes the option to ignore package resolution errors into a runtime option. Signed-off-by: Bob Moore <robert.moore@intel.com> Signed-off-by: Erik Schmauss <erik.schmauss@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPICA: Remove calling of _STA from acpi_get_object_info()	Hans de Goede
	As the documentatuon above its declaration indicates, acpi_get_object_info() is intended for early probe usage and as such should not call any methods which may rely on op_regions, before this commit it was also calling _STA, which on some systems does rely on op_regions. Calling _STA before things are ready leads to errors such as these (under Linux, on some hardware): [ 0.123579] ACPI Error: No handler for Region [ECRM] (00000000ba9edc4c) [generic_serial_bus] (20170831/evregion-166) [ 0.123601] ACPI Error: Region generic_serial_bus (ID=9) has no handler (20170831/exfldio-299) [ 0.123618] ACPI Error: Method parse/execution failed \_SB.I2C1.BAT1._STA, AE_NOT_EXIST (20170831/psparse-550) End 2015 support for the _SUB method was removed for exactly the same reason. Removing current_status from struct acpi_device_info only has a limited impact. Within ACPICA it is only used by 2 debug messages, both of which are modified to no longer print it with this commit. Outside of ACPICA, there was one user in Linux, which has been patched to no longer use current_status in Torvald's current master. I've not checked if free_BSD or others are using the current_status field. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Erik Schmauss <erik.schmauss@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPICA: AML Debug Object: Don't ignore output of zero-length strings	Bob Moore
	The implementation previously ignored null strings (""), but these could be important, especially for debug. Signed-off-by: Bob Moore <robert.moore@intel.com> Signed-off-by: Erik Schmauss <erik.schmauss@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPICA: Fix memory leak on unusual memory leak	Bob Moore
	Fixes a single-object memory leak on a store-to-reference method invocation. ACPICA BZ 1439. Signed-off-by: Bob Moore <robert.moore@intel.com> Signed-off-by: Erik Schmauss <erik.schmauss@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPICA: Events: Dispatch GPEs after enabling for the first time	Erik Schmauss
	After being enabled for the first time, the GPEs may have STS bits already set. Setting EN bits is not sufficient to trigger the GPEs again, so this patch polls GPEs after enabling them for the first time. This is a cleaner version on top of the "GPE clear" fix generated according to Mika's report and Rafael's original Linux based fix. Based on Linux commit originated from Rafael J. Wysocki, fixed by Lv Zheng. Signed-off-by: Lv Zheng <lv.zheng@intel.com> Signed-off-by: Erik Schmauss <erik.schmauss@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPICA: Events: Add parallel GPE handling support to fix potential redundant ↵	Erik Schmauss
	_Exx evaluations There is a risk that a GPE method/handler may be invoked twice. Let's consider a case, both GPE0(RAW_HANDLER) and GPE1(_Exx) is triggered. =======================================+============================= IRQ handler (top-half) \|IRQ polling =======================================+============================= acpi_ev_detect_gpe() \| LOCK() \| READ (GPE0-7 enable/status registers)\| ^^^^^^^^^^^^ROOT CAUSE^^^^^^^^^^^^^^^\| Walk GPE0 \| UNLOCK() \|LOCK() Invoke GPE0 RAW_HANDLER \|READ (GPE1 enable/status bit) \|acpi_ev_gpe_dispatch(irq=false) \| CLEAR (GPE1 enable bit) \| CLEAR (GPE1 status bit) LOCK() \|UNLOCK() Walk GPE1 +============================= acpi_ev_gpe_dispatch(irq=true) \|IRQ polling (defer) CLEAR (GPE1 enable bit) +============================= CLEAR (GPE1 status bit) \|acpi_ev_async_execute_gpe_method() Walk others \| Evaluate GPE1 _Exx fi \| acpi_ev_async_enable_gpe() UNLOCK() \| LOCK() =======================================+ SET (GPE enable bit) IRQ handler (bottom-half) \| UNLOCK() =======================================+ acpi_ev_async_execute_gpe_method() \| Evaluate GPE1 _Exx \| acpi_ev_async_enable_gpe() \| LOCK() \| SET (GPE1 enable bit) \| UNLOCK() \| =======================================+============================= If acpi_ev_detect_gpe() is only invoked from the IRQ context, there won't be more than one _Lxx/_Exx evaluations for one status bit flagging if the IRQ handlers controlled by the underlying IRQ chip/driver (ex. APIC) are run in serial. Note that, this is a known potential gap and we had an approach, locking entire non-raw-handler processes in the top-half IRQ handler and handling all raw-handlers out of the locked loop to be friendly to those IRQ chip/driver. But the approach is too complicated while the issue is not so real, thus ACPICA treated such issue (if any) as a parallelism/quality issue of the underlying IRQ chip/driver to stop putting it on the radar. Bug in link #1 is suspiciously reflecting the same cause, and if so, it can also be fixed by this simpler approach. But it will be no excuse an ACPICA problem now if ACPICA starts to poll IRQs itself. In the changed scenario, _Exx will be evaluated from the task context due to new ACPICA provided "polling after enabling GPEs" mechanism. And the above figure uses edge-triggered GPEs demonstrating the possibility of evaluating _Exx twice for one status bit flagging. As a conclusion, there is now an increased chance of evaluating _Lxx/_Exx more than once for one status bit flagging. However this is still not a real problem if the _Lxx/_Exx checks the underlying hardware IRQ reasoning and finally just changes the 2nd and the follow-up evaluations into no-ops. Note that _Lxx should always be written in this way as a level-trigger GPE could have it's status wrongly duplicated by the underlying IRQ delivery mechanisms. But _Exx may have very low quality BIOS by BIOS to trigger real issues. For example, trigger duplicated button notifications. To solve this issue, we need to stop reading a bunch of enable/status register bits, but read only one GPE's enable/status bit. And GPE status register's W1C nature ensures that acknowledging one GPE won't affect another GPEs' status bits. Thus the hardware GPE architecture has already provided us with the mechanism of implementing such parallelism. So we can lock around one GPE handling process to achieve the parallelism: 1. If we can incorporate GPE enable bit check in detection and ensure the atomicity of the following process (top-half IRQ handler): READ (enable/status bit) if (enabled && raised) CLEAR (enable bit) and handle the GPE after this process, we can ensure that we will only invoke GPE handler once for one status bit flagging. 2. In addtion for edge-triggered GPEs, if we can ensure the atomicity of the following process (top-half IRQ handler): READ (enable/status bit) if (enabled && raised) CLEAR (enable bit) CLEAR (status bit) and handle the GPE after this process, we can ensure that we will only invoke GPE handler once for one status bit flagging. By doing a cleanup in this way, we can remove duplicate GPE handling code and ensure that all logics are collected in 1 function. And the function will be safe for both IRQ interrupt and IRQ polling, and will be safe for us to release and re-acquire acpi_gbl_gpe_lock at any time rather than raw handler only during the top-half IRQ handler. Lv Zheng. Link: https://bugzilla.kernel.org/show_bug.cgi?id=196703 [#1] Signed-off-by: Lv Zheng <lv.zheng@intel.com> Signed-off-by: Erik Schmauss <erik.schmauss@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPICA: Events: Stop unconditionally clearing ACPI IRQs during suspend/resume	Erik Schmauss
	Unconditionally clearing ACPI IRQs during suspend/resume can lead to unexpected IRQ losts. This patch fixes this issue by removing such IRQ clearing code. If this patch triggers regression, the regression should be in the GPE handlers that cannot correctly determine some spurious triggered events as no-ops. Please report any regression related to this commit to the ACPI component on kernel bugzilla. Lv Zheng. Link: https://bugzilla.kernel.org/show_bug.cgi?id=196249 Signed-off-by: Lv Zheng <lv.zheng@intel.com> Reported-and-tested-by: Eric Bakula-Davis <ericbakuladavis@gmail.com> Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Erik Schmauss <erik.schmauss@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	ACPICA: acpi: acpica: fix acpi operand cache leak in nseval.c	Seunghun Han
	I found an ACPI cache leak in ACPI early termination and boot continuing case. When early termination occurs due to malicious ACPI table, Linux kernel terminates ACPI function and continues to boot process. While kernel terminates ACPI function, kmem_cache_destroy() reports Acpi-Operand cache leak. Boot log of ACPI operand cache leak is as follows: >[ 0.464168] ACPI: Added _OSI(Module Device) >[ 0.467022] ACPI: Added _OSI(Processor Device) >[ 0.469376] ACPI: Added _OSI(3.0 _SCP Extensions) >[ 0.471647] ACPI: Added _OSI(Processor Aggregator Device) >[ 0.477997] ACPI Error: Null stack entry at ffff880215c0aad8 (20170303/exresop-174) >[ 0.482706] ACPI Exception: AE_AML_INTERNAL, While resolving operands for [opcode_name unavailable] (20170303/dswexec-461) >[ 0.487503] ACPI Error: Method parse/execution failed [\DBG] (Node ffff88021710ab40), AE_AML_INTERNAL (20170303/psparse-543) >[ 0.492136] ACPI Error: Method parse/execution failed [\_SB._INI] (Node ffff88021710a618), AE_AML_INTERNAL (20170303/psparse-543) >[ 0.497683] ACPI: Interpreter enabled >[ 0.499385] ACPI: (supports S0) >[ 0.501151] ACPI: Using IOAPIC for interrupt routing >[ 0.503342] ACPI Error: Null stack entry at ffff880215c0aad8 (20170303/exresop-174) >[ 0.506522] ACPI Exception: AE_AML_INTERNAL, While resolving operands for [opcode_name unavailable] (20170303/dswexec-461) >[ 0.510463] ACPI Error: Method parse/execution failed [\DBG] (Node ffff88021710ab40), AE_AML_INTERNAL (20170303/psparse-543) >[ 0.514477] ACPI Error: Method parse/execution failed [\_PIC] (Node ffff88021710ab18), AE_AML_INTERNAL (20170303/psparse-543) >[ 0.518867] ACPI Exception: AE_AML_INTERNAL, Evaluating _PIC (20170303/bus-991) >[ 0.522384] kmem_cache_destroy Acpi-Operand: Slab cache still has objects >[ 0.524597] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc5 #26 >[ 0.526795] Hardware name: innotek gmb_h virtual_box/virtual_box, BIOS virtual_box 12/01/2006 >[ 0.529668] Call Trace: >[ 0.530811] ? dump_stack+0x5c/0x81 >[ 0.532240] ? kmem_cache_destroy+0x1aa/0x1c0 >[ 0.533905] ? acpi_os_delete_cache+0xa/0x10 >[ 0.535497] ? acpi_ut_delete_caches+0x3f/0x7b >[ 0.537237] ? acpi_terminate+0xa/0x14 >[ 0.538701] ? acpi_init+0x2af/0x34f >[ 0.540008] ? acpi_sleep_proc_init+0x27/0x27 >[ 0.541593] ? do_one_initcall+0x4e/0x1a0 >[ 0.543008] ? kernel_init_freeable+0x19e/0x21f >[ 0.546202] ? rest_init+0x80/0x80 >[ 0.547513] ? kernel_init+0xa/0x100 >[ 0.548817] ? ret_from_fork+0x25/0x30 >[ 0.550587] vgaarb: loaded >[ 0.551716] EDAC MC: Ver: 3.0.0 >[ 0.553744] PCI: Probing PCI hardware >[ 0.555038] PCI host bridge to bus 0000:00 > ... Continue to boot and log is omitted ... I analyzed this memory leak in detail and found acpi_ns_evaluate() function only removes Info->return_object in AE_CTRL_RETURN_VALUE case. But, when errors occur, the status value is not AE_CTRL_RETURN_VALUE, and Info->return_object is also not null. Therefore, this causes acpi operand memory leak. This cache leak causes a security threat because an old kernel (<= 4.9) shows memory locations of kernel functions in stack dump. Some malicious users could use this information to neutralize kernel ASLR. I made a patch to fix ACPI operand cache leak. Signed-off-by: Seunghun Han <kkamagui@gmail.com> Signed-off-by: Erik Schmauss <erik.schmauss@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-18	Merge tag 'v4.16-rc5' into devel	Linus Walleij
	Linux 4.16-rc5 merged into the GPIO devel branch to resolve a nasty conflict between fixes and devel in the RCAR driver. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>