linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2017-08-18	PCI/IB: add support for pci driver attribute groups	Greg Kroah-Hartman
	Some drivers (specifically the nes IB driver), want to create a lot of sysfs driver attributes. Instead of open-coding the creation and removal of these files (and getting it wrong btw), it's a better idea to let the driver core handle all of this logic for us. So add a new field to the pci driver structure, **groups, that allows pci drivers to specify an attribute group list it wishes to have created when it is registered with the driver core. Big bonus is now the driver doesn't race with userspace when the sysfs files are created vs. when the kobject is announced, so any script/tool that actually wanted to use these files will not have to poll waiting for them to show up. Cc: Faisal Latif <faisal.latif@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-18	PCI: endpoint: Add an API to get matching "pci_epf_device_id"	Kishon Vijay Abraham I
	Add an API to get "pci_epf_device_id" matching the EPF name. This can be used by the EPF driver to get the driver data corresponding to the EPF device name. Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com> [bhelgaas: folded in "while" loop termination fix from Colin Ian King <colin.king@canonical.com>] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-08-18	cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup	Waiman Long
	A new mount option "cpuset_v2_mode" is added to the v1 cgroupfs filesystem to enable cpuset controller to use v2 behavior in a v1 cgroup. This mount option applies only to cpuset controller and have no effect on other controllers. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-08-18	blk-mq: Make blk_mq_reinit_tagset() calls easier to read	Bart Van Assche
	Since blk_mq_ops.reinit_request is only called from inside blk_mq_reinit_tagset(), make this function pointer an argument of blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops. This patch does not change any functionality but makes blk_mq_reinit_tagset() calls easier to read and to analyze. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: James Smart <james.smart@broadcom.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-18	kernel/watchdog: Prevent false positives with turbo modes	Thomas Gleixner
	The hardlockup detector on x86 uses a performance counter based on unhalted CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the performance counter period, so the hrtimer should fire 2-3 times before the performance counter NMI fires. The NMI code checks whether the hrtimer fired since the last invocation. If not, it assumess a hard lockup. The calculation of those periods is based on the nominal CPU frequency. Turbo modes increase the CPU clock frequency and therefore shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x nominal frequency) the perf/NMI period is shorter than the hrtimer period which leads to false positives. A simple fix would be to shorten the hrtimer period, but that comes with the side effect of more frequent hrtimer and softlockup thread wakeups, which is not desired. Implement a low pass filter, which checks the perf/NMI period against kernel time. If the perf/NMI fires before 4/5 of the watchdog period has elapsed then the event is ignored and postponed to the next perf/NMI. That solves the problem and avoids the overhead of shorter hrtimer periods and more frequent softlockup thread wakeups. Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector") Reported-and-tested-by: Kan Liang <Kan.liang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dzickus@redhat.com Cc: prarit@redhat.com Cc: ak@linux.intel.com Cc: babu.moger@oracle.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: acme@redhat.com Cc: stable@vger.kernel.org Cc: atomlin@redhat.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
2017-08-18	Merge branch 'irq/for-gpio' into irq/core	Thomas Gleixner
	Merge the flow handlers and irq domain extensions which are in a separate branch so they can be consumed by the gpio folks.
2017-08-18	irqdomain: Add irq_domain_{push,pop}_irq() functions	David Daney
	For an already existing irqdomain hierarchy, as might be obtained via a call to pci_enable_msix_range(), a PCI driver wishing to add an additional irqdomain to the hierarchy needs to be able to insert the irqdomain to that already initialized hierarchy. Calling irq_domain_create_hierarchy() allows the new irqdomain to be created, but no existing code allows for initializing the associated irq_data. Add a couple of helper functions (irq_domain_push_irq() and irq_domain_pop_irq()) to initialize the irq_data for the new irqdomain added to an existing hierarchy. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexandre Courbot <gnurou@gmail.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-gpio@vger.kernel.org Link: http://lkml.kernel.org/r/1503017616-3252-6-git-send-email-david.daney@cavium.com
2017-08-18	genirq: Add handle_fasteoi_{level,edge}_irq flow handlers	David Daney
	Follow-on patch for gpio-thunderx uses a irqdomain hierarchy which requires slightly different flow handlers, add them to chip.c which contains most of the other flow handlers. Make these conditionally compiled based on CONFIG_IRQ_FASTEOI_HIERARCHY_HANDLERS. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexandre Courbot <gnurou@gmail.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-gpio@vger.kernel.org Link: http://lkml.kernel.org/r/1503017616-3252-3-git-send-email-david.daney@cavium.com
2017-08-18	genirq: Restrict effective affinity to interrupts actually using it	Marc Zyngier
	Just because CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK is selected doesn't mean that all the interrupts are using the effective affinity mask. For a number of them, this mask is likely to be empty. In order to deal with this, let's restrict the use of the effective affinity mask to these interrupts that have a non empty effective affinity. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Lunn <andrew@lunn.ch> Cc: James Hogan <james.hogan@imgtec.com> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Chris Zankel <chris@zankel.net> Cc: Kevin Cernekee <cernekee@gmail.com> Cc: Wei Xu <xuwei5@hisilicon.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Gregory Clement <gregory.clement@free-electrons.com> Cc: Matt Redfearn <matt.redfearn@imgtec.com> Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com> Link: http://lkml.kernel.org/r/20170818083925.10108-2-marc.zyngier@arm.com
2017-08-18	Merge branch 'x86/asm' into locking/core	Ingo Molnar
	We need the ASM_UNREACHABLE() macro for a dependent patch. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-18	ACPI / PM: Check low power idle constraints for debug only	Srinivas Pandruvada
	For SoC to achieve its lowest power platform idle state a set of hardware preconditions must be met. These preconditions or constraints can be obtained by issuing a device specific method (_DSM) with function "1". Refer to the document provided in the link below. Here during initialization (from attach() callback of LPS0 device), invoke function 1 to get the device constraints. Each enabled constraint is stored in a table. The devices in this table are used to check whether they were in required minimum state, while entering suspend. This check is done from platform freeze wake() callback, only when /sys/power/pm_debug_messages attribute is non zero. If any constraint is not met and device is ACPI power managed then it prints the device information to kernel logs. Also if debug is enabled in acpi/sleep.c, the constraint table and state of each device on wake is dumped in kernel logs. Since pm_debug_messages_on setting is used as condition to check constraints outside kernel/power/main.c, pm_debug_messages_on is changed to a global variable. Link: http://www.uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-17	Revert "pstore: Honor dmesg_restrict sysctl on dmesg dumps"	Kees Cook
	This reverts commit 68c4a4f8abc60c9440ede9cd123d48b78325f7a3, with various conflict clean-ups. The capability check required too much privilege compared to simple DAC controls. A system builder was forced to have crash handler processes run with CAP_SYSLOG which would give it the ability to read (and wipe) the _current_ dmesg, which is much more access than being given access only to the historical log stored in pstorefs. With the prior commit to make the root directory 0750, the files are protected by default but a system builder can now opt to give access to a specific group (via chgrp on the pstorefs root directory) without being forced to also give away CAP_SYSLOG. Suggested-by: Nick Kralevich <nnk@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.cz> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
2017-08-17	quota: Reduce contention on dq_data_lock	Jan Kara
	dq_data_lock is currently used to protect all modifications of quota accounting information, consistency of quota accounting on the inode, and dquot pointers from inode. As a result contention on the lock can be pretty heavy. Reduce the contention on the lock by protecting quota accounting information by a new dquot->dq_dqb_lock and consistency of quota accounting with inode usage by inode->i_lock. This change reduces time to create 500000 files on ext4 on ramdisk by 50 different processes in separate directories by 6% when user quota is turned on. When those 50 processes belong to 50 different users, the improvement is about 9%. Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17	fs: Provide __inode_get_bytes()	Jan Kara
	Provide helper __inode_get_bytes() which assumes i_lock is already acquired. Quota code will need this to be able to use i_lock to protect consistency of quota accounting information and inode usage. Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17	quota: Inline functions into their callsites	Jan Kara
	inode_add_rsv_space() and inode_sub_rsv_space() had only one callsite. Inline them there directly. inode_claim_rsv_space() and inode_reclaim_rsv_space() had two callsites so inline them there as well. This will simplify further locking changes. Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17	quota: Allow disabling tracking of dirty dquots in a list	Jan Kara
	Filesystems that are journalling quotas generally don't need tracking of dirty dquots in a list since forcing a transaction commit flushes all quotas anyway. Allow filesystem to say it doesn't want dquots to be tracked as it reduces contention on the dq_list_lock. Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17	quota: Remove dq_wait_unused from dquot	Jan Kara
	Currently every dquot carries a wait_queue_head_t used only when we are turning quotas off to wait for last users to drop dquot references. Since such rare case is not performance sensitive in any means, just use a global waitqueue for this and save space in struct dquot. Also convert the logic to use wait_event() instead of open-coding it. Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-18	Merge tag 'omapdrm-4.14' of ↵	Dave Airlie
	git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux into drm-next omapdrm changes for v4.14 * HDMI hot plug IRQ support (instead of polling) * Big driver cleanup from Laurent (no functional changes) * OMAP5 DSI support (only the pinmuxing was missing) * tag 'omapdrm-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux: (60 commits) drm/omap: Potential NULL deref in omap_crtc_duplicate_state() drm/omap: remove no-op cleanup code drm/omap: rename omapdrm device back drm: omapdrm: Remove omapdrm platform data ARM: OMAP2+: Don't register omapdss device for omapdrm ARM: OMAP2+: Remove unused omapdrm platform device drm: omapdrm: Remove the omapdss driver drm: omapdrm: Register omapdrm platform device in omapdss driver drm: omapdrm: hdmi: Don't allocate PHY features dynamically drm: omapdrm: hdmi: Configure the PHY from the HDMI core version drm: omapdrm: hdmi: Configure the PLL from the HDMI core version drm: omapdrm: hdmi: Pass HDMI core version as integer to HDMI audio drm: omapdrm: hdmi: Replace OMAP SoC model check with HDMI xmit version drm: omapdrm: hdmi: Rename functions and structures to use hdmi_ prefix drm/omap: add OMAP5 DSIPHY lane-enable support drm/omap: use regmap_update_bit() when muxing DSI pads drm: omapdrm: Remove dss_features.h drm: omapdrm: Move supported outputs feature to dss driver drm: omapdrm: Move DSS_FCK feature to dss driver drm: omapdrm: Move PCD, LINEWIDTH and DOWNSCALE features to dispc driver ...
2017-08-17	lsm_audit: update my email address	Stephen Smalley
	Update my email address since epoch.ncsc.mil no longer exists. MAINTAINERS and CREDITS are already correct. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-08-17	quota: Convert dqio_mutex to rwsem	Jan Kara
	Convert dqio_mutex to rwsem and call it dqio_sem. No functional changes yet. Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17	pty: fix the cached path of the pty slave file descriptor in the master	Linus Torvalds
	Christian Brauner reported that if you use the TIOCGPTPEER ioctl() to get a slave pty file descriptor, the resulting file descriptor doesn't look right in /proc/<pid>/fd/<fd>. In particular, he wanted to use readlink() on /proc/self/fd/<fd> to get the pathname of the slave pty (basically implementing "ptsname{_r}()"). The reason for that was that we had generated the wrong 'struct path' when we create the pty in ptmx_open(). In particular, the dentry was correct, but the vfsmount pointed to the mount of the ptmx node. That _can_ be correct - in case you use "/dev/pts/ptmx" to open the master - but usually is not. The normal case is to use /dev/ptmx, which then looks up the pts/ directory, and then the vfsmount of the ptmx node is obviously the /dev directory, not the /dev/pts/ directory. We actually did have the right vfsmount available, but in the wrong place (it gets looked up in 'devpts_acquire()' when we get a reference to the pts filesystem), and so ptmx_open() used the wrong mnt pointer. The end result of this confusion was that the pty worked fine, but when if you did TIOCGPTPEER to get the slave side of the pty, end end result would also work, but have that dodgy 'struct path'. And then when doing "d_path()" on to get the pathname, the vfsmount would not match the root of the pts directory, and d_path() would return an empty pathname thinking that the entry had escaped a bind mount into another mount. This fixes the problem by making devpts_acquire() return the vfsmount for the pts filesystem, allowing ptmx_open() to trivially just use the right mount for the pts dentry, and create the proper 'struct path'. Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-17	Merge branches 'doc.2017.08.17a', 'fixes.2017.08.17a', ↵	Paul E. McKenney
	'hotplug.2017.07.25b', 'misc.2017.08.17a', 'spin_unlock_wait_no.2017.08.17a', 'srcu.2017.07.27c' and 'torture.2017.07.24c' into HEAD doc.2017.08.17a: Documentation updates. fixes.2017.08.17a: RCU fixes. hotplug.2017.07.25b: CPU-hotplug updates. misc.2017.08.17a: Miscellaneous fixes outside of RCU (give or take conflicts). spin_unlock_wait_no.2017.08.17a: Remove spin_unlock_wait(). srcu.2017.07.27c: SRCU updates. torture.2017.07.24c: Torture-test updates.
2017-08-17	locking: Remove spin_unlock_wait() generic definitions	Paul E. McKenney
	There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes spin_unlock_wait() and related definitions from core code. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-17	Merge branch 'core' into arm/tegra	Joerg Roedel

2017-08-17	swait: Add idle variants which don't contribute to load average	Luis R. Rodriguez
	There are cases where folks are using an interruptible swait when using kthreads. This is rather confusing given you'd expect interruptible waits to be -- interruptible, but kthreads are not interruptible ! The reason for such practice though is to avoid having these kthreads contribute to the system load average. When systems are idle some kthreads may spend a lot of time blocking if using swait_event_timeout(). This would contribute to the system load average. On systems without preemption this would mean the load average of an idle system is bumped to 2 instead of 0. On systems with PREEMPT=y this would mean the load average of an idle system is bumped to 3 instead of 0. This adds proper API using TASK_IDLE to make such goals explicit and avoid confusion. Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-08-17	rcu: Create reasonable API for do_exit() TASKS_RCU processing	Paul E. McKenney
	Currently, the exit-time support for TASKS_RCU is open-coded in do_exit(). This commit creates exit_tasks_rcu_start() and exit_tasks_rcu_finish() APIs for do_exit() use. This has the benefit of confining the use of the tasks_rcu_exit_srcu variable to one file, allowing it to become static. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-08-17	rcu: Drive TASKS_RCU directly off of PREEMPT	Paul E. McKenney
	The actual use of TASKS_RCU is only when PREEMPT, otherwise RCU-sched is used instead. This commit therefore makes synchronize_rcu_tasks() and call_rcu_tasks() available always, but mapped to synchronize_sched() and call_rcu_sched(), respectively, when !PREEMPT. This approach also allows some #ifdefs to be removed from rcutorture. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Ingo Molnar <mingo@kernel.org>
2017-08-17	locking/lockdep: Explicitly initialize wq_barrier::done::map	Boqun Feng
	With the new lockdep crossrelease feature, which checks completions usage, a false positive is reported in the workqueue code: > Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released > Task B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released > Task C : acquired of lock#3 -> wait for completion of barr->done > (Task C is in lru_add_drain_all_cpuslocked()) > Worker D : wait for wfc.work to be released -> will complete barr->done Such a dead lock can not happen because Task C's barr->done and Worker D's barr->done can not be the same instance. The reason of this false positive is we initialize all wq_barrier::done at insert_wq_barrier() via init_completion(), which makes them belong to the same lock class, therefore, impossible circles are reported. To fix this, explicitly initialize the lockdep map for wq_barrier::done in insert_wq_barrier(), so that the lock class key of wq_barrier::done is a subkey of the corresponding work_struct, as a result we won't build a dependency between a wq_barrier with a unrelated work, and we can differ wq barriers based on the related works, so the false positive above is avoided. Also define the empty lockdep_init_map_crosslock() for !CROSSRELEASE to make the code simple and away from unnecessary #ifdefs. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170817094622.12915-1-boqun.feng@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17	locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS	Byungchul Park
	'complete' is an adjective and LOCKDEP_COMPLETE sounds like 'lockdep is complete', so pick a better name that uses a noun. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-team@lge.com Link: http://lkml.kernel.org/r/1502960261-16206-3-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17	efi: Introduce efi_early_memdesc_ptr to get pointer to memmap descriptor	Baoquan He
	The existing map iteration helper for_each_efi_memory_desc_in_map can only be used after the kernel initializes the EFI subsystem to set up struct efi_memory_map. Before that we also need iterate map descriptors which are stored in several intermediate structures, like struct efi_boot_memmap for arch independent usage and struct efi_info for x86 arch only. Introduce efi_early_memdesc_ptr() to get pointer to a map descriptor, and replace several places where that primitive is open coded. Signed-off-by: Baoquan He <bhe@redhat.com> [ Various improvements to the text. ] Acked-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: ard.biesheuvel@linaro.org Cc: fanc.fnst@cn.fujitsu.com Cc: izumi.taku@jp.fujitsu.com Cc: keescook@chromium.org Cc: linux-efi@vger.kernel.org Cc: n-horiguchi@ah.jp.nec.com Cc: thgarnie@google.com Link: http://lkml.kernel.org/r/20170816134651.GF21273@x1 Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17	locking/refcounts, x86/asm: Implement fast refcount overflow protection	Kees Cook
	This implements refcount_t overflow protection on x86 without a noticeable performance impact, though without the fuller checking of REFCOUNT_FULL. This is done by duplicating the existing atomic_t refcount implementation but with normally a single instruction added to detect if the refcount has gone negative (e.g. wrapped past INT_MAX or below zero). When detected, the handler saturates the refcount_t to INT_MIN / 2. With this overflow protection, the erroneous reference release that would follow a wrap back to zero is blocked from happening, avoiding the class of refcount-overflow use-after-free vulnerabilities entirely. Only the overflow case of refcounting can be perfectly protected, since it can be detected and stopped before the reference is freed and left to be abused by an attacker. There isn't a way to block early decrements, and while REFCOUNT_FULL stops increment-from-zero cases (which would be the state _after_ an early decrement and stops potential double-free conditions), this fast implementation does not, since it would require the more expensive cmpxchg loops. Since the overflow case is much more common (e.g. missing a "put" during an error path), this protection provides real-world protection. For example, the two public refcount overflow use-after-free exploits published in 2016 would have been rendered unexploitable: http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/ http://cyseclabs.com/page?n=02012016 This implementation does, however, notice an unchecked decrement to zero (i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it resulted in a zero). Decrements under zero are noticed (since they will have resulted in a negative value), though this only indicates that a use-after-free may have already happened. Such notifications are likely avoidable by an attacker that has already exploited a use-after-free vulnerability, but it's better to have them reported than allow such conditions to remain universally silent. On first overflow detection, the refcount value is reset to INT_MIN / 2 (which serves as a saturation value) and a report and stack trace are produced. When operations detect only negative value results (such as changing an already saturated value), saturation still happens but no notification is performed (since the value was already saturated). On the matter of races, since the entire range beyond INT_MAX but before 0 is negative, every operation at INT_MIN / 2 will trap, leaving no overflow-only race condition. As for performance, this implementation adds a single "js" instruction to the regular execution flow of a copy of the standard atomic_t refcount operations. (The non-"and_test" refcount_dec() function, which is uncommon in regular refcount design patterns, has an additional "jz" instruction to detect reaching exactly zero.) Since this is a forward jump, it is by default the non-predicted path, which will be reinforced by dynamic branch prediction. The result is this protection having virtually no measurable change in performance over standard atomic_t operations. The error path, located in .text.unlikely, saves the refcount location and then uses UD0 to fire a refcount exception handler, which resets the refcount, handles reporting, and returns to regular execution. This keeps the changes to .text size minimal, avoiding return jumps and open-coded calls to the error reporting routine. Example assembly comparison: refcount_inc() before: .text: ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp) refcount_inc() after: .text: ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp) ffffffff8154614d: 0f 88 80 d5 17 00 js ffffffff816c36d3 ... .text.unlikely: ffffffff816c36d3: 48 8d 4d f4 lea -0xc(%rbp),%rcx ffffffff816c36d7: 0f ff (bad) These are the cycle counts comparing a loop of refcount_inc() from 1 to INT_MAX and back down to 0 (via refcount_dec_and_test()), between unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL (refcount_t-full), and this overflow-protected refcount (refcount_t-fast): 2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s: cycles protections atomic_t 82249267387 none refcount_t-fast 82211446892 overflow, untested dec-to-zero refcount_t-full 144814735193 overflow, untested dec-to-zero, inc-from-zero This code is a modified version of the x86 PAX_REFCOUNT atomic_t overflow defense from the last public patch of PaX/grsecurity, based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Thanks to PaX Team for various suggestions for improvement for repurposing this code to be a refcount-only protection. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Hans Liljestrand <ishkamiel@gmail.com> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Jann Horn <jannh@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arozansk@redhat.com Cc: axboe@kernel.dk Cc: kernel-hardening@lists.openwall.com Cc: linux-arch <linux-arch@vger.kernel.org> Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17	x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages	Tony Luck
	Speculative processor accesses may reference any memory that has a valid page table entry. While a speculative access won't generate a machine check, it will log the error in a machine check bank. That could cause escalation of a subsequent error since the overflow bit will be then set in the machine check bank status register. Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual address of the page we want to map out otherwise we may trigger the very problem we are trying to avoid. We use a non-canonical address that passes through the usual Linux table walking code to get to the same "pte". Thanks to Dave Hansen for reviewing several iterations of this. Also see: http://marc.info/?l=linux-mm&m=149860136413338&w=2 Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Elliott, Robert (Persistent Memory) <elliott@hpe.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170816171803.28342-1-tony.luck@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17	Merge branch 'linus' into perf/core, to pick up fixes	Ingo Molnar
	Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-16	ptr_ring: use kmalloc_array()	Eric Dumazet
	As found by syzkaller, malicious users can set whatever tx_queue_len on a tun device and eventually crash the kernel. Lets remove the ALIGN(XXX, SMP_CACHE_BYTES) thing since a small ring buffer is not fast anyway. Fixes: 2e0ab8ca83c1 ("ptr_ring: array based FIFO for pointers") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16	vmbus: remove unused vmbus_sendpacket_ctl	stephen hemminger
	The only usage of vmbus_sendpacket_ctl was by vmbus_sendpacket. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16	vmbus: remove unused vmubs_sendpacket_pagebuffer_ctl	stephen hemminger
	The function vmbus_sendpacket_pagebuffer_ctl was never used directly. Just have vmbus_send_pagebuffer Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16	vmbus: remove unused vmbus_sendpacket_multipagebuffer	stephen hemminger
	This function is not used anywhere in current code. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16	bpf: sock_map fixes for !CONFIG_BPF_SYSCALL and !STREAM_PARSER	John Fastabend
	Resolve issues with !CONFIG_BPF_SYSCALL and !STREAM_PARSER net/core/filter.c: In function ‘do_sk_redirect_map’: net/core/filter.c:1881:3: error: implicit declaration of function ‘__sock_map_lookup_elem’ [-Werror=implicit-function-declaration] sk = __sock_map_lookup_elem(ri->map, ri->ifindex); ^ net/core/filter.c:1881:6: warning: assignment makes pointer from integer without a cast [enabled by default] sk = __sock_map_lookup_elem(ri->map, ri->ifindex); Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-17	Merge tag 'drm-misc-next-2017-08-16' of ↵	Dave Airlie
	git://anongit.freedesktop.org/git/drm-misc into drm-next UAPI Changes: - vc4: Allow userspace to dictate rendering order in submit_cl ioctl (Eric) Cross-subsystem Changes: - vboxvideo: One of Cihangir's patches applies to vboxvideo which is maintained in staging Core Changes: - atomic_legacy_backoff is officially killed (Daniel) - Extract drm_device.h (Daniel) - Unregister drm device on unplug (Daniel) - Rename deprecated drm__(un)?reference functions to drm__{get\|put} (Cihangir) Driver Changes: - vc4: Error/destroy path cleanups, log level demotion, edid leak (Eric) - various: Make various drm__funcs structs const (Bhumika) - tinydrm: add support for LEGO MINDSTORMS EV3 LCD (David) - various: Second half of .dumb_{map_offset\|destroy} defaults set (Noralf) Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Eric Anholt <eric@anholt.net> Cc: Bhumika Goyal <bhumirks@gmail.com> Cc: Cihangir Akturk <cakturk@gmail.com> Cc: David Lechner <david@lechnology.com> Cc: Noralf Trønnes <noralf@tronnes.org> tag 'drm-misc-next-2017-08-16' of git://anongit.freedesktop.org/git/drm-misc: (50 commits) drm/gem-cma-helper: Remove drm_gem_cma_dumb_map_offset() drm/virtio: Use the drm_driver.dumb_destroy default drm/bochs: Use the drm_driver.dumb_destroy default drm/mgag200: Use the drm_driver.dumb_destroy default drm/exynos: Use .dumb_map_offset and .dumb_destroy defaults drm/msm: Use the drm_driver.dumb_destroy default drm/ast: Use the drm_driver.dumb_destroy default drm/qxl: Use the drm_driver.dumb_destroy default drm/udl: Use the drm_driver.dumb_destroy default drm/cirrus: Use the drm_driver.dumb_destroy default drm/tegra: Use .dumb_map_offset and .dumb_destroy defaults drm/gma500: Use .dumb_map_offset and .dumb_destroy defaults drm/mxsfb: Use .dumb_map_offset and .dumb_destroy defaults drm/meson: Use .dumb_map_offset and .dumb_destroy defaults drm/kirin: Use .dumb_map_offset and .dumb_destroy defaults drm/vc4: Continue the switch to drm_*_put() helpers drm/vc4: Fix leak of HDMI EDID dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly v2 dma-buf: add reservation_object_copy_fences (v2) drm/tinydrm: add support for LEGO MINDSTORMS EV3 LCD ...
2017-08-16	Merge tag 'omap-for-v4.14/soc-signed' of ↵	Arnd Bergmann
	git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into next/soc Pull "soc changes for omaps for v4.14" from Tony Lindgren: SoC updates for omaps for v4.14. Most of the chages are to add support for new dra762 SoC. The other changes are are for legacy DMA code removal, and MMC quirk and iodelay config for dra7. * tag 'omap-for-v4.14/soc-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap: ARM: OMAP: dra7: powerdomain data: Register SoC specific powerdomains ARM: dra762: Enable SMP for dra762 ARM: dra7: hwmod: Register dra76x specific hwmod ARM: dra762: Add support for device identification ARM: OMAP2+: board-generic: add support for dra762 family ARM: OMAP2+: Select PINCTRL_TI_IODELAY for SOC_DRA7XX ARM: OMAP2+: Add pdata-quirks for MMC/SD on DRA74x EVM ARM: OMAP2+: Remove unused legacy code for DMA
2017-08-16	Merge tag 'tee-drv-for-4.14' of ↵	Arnd Bergmann
	http://git.linaro.org/people/jens.wiklander/linux-tee into next/drivers Pull "Small fixes and enhancements for the TEE subsystem" from Jens Wiklander: * tag 'tee-drv-for-4.14' of http://git.linaro.org/people/jens.wiklander/linux-tee: tee: optee: sync with new naming of interrupts tee: indicate privileged dev in gen_caps tee: optee: interruptible RPC sleep tee: optee: add const to tee_driver_ops and tee_desc structures tee: tee_shm: Constify dma_buf_ops structures. tee: add forward declaration for struct device tee: optee: fix uninitialized symbol 'parg'
2017-08-16	SUNRPC: Don't hold the transport lock across socket copy operations	Trond Myklebust
	Instead add a mechanism to ensure that the request doesn't disappear from underneath us while copying from the socket. We do this by preventing xprt_release() from freeing the XDR buffers until the flag RPC_TASK_MSG_RECV has been cleared from the request. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2017-08-16	bpf: sockmap with sk redirect support	John Fastabend
	Recently we added a new map type called dev map used to forward XDP packets between ports (6093ec2dc313). This patches introduces a similar notion for sockets. A sockmap allows users to add participating sockets to a map. When sockets are added to the map enough context is stored with the map entry to use the entry with a new helper bpf_sk_redirect_map(map, key, flags) This helper (analogous to bpf_redirect_map in XDP) is given the map and an entry in the map. When called from a sockmap program, discussed below, the skb will be sent on the socket using skb_send_sock(). With the above we need a bpf program to call the helper from that will then implement the send logic. The initial site implemented in this series is the recv_sock hook. For this to work we implemented a map attach command to add attributes to a map. In sockmap we add two programs a parse program and a verdict program. The parse program uses strparser to build messages and pass them to the verdict program. The parse programs use the normal strparser semantics. The verdict program is of type SK_SKB. The verdict program returns a verdict SK_DROP, or SK_REDIRECT for now. Additional actions may be added later. When SK_REDIRECT is returned, expected when bpf program uses bpf_sk_redirect_map(), the sockmap logic will consult per cpu variables set by the helper routine and pull the sock entry out of the sock map. This pattern follows the existing redirect logic in cls and xdp programs. This gives the flow, recv_sock -> str_parser (parse_prog) -> verdict_prog -> skb_send_sock \ -> kfree_skb As an example use case a message based load balancer may use specific logic in the verdict program to select the sock to send on. Sample programs are provided in future patches that hopefully illustrate the user interfaces. Also selftests are in follow-on patches. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16	bpf: export bpf_prog_inc_not_zero	John Fastabend
	bpf_prog_inc_not_zero will be used by upcoming sockmap patches this patch simply exports it so we can pull it in. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16	bpf: introduce new program type for skbs on sockets	John Fastabend
	A class of programs, run from strparser and soon from a new map type called sock map, are used with skb as the context but on established sockets. By creating a specific program type for these we can use bpf helpers that expect full sockets and get the verifier to ensure these helpers are not used out of context. The new type is BPF_PROG_TYPE_SK_SKB. This patch introduces the infrastructure and type. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16	PCI: Add pci_irqd_intx_xlate()	Paul Burton
	Legacy PCI INTx interrupts are represented in the PCI_INTERRUPT_PIN register using the range 1-4, which matches our enum pci_interrupt_pin. This is however not ideal for an IRQ domain, where with 4 interrupts we would ideally have a domain of size 4 & hwirq numbers in the range 0-3. Different PCI host controller drivers have handled this in different ways. Of those under drivers/pci/ which register an INTx IRQ domain, we have: - pcie-altera uses the range 1-4 in device trees and an IRQ domain of size 5 to cover that range, with entry 0 wasted. - pcie-xilinx & pcie-xilinx-nwl use the range 1-4 in device trees but register an IRQ domain of size 4, which doesn't cover the hwirq=4/INTD case leading to that interrupt being broken. - pci-ftpci100 & pci-aardvark use the range 0-3 in both device trees & as hwirq numbering in the driver & IRQ domain. In order to introduce some level of consistency in at least the hwirq numbering used by the drivers & IRQ domains, this patch introduces a new pci_irqd_intx_xlate() helper function which drivers using the 1-4 range in device trees can assign as the xlate callback for their INTx IRQ domain. This translates the 1-4 range into a 0-3 range, allowing us to use an IRQ domain of size 4 & avoid a wasted entry. Further patches will make use of this in drivers to allow them to use an IRQ domain of size 4 for legacy INTx interrupts without breaking INTD. Signed-off-by: Paul Burton <paul.burton@imgtec.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-08-16	Drivers: hv: vmbus: Fix rescind handling issues	K. Y. Srinivasan
	This patch handles the following issues that were observed when we are handling racing channel offer message and rescind message for the same offer: 1. Since the host does not respond to messages on a rescinded channel, in the current code, we could be indefinitely blocked on the vmbus_open() call. 2. When a rescinded channel is being closed, if there is a pending interrupt on the channel, we could end up freeing the channel that the interrupt handler would run on. Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Tested-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-16	genirq/irq_sim: Add a devres variant of irq_sim_init()	Bartosz Golaszewski
	Add a resource managed version of irq_sim_init(). This can be conveniently used in device drivers. Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-doc@vger.kernel.org Cc: linux-gpio@vger.kernel.org Cc: Bamvor Jian Zhang <bamvor.zhangjian@linaro.org> Cc: Jonathan Cameron <jic23@kernel.org> Link: http://lkml.kernel.org/r/20170814145318.6495-3-brgl@bgdev.pl Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-08-16	genirq/irq_sim: Add a simple interrupt simulator framework	Bartosz Golaszewski
	Implement a simple, irq_work-based framework for simulating interrupts. Currently the API exposes routines for initializing and deinitializing the simulator object, enqueueing the interrupts and retrieving the allocated interrupt numbers based on the offset of the dummy interrupt in the simulator struct. Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-doc@vger.kernel.org Cc: linux-gpio@vger.kernel.org Cc: Bamvor Jian Zhang <bamvor.zhangjian@linaro.org> Cc: Jonathan Cameron <jic23@kernel.org> Link: http://lkml.kernel.org/r/20170814145318.6495-2-brgl@bgdev.pl Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-08-16	drm: omapdrm: Remove omapdrm platform data	Laurent Pinchart
	The omapdrm platform data are not used anymore, remove them. Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Reviewed-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>