linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2017-11-06	PCI: Add for_each_pci_bridge() helper	Andy Shevchenko
	The following pattern is often used: list_for_each_entry(dev, &bus->devices, bus_list) { if (pci_is_bridge(dev)) { ... } } Add a for_each_pci_bridge() helper to make that code easier to write and read by reducing indentation level. It also saves one or few lines of code in each occurrence. Convert PCI core parts here at the same time. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> [bhelgaas: fold in http://lkml.kernel.org/r/20171013165352.25550-1-andriy.shevchenko@linux.intel.com] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-11-07	netfilter: nf_tables: get set elements via netlink	Pablo Neira Ayuso
	This patch adds a new get operation to look up for specific elements in a set via netlink interface. You can also use it to check if an interval already exists. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06	mtd: Fix C++ comment in include/linux/mtd/mtd.h	Pavel Machek
	C++ comments look wrong in kernel tree. Fix one. Signed-off-by: Pavel Machek <pavel@ucw.cz> Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2017-11-06	xen: limit grant v2 interface to the v1 functionality	Juergen Gross
	As there is currently no user for sub-page grants or transient grants remove that functionality. This at once makes it possible to switch from grant v2 to grant v1 without restrictions, as there is no loss of functionality other than the limited frame number width related to the switch. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2017-11-06	xen: re-introduce support for grant v2 interface	Juergen Gross
	The grant v2 support was removed from the kernel with commit 438b33c7145ca8a5131a30c36d8f59bce119a19a ("xen/grant-table: remove support for V2 tables") as the higher memory footprint of v2 grants resulted in less grants being possible for a kernel compared to the v1 grant interface. As machines with more than 16TB of memory are expected to be more common in the near future support of grant v2 is mandatory in order to be able to run a Xen pv domain at any memory location. So re-add grant v2 support basically by reverting above commit. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2017-11-06	ide: Convert timers to use timer_setup()	Kees Cook
	In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-ide@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-06	ALSA: seq: Avoid invalid lockdep class warning	Takashi Iwai
	The recent fix for adding rwsem nesting annotation was using the given "hop" argument as the lock subclass key. Although the idea itself works, it may trigger a kernel warning like: BUG: looking up invalid subclass: 8 .... since the lockdep has a smaller number of subclasses (8) than we currently allow for the hops there (10). The current definition is merely a sanity check for avoiding the too deep delivery paths, and the 8 hops are already enough. So, as a quick fix, just follow the max hops as same as the max lockdep subclasses. Fixes: 1f20f9ff57ca ("ALSA: seq: Fix nested rwsem annotation for lockdep splat") Reported-by: syzbot <syzkaller@googlegroups.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-11-06	KVM: arm/arm64: vgic: restructure kvm_vgic_(un)map_phys_irq	Eric Auger
	We want to reuse the core of the map/unmap functions for IRQ forwarding. Let's move the computation of the hwirq in kvm_vgic_map_phys_irq and pass the linux IRQ as parameter. the host_irq is added to struct vgic_irq. We introduce kvm_vgic_map/unmap_irq which take a struct vgic_irq handle as a parameter. Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2017-11-06	Merge git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core	Christoffer Dall

2017-11-06	netfilter: conntrack: don't cache nlattr_tuple_size result in nla_size	Florian Westphal
	We currently call ->nlattr_tuple_size() once at register time and cache result in l4proto->nla_size. nla_size is the only member that is written to, avoiding this would allow to make l4proto trackers const. We can use ->nlattr_tuple_size() at run time, and cache result in the individual trackers instead. This is an intermediate step, next patch removes nlattr_size() callback and computes size at compile time, then removes nla_size. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06	KVM: arm/arm64: Rework kvm_timer_should_fire	Christoffer Dall
	kvm_timer_should_fire() can be called in two different situations from the kvm_vcpu_block(). The first case is before calling kvm_timer_schedule(), used for wait polling, and in this case the VCPU thread is running and the timer state is loaded onto the hardware so all we have to do is check if the virtual interrupt lines are asserted, becasue the timer interrupt handler functions will raise those lines as appropriate. The second case is inside the wait loop of kvm_vcpu_block(), where we have already called kvm_timer_schedule() and therefore the hardware will be disabled and the software view of the timer state is up to date (timer->loaded is false), and so we can simply check if the timer should fire by looking at the software state. Signed-off-by: Christoffer Dall <cdall@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
2017-11-06	KVM: arm/arm64: Get rid of kvm_timer_flush_hwstate	Christoffer Dall
	Now when both the vtimer and the ptimer when using both the in-kernel vgic emulation and a userspace IRQ chip are driven by the timer signals and at the vcpu load/put boundaries, instead of recomputing the timer state at every entry/exit to/from the guest, we can get entirely rid of the flush hwstate function. Signed-off-by: Christoffer Dall <cdall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com>
2017-11-06	KVM: arm/arm64: Avoid timer save/restore in vcpu entry/exit	Christoffer Dall
	We don't need to save and restore the hardware timer state and examine if it generates interrupts on on every entry/exit to the guest. The timer hardware is perfectly capable of telling us when it has expired by signaling interrupts. When taking a vtimer interrupt in the host, we don't want to mess with the timer configuration, we just want to forward the physical interrupt to the guest as a virtual interrupt. We can use the split priority drop and deactivate feature of the GIC to do this, which leaves an EOI'ed interrupt active on the physical distributor, making sure we don't keep taking timer interrupts which would prevent the guest from running. We can then forward the physical interrupt to the VM using the HW bit in the LR of the GIC, like we do already, which lets the guest directly deactivate both the physical and virtual timer simultaneously, allowing the timer hardware to exit the VM and generate a new physical interrupt when the timer output is again asserted later on. We do need to capture this state when migrating VCPUs between physical CPUs, however, which we use the vcpu put/load functions for, which are called through preempt notifiers whenever the thread is scheduled away from the CPU or called directly if we return from the ioctl to userspace. One caveat is that we have to save and restore the timer state in both kvm_timer_vcpu_[put/load] and kvm_timer_[schedule/unschedule], because we can have the following flows: 1. kvm_vcpu_block 2. kvm_timer_schedule 3. schedule 4. kvm_timer_vcpu_put (preempt notifier) 5. schedule (vcpu thread gets scheduled back) 6. kvm_timer_vcpu_load (preempt notifier) 7. kvm_timer_unschedule And a version where we don't actually call schedule: 1. kvm_vcpu_block 2. kvm_timer_schedule 7. kvm_timer_unschedule Since kvm_timer_[schedule/unschedule] may not be followed by put/load, but put/load also may be called independently, we call the timer save/restore functions from both paths. Since they rely on the loaded flag to never save/restore when unnecessary, this doesn't cause any harm, and we ensure that all invokations of either set of functions work as intended. An added benefit beyond not having to read and write the timer sysregs on every entry and exit is that we no longer have to actively write the active state to the physical distributor, because we configured the irq for the vtimer to only get a priority drop when handling the interrupt in the GIC driver (we called irq_set_vcpu_affinity()), and the interrupt stays active after firing on the host. Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org>
2017-11-06	KVM: arm/arm64: Use separate timer for phys timer emulation	Christoffer Dall
	We were using the same hrtimer for emulating the physical timer and for making sure a blocking VCPU thread would be eventually woken up. That worked fine in the previous arch timer design, but as we are about to actually use the soft timer expire function for the physical timer emulation, change the logic to use a dedicated hrtimer. This has the added benefit of not having to cancel any work in the sync path, which in turn allows us to run the flush and sync with IRQs disabled. Note that the hrtimer used to program the host kernel's timer to generate an exit from the guest when the emulated physical timer fires never has to inject any work, and to share the soft_timer_cancel() function with the bg_timer, we change the function to only cancel any pending work if the pointer to the work struct is not null. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org>
2017-11-06	KVM: arm/arm64: Rename soft timer to bg_timer	Christoffer Dall
	As we are about to introduce a separate hrtimer for the physical timer, call this timer bg_timer, because we refer to this timer as the background timer in the code and comments elsewhere. Signed-off-by: Christoffer Dall <cdall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com>
2017-11-06	KVM: arm/arm64: Make timer_arm and timer_disarm helpers more generic	Christoffer Dall
	We are about to add an additional soft timer to the arch timer state for a VCPU and would like to be able to reuse the functions to program and cancel a timer, so we make them slightly more generic and rename to make it more clear that these functions work on soft timers and not the hardware resource that this code is managing. The armed flag on the timer state is only used to assert a condition, and we don't rely on this assertion in any meaningful way, so we can simply get rid of this flack and slightly reduce complexity. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org>
2017-11-06	ACPI / PM: Take SMART_SUSPEND driver flag into account	Rafael J. Wysocki
	Make the ACPI PM domain take DPM_FLAG_SMART_SUSPEND into account in its system suspend callbacks. [Note that the pm_runtime_suspended() check in acpi_dev_needs_resume() is an optimization, because if is not passed, all of the subsequent checks may be skipped and some of them are much more overhead in general.] Also use the observation that if the device is in runtime suspend at the beginning of the "late" phase of a system-wide suspend-like transition, its state cannot change going forward (runtime PM is disabled for it at that time) until the transition is over and the subsequent system-wide PM callbacks should be skipped for it (as they generally assume the device to not be suspended), so add checks for that in acpi_subsys_suspend_late/noirq() and acpi_subsys_freeze_late/noirq(). Moreover, if acpi_subsys_resume_noirq() is called during the subsequent system-wide resume transition and if the device was left in runtime suspend previously, its runtime PM status needs to be changed to "active" as it is going to be put into the full-power state going forward, so add a check for that too in there. In turn, if acpi_subsys_thaw_noirq() runs after the device has been left in runtime suspend, the subsequent "thaw" callbacks need to be skipped for it (as they may not work correctly with a suspended device), so set the power.direct_complete flag for the device then to make the PM core skip those callbacks. On top of the above, make the analogous changes in the acpi_lpss driver that uses the ACPI PM domain callbacks. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-06	PCI / PM: Take SMART_SUSPEND driver flag into account	Rafael J. Wysocki
	Make the PCI bus type take DPM_FLAG_SMART_SUSPEND into account in its system-wide PM callbacks and make sure that all code that should not run in parallel with pci_pm_runtime_resume() is executed in the "late" phases of system suspend, freeze and poweroff transitions. [Note that the pm_runtime_suspended() check in pci_dev_keep_suspended() is an optimization, because if is not passed, all of the subsequent checks may be skipped and some of them are much more overhead in general.] Also use the observation that if the device is in runtime suspend at the beginning of the "late" phase of a system-wide suspend-like transition, its state cannot change going forward (runtime PM is disabled for it at that time) until the transition is over and the subsequent system-wide PM callbacks should be skipped for it (as they generally assume the device to not be suspended), so add checks for that in pci_pm_suspend_late/noirq(), pci_pm_freeze_late/noirq() and pci_pm_poweroff_late/noirq(). Moreover, if pci_pm_resume_noirq() or pci_pm_restore_noirq() is called during the subsequent system-wide resume transition and if the device was left in runtime suspend previously, its runtime PM status needs to be changed to "active" as it is going to be put into the full-power state, so add checks for that too to these functions. In turn, if pci_pm_thaw_noirq() runs after the device has been left in runtime suspend, the subsequent "thaw" callbacks need to be skipped for it (as they may not work correctly with a suspended device), so set the power.direct_complete flag for the device then to make the PM core skip those callbacks. In addition to the above add a core helper for checking if DPM_FLAG_SMART_SUSPEND is set and the device runtime PM status is "suspended" at the same time, which is done quite often in the new code (and will be done elsewhere going forward too). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
2017-11-06	PM / core: Add SMART_SUSPEND driver flag	Rafael J. Wysocki
	Define and document a SMART_SUSPEND flag to instruct bus types and PM domains that the system suspend callbacks provided by the driver can cope with runtime-suspended devices, so from the driver's perspective it should be safe to leave devices in runtime suspend during system suspend. Setting that flag may also cause middle-layer code (bus types, PM domains etc.) to skip invocations of the ->suspend_late and ->suspend_noirq callbacks provided by the driver if the device is in runtime suspend at the beginning of the "late" phase of the system-wide suspend transition, in which case the driver's system-wide resume callbacks may be invoked back-to-back with its ->runtime_suspend callback, so the driver has to be able to cope with that too. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-11-06	PCI / PM: Use the NEVER_SKIP driver flag	Rafael J. Wysocki
	Replace the PCI-specific flag PCI_DEV_FLAGS_NEEDS_RESUME with the PM core's DPM_FLAG_NEVER_SKIP one everywhere and drop it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-11-06	PM / core: Add NEVER_SKIP and SMART_PREPARE driver flags	Rafael J. Wysocki
	The motivation for this change is to provide a way to work around a problem with the direct-complete mechanism used for avoiding system suspend/resume handling for devices in runtime suspend. The problem is that some middle layer code (the PCI bus type and the ACPI PM domain in particular) returns positive values from its system suspend ->prepare callbacks regardless of whether the driver's ->prepare returns a positive value or 0, which effectively prevents drivers from being able to control the direct-complete feature. Some drivers need that control, however, and the PCI bus type has grown its own flag to deal with this issue, but since it is not limited to PCI, it is better to address it by adding driver flags at the core level. To that end, add a driver_flags field to struct dev_pm_info for flags that can be set by device drivers at the probe time to inform the PM core and/or bus types, PM domains and so on on the capabilities and/or preferences of device drivers. Also add two static inline helpers for setting that field and testing it against a given set of flags and make the driver core clear it automatically on driver remove and probe failures. Define and document two PM driver flags related to the direct- complete feature: NEVER_SKIP and SMART_PREPARE that can be used, respectively, to indicate to the PM core that the direct-complete mechanism should never be used for the device and to inform the middle layer code (bus types, PM domains etc) that it can only request the PM core to use the direct-complete mechanism for the device (by returning a positive value from its ->prepare callback) if it also has been requested by the driver. While at it, make the core check pm_runtime_suspended() when setting power.direct_complete so that it doesn't need to be checked by ->prepare callbacks. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-11-06	Merge branch 'acpi-pm' into pm-core	Rafael J. Wysocki

2017-11-06	Merge remote-tracking branches 'regmap/topic/const' and ↵	Mark Brown
	'regmap/topic/hwspinlock' into regmap-next
2017-11-06	Merge remote-tracking branch 'regmap/topic/core' into regmap-next	Mark Brown

2017-11-06	ALSA: timer: Limit max instances per timer	Takashi Iwai
	Currently we allow unlimited number of timer instances, and it may bring the system hogging way too much CPU when too many timer instances are opened and processed concurrently. This may end up with a soft-lockup report as triggered by syzkaller, especially when hrtimer backend is deployed. Since such insane number of instances aren't demanded by the normal use case of ALSA sequencer and it merely opens a risk only for abuse, this patch introduces the upper limit for the number of instances per timer backend. As default, it's set to 1000, but for the fine-grained timer like hrtimer, it's set to 100. Reported-by: syzbot Tested-by: Jérôme Glisse <jglisse@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-11-06	Merge branch 'x86/mm' into x86/asm, to pick up pending changes	Ingo Molnar
	Concentrate x86 MM and asm related changes into a single super-topic, in preparation for larger changes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-06	Merge branch 'x86/fpu' into x86/asm, to pick up fix	Ingo Molnar
	Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-05	f2fs: support quota sys files	Jaegeuk Kim
	This patch supports hidden quota files in the system, which will be used for Android. It requires up-to-date f2fs-tools later than v1.9.0. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05	f2fs: add quota_ino feature infra	Jaegeuk Kim
	This patch adds quota_ino feature infra to be used for quota files. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05	f2fs: support flexible inline xattr size	Chao Yu
	Now, in product, more and more features based on file encryption were introduced, their demand of xattr space is increasing, however, inline xattr has fixed-size of 200 bytes, once inline xattr space is full, new increased xattr data would occupy additional xattr block which may bring us more space usage and performance regression during persisting. In order to resolve above issue, it's better to expand inline xattr size flexibly according to user's requirement. So this patch introduces new filesystem feature 'flexible inline xattr', and new mount option 'inline_xattr_size=%u', once mkfs enables the feature, we can use the option to make f2fs supporting flexible inline xattr size. To support this feature, we add extra attribute i_inline_xattr_size in inode layout, indicating that how many space inline xattr borrows from block address mapping space in inode layout, by this, we can easily locate and store flexible-sized inline xattr data in inode. Inode disk layout: +----------------------+ \| .i_mode \| \| ... \| \| .i_ext \| +----------------------+ \| .i_extra_isize \| \| .i_inline_xattr_size \|-----------+ \| ... \| \| +----------------------+ \| \| .i_addr \| \| \| - block address or \| \| \| - inline data \| \| +----------------------+<---+ v \| inline xattr \| +---inline xattr range +----------------------+<---+ \| .i_nid \| +----------------------+ \| node_footer \| \| (nid, ino, offset) \| +----------------------+ Note that, we have to cnosider backward compatibility which reserved inline_data space, 200 bytes, all the time, reported by Sheng Yong. Previous inline data or directory always reserved 200 bytes in inode layout, even if inline_xattr is disabled. In order to keep inline_dentry's structure for backward compatibility, we get the space back only from inline_data. Signed-off-by: Chao Yu <yuchao0@huawei.com> Reported-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05	include/linux/fs.h: fix comment about struct address_space	Mike Rapoport
	Before commit 9c5d760b8d22 ("mm: split gfp_mask and mapping flags into separate fields") the private_* fields of struct adrress_space were grouped together and using "ditto" in comments describing the last fields was correct. With introduction of gpf_mask between private_lock and private_list "ditto" references the wrong description. Fix it by using the elaborate description. Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-05	Merge branch 'perf-urgent-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Various fixes: - synchronize kernel and tooling headers - cgroup support fix - two tooling fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tools/headers: Synchronize kernel ABI headers perf/cgroup: Fix perf cgroup hierarchy support perf tools: Unwind properly location after REJECT perf symbols: Fix memory corruption because of zero length symbols
2017-11-05	Merge branch 'core-urgent-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core fixes from Ingo Molnar: - workaround for gcc asm handling - futex race fixes - objtool build warning fix - two watchdog fixes: a crash fix (revert) and a bug fix for /proc/sys/kernel/watchdog_thresh handling. * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Prevent GCC from merging annotate_unreachable(), take 2 objtool: Resync objtool's instruction decoder source code copy with the kernel's latest version watchdog/hardlockup/perf: Use atomics to track in-use cpu counter watchdog/harclockup/perf: Revert a33d44843d45 ("watchdog/hardlockup/perf: Simplify deferred event destroy") futex: Fix more put_pi_state() vs. exit_pi_state_list() races
2017-11-05	bpf, cgroup: implement eBPF-based device controller for cgroup v2	Roman Gushchin
	Cgroup v2 lacks the device controller, provided by cgroup v1. This patch adds a new eBPF program type, which in combination of previously added ability to attach multiple eBPF programs to a cgroup, will provide a similar functionality, but with some additional flexibility. This patch introduces a BPF_PROG_TYPE_CGROUP_DEVICE program type. A program takes major and minor device numbers, device type (block/character) and access type (mknod/read/write) as parameters and returns an integer which defines if the operation should be allowed or terminated with -EPERM. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05	device_cgroup: prepare code for bpf-based device controller	Roman Gushchin
	This is non-functional change to prepare the device cgroup code for adding eBPF-based controller for cgroups v2. The patch performs the following changes: 1) __devcgroup_inode_permission() and devcgroup_inode_mknod() are moving to the device-cgroup.h and converting into static inline. 2) __devcgroup_check_permission() is exported. 3) devcgroup_check_permission() wrapper is introduced to be used by both existing and new bpf-based implementations. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05	Merge tag 'mlx5-updates-2017-11-04' of ↵	David S. Miller
	git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2017-11-04 This series includes: From Huy: dscp to priority mapping for Ethernet packet. =================================================== First six patches enable differentiated services code point (dscp) to priority mapping for Ethernet packet. Once this feature is enabled, the packet is routed to the corresponding priority based on its dscp. User can combine this feature with priority flow control (pfc) feature to have priority flow control based on the dscp. Firmware interface: Mellanox firmware provides two control knobs for this feature: QPTS register allow changing the trust state between dscp and pcp mode. The default is pcp mode. Once in dscp mode, firmware will route the packet based on its dscp value if the dscp field exists. QPDPM register allow mapping a specific dscp (0 to 63) to a specific priority (0 to 7). By default, all the dscps are mapped to priority zero. Software interface: This feature is controlled via application priority TLV. IEEE specification P802.1Qcd/D2.1 defines priority selector id 5 for application priority TLV. This APP TLV selector defines DSCP to priority map. This APP TLV can be sent by the switch or can be set locally using software such as lldptool. In mlx5 drivers, we add the support for net dcb's getapp and setapp call back. Mlx5 driver only handles the selector id 5 application entry (dscp application priority application entry). If user sends multiple dscp to priority APP TLV entries on the same dscp, the last sent one will take effect. All the previous sent will be deleted. The firmware trust state (in QPTS register) is changed based on the number of dscp to priority application entries. When the first dscp to priority application entry is added by the user, the trust state is changed to dscp. When the last dscp to priority application entry is deleted by the user, the trust state is changed to pcp. When the port is in DSCP trust state, the transmit queue is selected based on the dscp of the skb. When the port is in DSCP trust state and vport inline mode is not NONE, firmware requires mlx5 driver to copy the IP header to the wqe ethernet segment inline header if the skb has it. This is done by changing the transmit queue sq's min inline mode to L3. Note that the min inline mode of sqs that belong to other features such as xdpsq, icosq are not modified. =================================================== Plus to the dscp series, some small misc changes are include as well: From Inbar, Ethtool msglvl support and some debug prints in DCBNL logic From Or Gerlitz, Enlarge the NIC TC offload table size From Rabie, Initialize destination_flow struct to 0 From Feras, Add inner TTC table to IPoIB flow steering From Tal, Enable CQE based moderation on TX CQ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05	tcp: higher throughput under reordering with adaptive RACK reordering wnd	Priyaranjan Jha
	Currently TCP RACK loss detection does not work well if packets are being reordered beyond its static reordering window (min_rtt/4).Under such reordering it may falsely trigger loss recoveries and reduce TCP throughput significantly. This patch improves that by increasing and reducing the reordering window based on DSACK, which is now supported in major TCP implementations. It makes RACK's reo_wnd adaptive based on DSACK and no. of recoveries. - If DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded by srtt), since there is possibility that spurious retransmission was due to reordering delay longer than reo_wnd. - Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16) no. of successful recoveries (accounts for full DSACK-based loss recovery undo). After that, reset it to default (min_rtt/4). - At max, reo_wnd is incremented only once per rtt. So that the new DSACK on which we are reacting, is due to the spurious retx (approx) after the reo_wnd has been updated last time. - reo_wnd is tracked in terms of steps (of min_rtt/4), rather than absolute value to account for change in rtt. In our internal testing, we observed significant increase in throughput, in scenarios where reordering exceeds min_rtt/4 (previous static value). Signed-off-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05	net: dsa: make tree index unsigned	Vivien Didelot
	Similarly to a DSA switch and port, rename the tree index from "tree" to "index" and make it an unsigned int because it isn't supposed to be less than 0. u32 is an OF specific data used to retrieve the value and has no need to be propagated up to the tree index. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05	net: dsa: make switch index unsigned	Vivien Didelot
	Define the DSA switch index as an unsigned int, because it will never be less than 0. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05	bpf: remove old offload/analyzer	Jakub Kicinski
	Thanks to the ability to load a program for a specific device, running verifier twice is no longer needed. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05	xdp: allow attaching programs loaded for specific device	Jakub Kicinski
	Pass the netdev pointer to bpf_prog_get_type(). This way BPF code can decide whether the device matches what the code was loaded/translated for. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05	bpf: report offload info to user space	Jakub Kicinski
	Extend struct bpf_prog_info to contain information about program being bound to a device. Since the netdev may get destroyed while program still exists we need a flag to indicate the program is loaded for a device, even if the device is gone. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05	bpf: offload: add infrastructure for loading programs for a specific netdev	Jakub Kicinski
	The fact that we don't know which device the program is going to be used on is quite limiting in current eBPF infrastructure. We have to reverse or limit the changes which kernel makes to the loaded bytecode if we want it to be offloaded to a networking device. We also have to invent new APIs for debugging and troubleshooting support. Make it possible to load programs for a specific netdev. This helps us to bring the debug information closer to the core eBPF infrastructure (e.g. we will be able to reuse the verifer log in device JIT). It allows device JITs to perform translation on the original bytecode. __bpf_prog_get() when called to get a reference for an attachment point will now refuse to give it if program has a device assigned. Following patches will add a version of that function which passes the expected netdev in. @type argument in __bpf_prog_get() is renamed to attach_type to make it clearer that it's only set on attachment. All calls to ndo_bpf are protected by rtnl, only verifier callbacks are not. We need a wait queue to make sure netdev doesn't get destroyed while verifier is still running and calling its driver. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05	net: bpf: rename ndo_xdp to ndo_bpf	Jakub Kicinski
	ndo_xdp is a control path callback for setting up XDP in the driver. We can reuse it for other forms of communication between the eBPF stack and the drivers. Rename the callback and associated structures and definitions. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05	rtnetlink: use netnsid to query interface	Jiri Benc
	Currently, when an application gets netnsid from the kernel (for example as the result of RTM_GETLINK call on one end of the veth pair), it's not much useful. There's no reliable way to get to the netns fd from the netnsid, nor does any kernel API accept netnsid. Extend the RTM_GETLINK call to also accept netnsid. It will operate on the netns with the given netnsid in such case. Of course, the calling process needs to have enough capabilities in the target name space; for now, require CAP_NET_ADMIN. This can be relaxed in the future. To signal to the calling process that the kernel understood the new IFLA_IF_NETNSID attribute in the query, it will include it in the response. This is needed to detect older kernels, as they will just ignore IFLA_IF_NETNSID and query in the current name space. This patch implemetns IFLA_IF_NETNSID only for get and dump. For set operations, this can be extended later. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05	openvswitch: reliable interface indentification in port dumps	Jiri Benc
	This patch allows reliable identification of netdevice interfaces connected to openvswitch bridges. In particular, user space queries the netdev interfaces belonging to the ports for statistics, up/down state, etc. Datapath dump needs to provide enough information for the user space to be able to do that. Currently, only interface names are returned. This is not sufficient, as openvswitch allows its ports to be in different name spaces and the interface name is valid only in its name space. What is needed and generally used in other netlink APIs, is the pair ifindex+netnsid. The solution is addition of the ifindex+netnsid pair (or only ifindex if in the same name space) to vport get/dump operation. On request side, ideally the ifindex+netnsid pair could be used to get/set/del the corresponding vport. This is not implemented by this patch and can be added later if needed. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05	i2c: nuc900: remove platform_data, too	Wolfram Sang
	Commit 7da62cb1853025 ("i2c: nuc900: remove driver") removed the driver, we should remove the platform_data as well. Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2017-11-04	net/mlx5: QPTS and QPDPM register firmware command support	Huy Nguyen
	The QPTS register allows changing the priority trust state between pcp and dscp. Add support to get/set trust state from device. When the port is in pcp/dscp trust state, packet is routed by hardware to matching priority based on its pcp/dscp value respectively. The QPDPM register allow channing the dscp to priority mapping. Add support to get/set dscp to priority mapping from device. Note that to change a dscp mapping, the "e" bit of this dscp structure must be set in the QPDPM firmware command. Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-04	net/mlx5: Add MLX5_SET16 and MLX5_GET16	Huy Nguyen
	Add MLX5_SET16 and MLX5_GET16 for 16bit structure field in firmware command. Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-04	net/mlx5: QCAM register firmware command support	Huy Nguyen
	The QCAM register provides capability bit for all the QoS registers using ACCESS_REG command. Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>