linux.git - Linus' kernel tree

Age	Commit message	Author
2016-04-25	cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback	Tejun Heo
	Since e93ad19d0564 ("cpuset: make mm migration asynchronous"), cpuset kicks off asynchronous NUMA node migration if necessary during task migration and flushes it from cpuset_post_attach_flush() which is called at the end of __cgroup_procs_write(). This is to avoid performing migration with cgroup_threadgroup_rwsem write-locked which can lead to deadlock through dependency on kworker creation. memcg has a similar issue with charge moving, so let's convert it to an official callback rather than the current one-off cpuset specific function. This patch adds cgroup_subsys->post_attach callback and makes cpuset register cpuset_post_attach_flush() as its ->post_attach. The conversion is mostly one-to-one except that the new callback is called under cgroup_mutex. This is to guarantee that no other migration operations are started before ->post_attach callbacks are finished. cgroup_mutex is one of the outermost mutexes in the system and has never been and shouldn't be a problem. We can add specialized synchronization around __cgroup_procs_write() but I don't think there's any noticeable benefit. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
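The callback pattern described in this commit can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct subsys`, `migrate_task()`, and the counters are hypothetical stand-ins, not the kernel's cgroup structures, and a single flag models the write-locked cgroup_threadgroup_rwsem.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a subsystem descriptor. */
struct subsys {
    const char *name;
    void (*attach)(int task_id);   /* runs while the migration lock is held */
    void (*post_attach)(void);     /* optional: runs after the lock is dropped */
};

static int migration_locked;       /* models the write-locked rwsem */
static int flushes_under_lock;     /* flushes that ran at the wrong time */
static int flushes_done;

static void cpuset_like_attach(int task_id)
{
    (void)task_id;                 /* a real subsystem would queue async work here */
}

static void cpuset_like_flush(void)
{
    if (migration_locked)
        flushes_under_lock++;
    flushes_done++;
}

/* Migrate one task, then let every subsystem that registered a hook flush. */
static void migrate_task(struct subsys *ss, size_t n, int task_id)
{
    size_t i;

    migration_locked = 1;
    for (i = 0; i < n; i++)
        if (ss[i].attach)
            ss[i].attach(task_id);
    migration_locked = 0;

    for (i = 0; i < n; i++)
        if (ss[i].post_attach)
            ss[i].post_attach();
}
```

Because the hook is optional and lives in the descriptor, a second subsystem with the same need (memcg in the commit text) could register its own flush without another one-off call site.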
2016-04-25	Merge branches 'pci/enumeration', 'pci/hotplug', 'pci/misc', 'pci/ntb', 'pci/thunderbolt' and 'pci/virtualization' into next	Bjorn Helgaas
	* pci/enumeration: x86/PCI: Refine PCI support check in pcibios_init() * pci/hotplug: PCI: acpiphp_ibm: Avoid uninitialized variable reference * pci/misc: PCI: Fix spelling errors * pci/ntb: PCI: Add DMA alias quirk for mic_x200_dma PCI: Add support for multiple DMA aliases PCI: Move informational printk to pci_add_dma_alias() PCI: Add pci_add_dma_alias() to abstract implementation * pci/thunderbolt: thunderbolt: Support 1st gen Light Ridge controller thunderbolt: Fix typos and magic number PCI: Add Intel Thunderbolt device IDs * pci/virtualization: PCI: Work around Intel Sunrise Point PCH incorrect ACS capability PCI: Reverse standard ACS vs device-specific ACS enabling PCI: Mark Intel i40e NIC INTx masking as broken
2016-04-25	ipv6: Revert optional address flushing on ifdown.	David S. Miller
	This reverts the following three commits: 70af921db6f8835f4b11c65731116560adb00c14 799977d9aafbf0ca0b9c39b04cbfb16db71302c9 f1705ec197e705b79ea40fe7a2cc5acfa1d3bfac The feature was ill conceived, has terrible semantics, and has added nothing but regressions to the already fragile ipv6 stack. Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional") Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25	ieee802154: use nla_put_u64_64bit()	Nicolas Dichtel
	Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25	libceph: make authorizer destruction independent of ceph_auth_client	Ilya Dryomov
	Starting the kernel client with cephx disabled and then enabling cephx and restarting userspace daemons can result in a crash: [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000 [262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130 [262671.584334] PGD 0 [262671.635847] Oops: 0000 [#1] SMP [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014 [262672.268937] Workqueue: ceph-msgr con_work [libceph] [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000 [262672.428330] RIP: 0010:[<ffffffff811cd04a>] [<ffffffff811cd04a>] kfree+0x5a/0x130 [262672.535880] RSP: 0018:ffff880149aeba58 EFLAGS: 00010286 [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018 [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012 [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000 [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40 [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840 [262673.131722] FS: 0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000 [262673.245377] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0 [262673.417556] Stack: [262673.472943] ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04 [262673.583767] ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00 [262673.694546] ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800 [262673.805230] Call Trace: [262673.859116] [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph] [262673.968705] [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph] [262674.078852] [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph] 
[262674.134249] [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph] [262674.189124] [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph] [262674.243749] [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph] [262674.297485] [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph] [262674.350813] [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph] [262674.403312] [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph] [262674.454712] [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690 [262674.505096] [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph] [262674.555104] [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0 [262674.604072] [<ffffffff810901ea>] worker_thread+0x11a/0x470 [262674.652187] [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310 [262674.699022] [<ffffffff810957a2>] kthread+0xd2/0xf0 [262674.744494] [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0 [262674.789543] [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70 [262674.834094] [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0 What happens is the following: (1) new MON session is established (2) old "none" ac is destroyed (3) new "cephx" ac is constructed ... (4) old OSD session (w/ "none" authorizer) is put ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer) osd->o_auth.authorizer in the "none" case is just a bare pointer into ac, which contains a single static copy for all services. By the time we get to (4), "none" ac, freed in (2), is long gone. On top of that, a new vtable installed in (3) points us at ceph_x_destroy_authorizer(), so we end up trying to destroy a "none" authorizer with a "cephx" destructor operating on invalid memory! To fix this, decouple authorizer destruction from ac and do away with a single static "none" authorizer by making a copy for each OSD or MDS session. Authorizers themselves are independent of ac and so there is no reason for destroy_authorizer() to be an ac op. Make it an op on the authorizer itself by turning ceph_authorizer into a real struct. 
Fixes: http://tracker.ceph.com/issues/15447 Reported-by: Alan Zhang <alan.zhang@linux.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
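The fix described above (one authorizer copy per session, with the destructor carried by the authorizer itself) can be sketched as follows. The names `struct authorizer`, `none_create()`, and `authorizer_destroy()` are hypothetical and simplified for illustration; they are not the libceph definitions.

```c
#include <stdlib.h>

/* Hypothetical simplified authorizer: the destructor travels with the object. */
struct authorizer {
    void (*destroy)(struct authorizer *a);
};

struct none_authorizer {
    struct authorizer base;   /* kept first so the cast in none_destroy() is valid */
    char buf[16];
};

static int none_destroyed;

static void none_destroy(struct authorizer *a)
{
    free((struct none_authorizer *)a);
    none_destroyed++;
}

/* Each session gets its own copy instead of pointing into shared client state. */
static struct authorizer *none_create(void)
{
    struct none_authorizer *au = calloc(1, sizeof(*au));

    if (!au)
        return NULL;
    au->base.destroy = none_destroy;
    return &au->base;
}

/* No auth-client argument, so a later vtable swap cannot select the wrong destructor. */
static void authorizer_destroy(struct authorizer *a)
{
    if (a)
        a->destroy(a);
}
```

In this shape, tearing down an old session after the client has switched protocols still calls the destructor that matches how the authorizer was created.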
2016-04-25	PM / OPP: dev_pm_opp_set_sharing_cpus() doesn't depend on CONFIG_OF	Viresh Kumar
	dev_pm_opp_set_sharing_cpus() doesn't do any DT specific stuff and its declarations are added within the CONFIG_OF ifdef by mistake. Take them out of that. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-04-25	arm64/perf: Filter common events based on PMCEIDn_EL0	Ashok Kumar
	The complete common architectural and micro-architectural event number structure is filtered based on PMCEIDn_EL0 and exposed to /sys using the is_visible function pointer in the events attribute_group. To filter the events in the is_visible function, a PMCEID-based bitmap is stored in the arm_pmu structure, and the id field from perf_pmu_events_attr is checked against the bitmap. The function which derives the event bitmap from PMCEIDn_EL0 is executed on the CPUs whose PMU is being initialized, for heterogeneous PMU support. Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Ashok Kumar <ashoks@broadcom.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-04-24	net/mlx5: Add pci shutdown callback	Majd Dibbiny
	This patch introduces kexec support for mlx5. When switching kernels, kexec() calls shutdown, which unloads the driver and cleans up its resources. In addition, remove the netdev unregistration from the shutdown flow. This allows a clean shutdown even if some netdev clients did not release their references to this netdev. Releasing only the HW resources is enough, as the kernel is shutting down. Signed-off-by: Majd Dibbiny <majd@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Haggai Abramovsky <hagaya@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24	net/mlx5e: Use vport MTU rather than physical port MTU	Saeed Mahameed
	Set and report the vport MTU rather than the physical port MTU. The driver sets both the vport and physical port MTU and relies on the vport MTU query. SRIOV VFs have to report their MTU to their vport manager (PF), and this will allow them to work with any MTU they need without failing the request. Also, in cases where the PF is not a port owner, the PF can work with an MTU smaller than the physical port MTU if setting the physical port MTU did not take effect. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24	net/mlx5e: Device's mtu field is u16 and not int	Saeed Mahameed
	For set/query MTU port firmware commands the MTU field is 16 bits, here I changed all the "int mtu" parameters of the functions wrapping those firmware commands to be u16. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next	David S. Miller
	Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree, mostly from Florian Westphal to sort out the lack of sufficient validation in x_tables and connlabel preparation patches to add nf_tables support. They are: 1) Ensure we don't go over the ruleset blob boundaries in mark_source_chains(). 2) Validate that target jumps land on an existing xt_entry. This extra sanitization comes with a performance penalty when loading the ruleset. 3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables. 4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables. 5) Make sure the minimal possible target size in x_tables. 6) Similar to #3, add xt_compat_check_entry_offsets() for compat code. 7) Check that standard target size is valid. 8) More sanitization to ensure that the target_offset field is correct. 9) Add xt_check_entry_match() to validate that matches are well-formed. 10-12) Three patches to reduce the number of parameters in translate_compat_table() for {arp,ip,ip6}tables by using a container structure. 13) No need to return value from xt_compat_match_from_user(), so make it void. 14) Consolidate translate_table() so it can be used by compat code too. 15) Remove obsolete check for compat code, so we keep consistent with what was already removed in the native layout code (back in 2007). 16) Get rid of target jump validation from mark_source_chains(), obsoleted by #2. 17) Introduce xt_copy_counters_from_user() to consolidate counter copying, and use it from {arp,ip,ip6}tables. 18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump functions. 19) Move nf_connlabel_match() to xt_connlabel. 20) Skip event notification if connlabel did not change. 21) Update of nf_connlabels_get() to make the upcoming nft connlabel support easier. 23) Remove spinlock to read protocol state field in conntrack. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal	Linus Torvalds
	Pull thermal fixes from Eduardo Valentin: "Specifics in this pull request: - Fixes in mediatek and OF thermal drivers - Fixes in power_allocator governor - More fixes of unsigned to int type change in thermal_core.c. These changes have been CI tested using KernelCI bot. \o/" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal: thermal: fix Mediatek thermal controller build thermal: consistently use int for trip temp thermal: fix mtk_thermal build dependency thermal: minor mtk_thermal.c cleanups thermal: power_allocator: req_range multiplication should be a 64 bit type thermal: of: add __init attribute
2016-04-23	libnl: nla_put_net64(): align on a 64-bit area	Nicolas Dichtel
	nla_data() is now aligned on a 64-bit area. The temporary function nla_put_be64_32bit() is removed in this patch. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
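As a rough illustration of what "aligned on a 64-bit area" involves, the sketch below computes how much padding has to precede an attribute so that its payload starts on an 8-byte boundary. It assumes a 4-byte attribute header and a write offset that is already 4-byte aligned; `pad_before_64bit_attr()` is a hypothetical helper, not a kernel function.

```c
#include <stddef.h>

#define ATTR_HDRLEN 4u   /* attribute header: 2-byte length + 2-byte type */

/*
 * Bytes of padding to emit before an attribute so that its payload, which
 * follows the 4-byte header, lands on an 8-byte boundary. The padding is
 * itself one header-only attribute, hence 0 or 4 bytes.
 */
static size_t pad_before_64bit_attr(size_t tail_offset)
{
    return ((tail_offset + ATTR_HDRLEN) % 8u == 0) ? 0 : ATTR_HDRLEN;
}
```

With these assumptions, padding is needed exactly when the current write offset is itself 8-byte aligned, because the header would then push the payload to an offset of 4 modulo 8.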
2016-04-23	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	David S. Miller
	Conflicts were two cases of simple overlapping changes, nothing serious. In the UDP case, we need to add a hlist_add_tail_rcu() to linux/rculist.h, because we've moved UDP socket handling away from using nulls lists. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23	iio:imu:mpu6050: enhance mounting matrix support	Gregor Boirie
	Add a new rotation matrix sysfs attribute compliant with the IIO core mounting matrix API. The matrix is retrieved from the "in_anglvel_mount_matrix" and "in_accel_mount_matrix" sysfs attributes. It is declared in the mpu6050 DTS entry as a "mount-matrix" property. The old interface is kept for backward userspace compatibility and may be retrieved from the legacy platform_data mechanism only. Signed-off-by: Gregor Boirie <gregor.boirie@parrot.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
2016-04-23	iio:ak8975: add mounting matrix support	Gregor Boirie
	Expose a rotation matrix to indicate to userspace the chip orientation with respect to the overall hardware system. The matrix is retrieved from "in_mount_matrix". It is declared in the ak8975 DTS entry as a "mount-matrix" property. Signed-off-by: Gregor Boirie <gregor.boirie@parrot.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
2016-04-23	iio:core: mounting matrix support	Gregor Boirie
	Expose a rotation matrix to indicate to userspace the chip placement with respect to the overall hardware system. This is needed to adjust coordinates sampled from a sensor chip when its position deviates from the main hardware system. Final coordinates computation is delegated to userspace since: * computation may involve floating point arithmetic; * it allows an application to combine adjustments with arbitrary transformations. This 3-dimensional space rotation matrix is expressed as a 3x3 array of strings to support floating point numbers. It may be retrieved from a "[<dir>_][<type>_]mount_matrix" sysfs attribute file. It is declared in a device / driver specific DTS property or platform data. Signed-off-by: Gregor Boirie <gregor.boirie@parrot.com> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
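The userspace side of this design might look like the following minimal sketch. It assumes the nine sysfs strings have already been parsed into doubles (parsing is not shown) and that the matrix is applied as a left multiplication of a column vector; `apply_mount_matrix()` is a hypothetical helper name, and the authoritative convention is whatever the IIO ABI documentation specifies.

```c
/* Rotate one raw sample by a 3x3 mounting matrix: out = m * in. */
static void apply_mount_matrix(const double m[3][3], const double in[3],
                               double out[3])
{
    for (int r = 0; r < 3; r++) {
        out[r] = 0.0;
        for (int c = 0; c < 3; c++)
            out[r] += m[r][c] * in[c];
    }
}
```

Keeping this step in userspace matches the rationale in the commit text: the multiplication uses floating point, and an application is free to compose it with further transformations of its own.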
2016-04-23	sched/fair: Correctly handle nohz ticks CPU load accounting	Frederic Weisbecker
	Ticks can happen while the CPU is in dynticks-idle or dynticks-singletask mode. In fact "nohz" or "dynticks" only mean that we exit the periodic mode and we try to minimize the ticks as much as possible. The nohz subsystem uses a confusing terminology with the internal state "ts->tick_stopped" which is also available through its public interface with tick_nohz_tick_stopped(). This is a misnomer as the tick is instead reduced with the best effort rather than stopped. In the best case the tick can indeed be actually stopped but there is no guarantee about that. If a timer needs to fire one second later, a tick will fire while the CPU is in nohz mode and this is a very common scenario. Now this confusion happens to be a problem with CPU load updates: cpu_load_update_active() doesn't handle nohz ticks correctly because it assumes that ticks are completely stopped in nohz mode and that cpu_load_update_active() can't be called in dynticks mode. When that happens, the whole previous tickless load is ignored and the function just records the load for the current tick, ignoring potentially long idle periods behind. In order to solve this, we could account the current load for the previous nohz time but there is a risk that we account the load of a task that got freshly enqueued for the whole nohz period. So instead, let's record the dynticks load on nohz frame entry so we know what to record in case of nohz ticks, then use this record to account the tickless load on nohz ticks and nohz frame end. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1460555812-25375-3-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23	sched/fair: Gather CPU load functions under a more conventional namespace	Frederic Weisbecker
	The CPU load update related functions have a weak naming convention currently, starting with update_cpu_load_*() which isn't ideal as "update" is a very generic concept. Since two of these functions are public already (and a third is to come) that's enough to introduce a more conventional naming scheme. So let's do the following rename instead: update_cpu_load_*() -> cpu_load_update_*() Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1460555812-25375-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23	Merge tag 'v4.6-rc4' into sched/core, to refresh the tree	Ingo Molnar
	Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23	perf/core: Add ::write_backward attribute to perf event	Wang Nan
	This patch introduces the 'write_backward' bit to perf_event_attr, which controls the direction of a ring buffer. Once it is set, the corresponding ring buffer is written from end to beginning. This feature is designed to support reading from an overwritable ring buffer.

	A ring buffer can be created by mapping a perf event fd. The kernel puts event records into the ring buffer, and user tooling like perf fetches them from the address returned by mmap(). To prevent races between kernel and tooling, they communicate with each other through the 'head' and 'tail' pointers. The kernel maintains the 'head' pointer and points it to the next free area (the tail of the last record). Tooling maintains the 'tail' pointer and points it to the tail of the last consumed record (a record that has already been fetched). The kernel determines the available space in a ring buffer using these two pointers, to avoid overwriting unfetched records.

	By mapping without 'PROT_WRITE', an overwritable ring buffer is created. Unlike a normal ring buffer, tooling is unable to maintain the 'tail' pointer because writing is forbidden. Therefore, for this type of ring buffer, the kernel overwrites old records unconditionally and works like a flight recorder. This feature would be useful if reading from an overwritable ring buffer were as easy as reading from a normal ring buffer. However, there's an obscure problem.

	The following figure demonstrates a full overwritable ring buffer. In this figure, the 'head' pointer points to the end of the last record, and a long record 'E' is pending. For a normal ring buffer, a 'tail' pointer would have pointed to position (X), so the kernel knows there's no more space in the ring buffer. However, for an overwritable ring buffer, the kernel ignores the 'tail' pointer.

	   (X)                              head
	    .                                |
	    .                                V
	    +------+-------+----------+------+---+
	    |A....A|B.....B|C........C|D....D|   |
	    +------+-------+----------+------+---+

	Record 'A' is overwritten by event 'E':

	      head
	       |
	       V
	    +--+---+-------+----------+------+---+
	    |.E|..A|B.....B|C........C|D....D|E..|
	    +--+---+-------+----------+------+---+

	Now tooling decides to read from this ring buffer. However, neither of the two natural positions, 'head' and the start of this ring buffer, points to the head of a record. Even though the full ring buffer can be accessed by tooling, it is unable to find a position to start decoding.

	The first attempt to solve this problem, AFAIK, can be found in [1]. It makes the kernel maintain the 'tail' pointer and update it when the ring buffer is half full. However, this approach introduces overhead to the fast path. Test results show a 1% overhead [2]. In addition, this method utilizes no more than 50% of the records.

	Another attempt can be found in [3], which allows putting the size of an event at the end of each record. This approach allows tooling to find records in a backward manner from the 'head' pointer by reading the size of a record from its tail. However, because of alignment requirements, it needs 8 bytes to record the size of a record, which is a huge waste. Its performance is also not good, because more data needs to be written. This approach also introduces some extra branch instructions to the fast path.

	'write_backward' is a better solution to this problem. The following figure demonstrates the state of the overwritable ring buffer when 'write_backward' is set, before overwriting:

	       head
	        |
	        V
	    +---+------+----------+-------+------+
	    |   |D....D|C........C|B.....B|A....A|
	    +---+------+----------+-------+------+

	and after overwriting:

	                                     head
	                                      |
	                                      V
	    +---+------+----------+-------+---+--+
	    |..E|D....D|C........C|B.....B|A..|E.|
	    +---+------+----------+-------+---+--+

	In each situation, 'head' points to the beginning of the newest record. From this record, tooling can iterate over the full ring buffer and fetch records one by one.

	The only limitation that needs to be considered is back-to-back reading. Because user programs are non-deterministic, it is impossible to ensure the ring buffer stays stable during reading. Consider an extreme situation: tooling is scheduled out after reading record 'D', then a burst of events comes and eats up the whole ring buffer (one or multiple rounds). When the tooling process comes back, reading after 'D' is now incorrect.

	To prevent this problem, we need to find a way to ensure the ring buffer is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is suggested because its overhead is lower than ioctl(PERF_EVENT_IOC_ENABLE).

	By carefully verifying the 'header' pointer, a reader can avoid pausing the ring buffer. For example:

	    /* A union of all possible events */
	    union perf_event event;

	    p = head = perf_mmap__read_head();
	    while (true) {
	        /* copy header of next event */
	        fetch(&event.header, p, sizeof(event.header));

	        /* read 'head' pointer */
	        head = perf_mmap__read_head();

	        /* check overwritten: is the header good? */
	        if (!verify(sizeof(event.header), p, head))
	            break;

	        /* copy the whole event */
	        fetch(&event, p, event.header.size);

	        /* read 'head' pointer again */
	        head = perf_mmap__read_head();

	        /* is the whole event good? */
	        if (!verify(event.header.size, p, head))
	            break;
	        p += event.header.size;
	    }

	However, the overhead is high because: a) In-place decoding is not safe. Copying-verifying-decoding is required. b) Fetching the 'head' pointer requires additional synchronization.

	(From Alexei Starovoitov: Even when this trick works, pause is needed for more than stability of reading. When we collect the events into overwrite buffer we're waiting for some other trigger (like all cpu utilization spike or just one cpu running and all others are idle) and when it happens the buffer has valuable info from the past. At this point new events are no longer interesting and buffer should be paused, events read and unpaused until next trigger comes.)

	This patch utilizes the event's default overflow_handler introduced previously. perf_event_output_backward() is created as the default overflow handler for backward ring buffers. To avoid extra overhead on the fast path, the original perf_event_output() becomes __perf_event_output() and is marked '__always_inline'. In theory, there's no extra overhead introduced to the fast path.

	Performance testing: Calling 'close(-1)' 3000000 times, using gettimeofday() to check the duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture system calls. In ns.

	Testing environment: CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz Kernel : v4.5.0

	                MEAN        STDVAR
	    BASE     800214.950    2853.083
	    PRE1    2253846.700    9997.014
	    PRE2    2257495.540    8516.293
	    POST    2250896.100    8933.921

	Where 'BASE' is pure performance without capturing. 'PRE1' is the test result of the pure 'v4.5.0' kernel. 'PRE2' is the test result before this patch. 'POST' is the test result after this patch. See [4] for the detailed experimental setup. Considering the stdvar, this patch doesn't introduce performance overhead to the fast path.
[1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com Signed-off-by: Wang Nan <wangnan0@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: <acme@kernel.org> Cc: <pi3orama@163.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com [ Fixed the changelog some more. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
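To make the backward-written layout described in this commit concrete, here is a small userspace simulation. It is a toy model under loud assumptions: a 16-byte buffer, a made-up record layout (byte 0 is the total record size, byte 1 an id), and hypothetical helpers `rb_write_backward()` and `rb_read_newest_first()`. It does not use the perf mmap ABI and assumes the buffer is stable (paused) while it is read.

```c
#include <stdint.h>
#include <stddef.h>

#define RB_SIZE 16u                 /* power of two, so masking handles wrap-around */
#define RB_MASK (RB_SIZE - 1u)

struct ring {
    uint8_t  data[RB_SIZE];
    uint64_t head;                  /* free-running; decreases on every write */
};

/* Toy record layout (an assumption): byte 0 = total size, byte 1 = id, rest filler. */
static void rb_write_backward(struct ring *rb, uint8_t id, uint8_t size)
{
    rb->head -= size;               /* head now marks the start of the newest record */
    for (uint8_t i = 0; i < size; i++) {
        uint8_t v = (i == 0) ? size : (i == 1) ? id : 0xff;
        rb->data[(rb->head + i) & RB_MASK] = v;
    }
}

/* Walk from head (newest) towards older records; returns how many ids were stored. */
static size_t rb_read_newest_first(const struct ring *rb, uint8_t *ids, size_t max)
{
    uint64_t p = rb->head;
    size_t used = 0, n = 0;

    while (n < max && used < RB_SIZE) {
        uint8_t size = rb->data[p & RB_MASK];

        if (size < 2 || used + size > RB_SIZE)
            break;                  /* never-written space, or a partly overwritten record */
        ids[n++] = rb->data[(p + 1) & RB_MASK];
        p += size;
        used += size;
    }
    return n;
}
```

Because every write moves 'head' to the start of the record it just produced, the reader always has a valid record header to start from, and the oldest record that was partly overwritten is detected when its size would exceed the buffer capacity.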
2016-04-23	Merge branch 'perf/urgent' into perf/core, to resolve conflict	Ingo Molnar
	Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23	lockdep: Fix lock_chain::base size	Peter Zijlstra
	lock_chain::base is used to store an index into the chain_hlocks[] array, however that array contains more elements than can be indexed using a u16. Change the lock_chain structure to use a bitfield to encode the data it needs and add BUILD_BUG_ON() assertions to check the fields are wide enough. Also, for DEBUG_LOCKDEP, assert that we don't run out of elements of that array; as that would wreck the collision detection. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alfredo Alvarez Fernandez <alfredoalvarezfernandez@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160330093659.GS3408@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
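The combination of a bitfield with build-time width checks can be sketched in standard C, using `_Static_assert` in place of the kernel's BUILD_BUG_ON(). The field widths and the two limits below are illustrative assumptions for this sketch, not values taken from the patch.

```c
#include <stdint.h>

#define MAX_DEPTH   48u             /* hypothetical limit for this sketch */
#define MAX_HLOCKS  (5u * 65536u)   /* hypothetical: more entries than a u16 can index */

struct chain {
    unsigned int irq_context : 2,
                 depth       : 6,
                 base        : 24;
};

/* Build-time checks in the spirit of BUILD_BUG_ON(): fail the compile, not the run. */
_Static_assert((1u << 6)  > MAX_DEPTH,  "depth field too narrow");
_Static_assert((1u << 24) > MAX_HLOCKS, "base field too narrow");

/* What a 16-bit field silently does to a large index. */
static unsigned int store_in_u16(unsigned int idx)
{
    return (uint16_t)idx;
}

/* The same index survives a round trip through the 24-bit field. */
static unsigned int store_in_bitfield(unsigned int idx)
{
    struct chain c = { 0 };

    c.base = idx;
    return c.base;
}
```

The assertions document the assumption next to the field definition, so raising a limit later fails the build instead of silently truncating an index.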
2016-04-22	time: Introduce do_sys_settimeofday64()	Baolin Wang
	The do_sys_settimeofday() function uses a timespec, which is not year 2038 safe on 32bit systems. Thus this patch introduces do_sys_settimeofday64(), which allows us to transition users of do_sys_settimeofday() to using 64bit time types. Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Baolin Wang <baolin.wang@linaro.org> [jstultz: Include errno-base.h to avoid build issue on some arches] Signed-off-by: John Stultz <john.stultz@linaro.org>
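The year 2038 limit mentioned here comes from storing seconds since the epoch in a signed 32-bit integer, whose largest value corresponds to 2038-01-19 03:14:07 UTC. The sketch below uses hypothetical stand-in types (`struct timespec32`, `struct timespec64_sketch`) rather than the kernel's own definitions to show the widening conversion.

```c
#include <stdint.h>

/* Hypothetical stand-ins: a timespec with 32-bit seconds and a 64-bit replacement. */
struct timespec32 {
    int32_t tv_sec;
    int32_t tv_nsec;
};

struct timespec64_sketch {
    int64_t tv_sec;
    int64_t tv_nsec;
};

/* Widening is lossless, so existing callers can be converted one at a time. */
static struct timespec64_sketch to_ts64(struct timespec32 ts)
{
    struct timespec64_sketch out = { ts.tv_sec, ts.tv_nsec };

    return out;
}

/* One second past the largest value a signed 32-bit counter can hold. */
static int64_t first_unrepresentable_second(void)
{
    return (int64_t)INT32_MAX + 1;
}
```

A 64-bit entry point can accept that next second without overflow, which is why callers are moved to the 64-bit type step by step.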
2016-04-22	security: Introduce security_settime64()	Baolin Wang
	security_settime() uses a timespec, which is not year 2038 safe on 32bit systems. Thus this patch introduces the security_settime64() function with timespec64 type. We also convert the cap_settime() helper function to use the 64bit types. This patch then moves security_settime() to the header file as an inline helper function so that existing users can be iteratively converted. None of the existing hooks is using the timespec argument and therefore the patch is not making any functional changes. Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: James Morris <james.l.morris@oracle.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Paul Moore <pmoore@redhat.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Kees Cook <keescook@chromium.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Reviewed-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Baolin Wang <baolin.wang@linaro.org> [jstultz: Reworded commit message] Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-04-22	clocksource: Add missing include of of.h.	David Lechner
	This header uses OF_DECLARE_1 which is defined in linux/of.h. This fixes getting unhelpful compiler error messages about missing ')' before a string constant. Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: David Lechner <david@lechnology.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-04-22	i2c: mux: drop old unused i2c-mux api	Peter Rosin
	All i2c mux users are using an explicit i2c mux core, drop support for implicit i2c mux cores. Signed-off-by: Peter Rosin <peda@axentia.se> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2016-04-22	i2c: mux: add common data for every i2c-mux instance	Peter Rosin
	All i2c-muxes have a parent adapter and one or many child adapters. A mux also has some means of selection. Previously, this was stored per child adapter, but it is only needed to keep track of this per mux. Add an i2c mux core, that keeps track of this consistently. Also add some glue for users of the old interface, which will create one implicit mux core per child adapter. Signed-off-by: Peter Rosin <peda@axentia.se> Tested-by: Antti Palosaari <crope@iki.fi> Tested-by: Crestez Dan Leonard <leonard.crestez@intel.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2016-04-22	mm: replace open coded page to virt conversion with page_to_virt()	Ard Biesheuvel
	The open coded conversion from struct page address to virtual address in lowmem_page_address() involves an intermediate conversion step to pfn number/physical address. Since the placement of the struct page array relative to the linear mapping may be completely independent from the placement of physical RAM (as is the case for arm64 after commit dfd55ad85e 'arm64: vmemmap: use virtual projection of linear region'), the conversion to physical address and back again should factor out of the equation, but unfortunately, the shifting and pointer arithmetic involved prevent this from happening, and the resulting calculation essentially subtracts the address of the start of physical memory and adds it back again, in a way that prevents the compiler from optimizing it away. Since the start of physical memory is not a build time constant on arm64, the resulting conversion involves an unnecessary memory access, which we would like to get rid of. So replace the open coded conversion with a call to page_to_virt(), and use the open coded conversion as its default definition, to be overridden by the architecture, if desired. The existing arch specific definitions of page_to_virt are all equivalent to this default definition, so by itself this patch is a no-op. Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
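The "default definition, overridable by the architecture" arrangement can be illustrated with a toy model. Everything below is an assumption made for the sketch: 4 KiB pages, a flat `mem_map` array, made-up base addresses, and simplified macros that only mimic the shape of the kernel's helpers.

```c
#include <stdint.h>

/* Toy model (assumptions): 4 KiB pages, a flat page array, made-up base addresses. */
#define PAGE_SHIFT   12
#define PHYS_OFFSET  0x80000000ull   /* start of RAM; a constant only in this sketch */
#define PAGE_OFFSET  0xc0000000ull   /* start of the linear mapping */

struct page {
    int unused;
};

static struct page mem_map[16];

#define page_to_pfn(p)  ((uint64_t)((p) - mem_map) + (PHYS_OFFSET >> PAGE_SHIFT))
#define phys_to_virt(x) ((x) - PHYS_OFFSET + PAGE_OFFSET)

/*
 * Generic fallback: page -> pfn -> physical -> virtual. An architecture may
 * define page_to_virt() before this point to skip the physical detour.
 */
#ifndef page_to_virt
#define page_to_virt(p) phys_to_virt(page_to_pfn(p) << PAGE_SHIFT)
#endif

static uint64_t lowmem_page_address_sketch(const struct page *p)
{
    return page_to_virt(p);
}
```

The caller only ever names page_to_virt(), so an architecture that can compute the address directly from the page index changes one macro and nothing else.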
2016-04-22	x86, drivers/pnpbios: Replace paravirt_enabled() check with legacy device check	Luis R. Rodriguez
	Since we are removing paravirt_enabled(), replace it with a logical equivalent. Even though PNPBIOS is x86 specific we add an arch-specific type call, which can be implemented by any architecture to show how other legacy attribute devices can later be also checked for with other ACPI legacy attribute flags. This implicates the first ACPI 5.2.9.3 IA-PC Boot Architecture ACPI_FADT_LEGACY_DEVICES flag device, and shows how to add more. The reason pnpbios gets a defined structure and as such uses a different approach than the RTC legacy quirk is that ACPI has a respective RTC flag, while pnpbios does not. We fold the pnpbios quirk under ACPI_FADT_LEGACY_DEVICES ACPI flag use case, and use a struct of possible devices to enable future extensions of this. As per 0-day, this bumps the vmlinux size using i386-tinyconfig as follows:
	TOTAL  TEXT  init.text  x86_early_init_platform_quirks()
	+32    +28   +28        +28
	That's 4 byte overhead total, the rest is cleared out on init as it's all __init text. v2: split out subarch handling on switch to make it easier later to add other subarchs. The 'fall-through' switch handling can be confusing and we'll remove it later when we add handling for X86_SUBARCH_CE4100. v3: document vmlinux size impact as per 0-day, and also explain why pnpbios is treated differently than the RTC legacy feature. Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: andrew.cooper3@citrix.com Cc: andriy.shevchenko@linux.intel.com Cc: bigeasy@linutronix.de Cc: boris.ostrovsky@oracle.com Cc: david.vrabel@citrix.com Cc: ffainelli@freebox.fr Cc: george.dunlap@citrix.com Cc: glin@suse.com Cc: jgross@suse.com Cc: jlee@suse.com Cc: josh@joshtriplett.org Cc: julien.grall@linaro.org Cc: konrad.wilk@oracle.com Cc: kozerkov@parallels.com Cc: lenb@kernel.org Cc: lguest@lists.ozlabs.org Cc: linux-acpi@vger.kernel.org Cc: lv.zheng@intel.com Cc: matt@codeblueprint.co.uk Cc: mbizon@freebox.fr Cc: rjw@rjwysocki.net Cc: robert.moore@intel.com Cc: rusty@rustcorp.com.au Cc: tiwai@suse.de Cc: toshi.kani@hp.com Cc: xen-devel@lists.xensource.com Link: http://lkml.kernel.org/r/1460592286-300-12-git-send-email-mcgrof@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-22	soc: renesas: rcar-sysc: Make rcar_sysc_power_is_off() static	Geert Uytterhoeven
	As of commit b12ff41658171f53 ("ARM: shmobile: r8a7779: Remove legacy PM Domain remainings"), rcar_sysc_power_is_off() is no longer used from SoC-specific code. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
2016-04-22	soc: renesas: Move pm-rcar to drivers/soc/renesas/rcar-sysc	Geert Uytterhoeven
	Move the pm-rcar driver from arch/arm/mach-shmobile/ to drivers/soc/renesas/, and its header file to include/linux/soc/renesas/, so it can be shared between arm32 (R-Car H1 and Gen2) and arm64 (R-Car Gen3). Rename it to rcar-sysc as it's really a driver for the R-Car System Controller (SYSC). Kill the intermediate PM_RCAR config symbol, as it's not user configurable anymore, and to prepare for SoC-specific make rules. Add the missing #include <linux/types.h> to rcar-sysc.h, which was exposed by different include order. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
2016-04-22	locking/rwsem: Provide down_write_killable()	Michal Hocko
	Now that all the architectures implement the necessary glue code we can introduce down_write_killable(). The only difference wrt. regular down_write() is that the slow path waits in TASK_KILLABLE state and the interruption by the fatal signal is reported as -EINTR to the caller. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Zankel <chris@zankel.net> Cc: David S. Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Signed-off-by: Jason Low <jason.low2@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: linux-xtensa@linux-xtensa.org Cc: sparclinux@vger.kernel.org Link: http://lkml.kernel.org/r/1460041951-22347-12-git-send-email-mhocko@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-22	PM / Domains: Remove ->save|restore_state() callbacks	Ulf Hansson
	As a part of the ongoing consolidation of genpd, it's become questionable whether clients actually need to be able to assign their own set of ->save|restore_state() callbacks. Currently all users cope fine with the default callbacks, so let's remove the configuration option and stick to the default ones. This enables further clarifications of the related code. Let's also rename pm_genpd_default_save|restore_state() into __genpd_runtime_suspend|resume() to apply the rule of static function names in genpd. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Kevin Hilman <khilman@baylibre.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-04-22	PM / Domains: Rename stop_ok to suspend_ok for the genpd governor	Ulf Hansson
	The genpd governor validates the latency constraints to find out whether it's acceptable to runtime suspend a device. Earlier this validation was made to know whether it was okay to invoke the ->stop() callback for the device, hence the governor used the name "stop_ok" for the related variables. To clarify the code around this, let's rename these variables from "stop_ok" to "suspend_ok". Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Kevin Hilman <khilman@baylibre.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-04-22	PM / Runtime: Move ignore_children flag under CONFIG_PM	Ulf Hansson
	The ignore_children flag is used only when CONFIG_PM is set, so let's move it into that section within the struct dev_pm_info. Also move the corresponding pm_suspend_ignore_children() API out of device.h into pm_runtime.h, to be consistent with similar APIs. Unfortunately, this causes the Toshiba PCI SD mmc host driver to fail to compile as it needs pm_runtime.h, so let's fix this here as well. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-04-21	Merge branch 'clk-hw-register' (early part) into clk-next	Stephen Boyd
	* 'clk-hw-register' (early part): clk: fixed-rate: Add hw based registration APIs clk: gpio: Add hw based registration APIs clk: composite: Add hw based registration APIs clk: fractional-divider: Add hw based registration APIs clk: fixed-factor: Add hw based registration APIs clk: mux: Add hw based registration APIs clk: gate: Add hw based registration APIs clk: divider: Add hw based registration APIs clkdev: Add clk_hw based registration APIs clk: Add clk_hw OF clk providers clk: Add {devm_}clk_hw_{register,unregister}() APIs clkdev: Remove clk_register_clkdevs()
2016-04-21	Merge branch 'clk-composite-unregister' into clk-next	Stephen Boyd
	* clk-composite-unregister: clk: composite: Add unregister function
2016-04-21	clk: composite: Add unregister function	Maxime Ripard
	The composite clock didn't have any unregistration function, which forced us to use clk_unregister directly on it. While it was already not great from an API point of view, it also meant that we were leaking the clk_composite structure allocated in clk_register_composite. Add a clk_unregister_composite function to fix this. Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
2016-04-21	Merge branches 'doc.2016.04.19a', 'exp.2016.03.31d', 'fixes.2016.03.31d' and ↵	Paul E. McKenney
	'torture.2016.04.21a' into HEAD doc.2016.04.19a: Documentation updates exp.2016.03.31d: Expedited grace-period updates fixes.2016.03.31d: Miscellaneous fixes torture.2016.04.21a: Torture-test updates
2016-04-21	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	Linus Torvalds
	Pull networking fixes from David Miller: 1) Fix memory leak in iwlwifi, from Matti Gottlieb. 2) Add missing registration of netfilter arp_tables into initial namespace, from Florian Westphal. 3) Fix potential NULL deref in DecNET routing code. 4) Restrict NETLINK_URELEASE to truly bound sockets only, from Dmitry Ivanov. 5) Fix dst ref counting in VRF, from David Ahern. 6) Fix TSO segmenting limits in i40e driver, from Alexander Duyck. 7) Fix heap leak in PACKET_DIAG_MCLIST, from Mathias Krause. 8) Revalidate IPV6 datagram socket cached routes properly, particularly with UDP, from Martin KaFai Lau. 9) Fix endian bug in RDS dp_ack_seq handling, from Qing Huang. 10) Fix stats typing in bcmgenet driver, from Eric Dumazet. 11) Openvswitch needs to orphan SKBs before ipv6 fragmentation handling, from Joe Stringer. 12) SPI device reference leak in spi_ks8895 PHY driver, from Mark Brown. 13) atl2 doesn't actually support scatter-gather, so don't advertise the feature. From Ben Hutchings. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits) openvswitch: use flow protocol when recalculating ipv6 checksums Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets atl2: Disable unimplemented scatter/gather feature net/mlx4_en: Split SW RX dropped counter per RX ring net/mlx4_core: Don't allow to VF change global pause settings net/mlx4_core: Avoid repeated calls to pci enable/disable net/mlx4_core: Implement pci_resume callback net: phy: spi_ks8895: Don't leak references to SPI devices net: ethernet: davinci_emac: Fix platform_data overwrite net: ethernet: davinci_emac: Fix Unbalanced pm_runtime_enable qede: Fix single MTU sized packet from firmware GRO flow qede: Fix setting Skb network header qede: Fix various memory allocation error flows for fastpath tcp: Merge tx_flags and tskey in tcp_shifted_skb tcp: Merge tx_flags and tskey in tcp_collapse_retrans drivers: net: cpsw: fix wrong regs access in cpsw_ndo_open tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks openvswitch: Orphan skbs before IPv6 defrag Revert "Prevent NUll pointer dereference with two PHYs on cpsw" VSOCK: Only check error on skb_recv_datagram when skb is NULL ...
2016-04-21	geneve: break dependency with netdev drivers	Hannes Frederic Sowa
	Equivalent to "vxlan: break dependency with netdev drivers", don't autoload geneve module in case the driver is loaded. Instead make the coupling weaker by using netdevice notifiers as proxy. Cc: Jesse Gross <jesse@kernel.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21	vxlan: break dependency with netdev drivers	Hannes Frederic Sowa
	Currently all drivers depend on and autoload the vxlan module because of how vxlan_get_rx_port is linked into them. Remove this dependency: by using a new event type in the netdevice notifier call chain, we proxy the request from the drivers to flush and set up the vxlan ports again, not directly via a function call but via the already existing netdevice notifier call chain. I added a separate new event type, NETDEV_OFFLOAD_PUSH_VXLAN, to do so. We don't need to save those ids, as the event type field is an unsigned long and using specialized event types for this purpose seemed to be a more elegant way. This will also be beneficial if we want to add offloading knobs for vxlan in the future. Cc: Jesse Gross <jesse@kernel.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21	net/mlx5e: Support RX multi-packet WQE (Striding RQ)	Tariq Toukan
	Introduce the feature of multi-packet WQE (RX Work Queue Element), referred to as MPWQE or Striding RQ, in which WQEs are larger and serve multiple packets each. Every WQE consists of many strides of the same size; every received packet is aligned to the beginning of a stride and is written to consecutive strides within a WQE. In the regular approach, each regular WQE is big enough to serve one received packet of any size up to MTU, or 64K when device LRO is enabled, making it very wasteful when dealing with small packets or when device LRO is enabled. Thanks to its flexibility, MPWQE allows better memory utilization (implying improvements in CPU utilization and packet rate), as packets consume strides according to their size, leaving the rest of the WQE available for other packets. MPWQE default configuration: Num of WQEs = 16, Strides Per WQE = 2048, Stride Size = 64 bytes. The default WQEs memory footprint went from 1024 * mtu (~1.5MB) to 16 * 2048 * 64 = 2MB per ring. However, HW LRO can now be supported at no additional cost in memory footprint, and hence we turn it on by default and get an even better performance. Performance tested on ConnectX4-Lx 50G. To isolate the feature under test, the numbers below were measured with HW LRO turned off. We verified that the performance just improves when LRO is turned back on. * Netperf single TCP stream: - BW raised by 10-15% for representative packet sizes: default, 64B, 1024B, 1478B, 65536B. * Netperf multi TCP stream: - No degradation, line rate reached. * Pktgen: packet rate raised by 2-10% for traffic of different message sizes: 64B, 128B, 256B, 1024B, and 1500B. 
	* Pktgen: packet loss in bursts of small messages (64byte), single stream:
	| num packets | packets loss before | packets loss after
	| 2K          | ~ 1K                | 0
	| 8K          | ~ 6K                | 0
	| 16K         | ~13K                | 0
	| 32K         | ~28K                | 0
	| 64K         | ~57K                | ~24K
	This is expected, as the driver can receive as many small packets (<=64B) as the total number of strides in the ring (default = 2048 * 16), vs. 1024 (the default ring size, regardless of packet size) before this feature. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
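The default footprint and stride counts quoted in this commit message can be checked with a little arithmetic. The sketch below is only that: the helper names are hypothetical and not taken from the mlx5e driver, and it assumes a packet occupies exactly ceil(len / stride size) strides, ignoring any headroom or alignment the hardware may add.

```c
#include <assert.h>
#include <stddef.h>

/* Defaults quoted in the commit message. */
enum {
	MPWQE_NUM_WQES        = 16,
	MPWQE_STRIDES_PER_WQE = 2048,
	MPWQE_STRIDE_SIZE     = 64,     /* bytes */
};

/* Memory footprint of one ring, in bytes. */
static size_t mpwqe_ring_bytes(void)
{
	return (size_t)MPWQE_NUM_WQES * MPWQE_STRIDES_PER_WQE * MPWQE_STRIDE_SIZE;
}

/* Total strides in one ring: an upper bound on how many packets of at
 * most one stride each the ring can hold. */
static size_t mpwqe_total_strides(void)
{
	return (size_t)MPWQE_NUM_WQES * MPWQE_STRIDES_PER_WQE;
}

/* Strides consumed by one packet, assuming it starts on a stride
 * boundary and fills consecutive strides (simplified model). */
static size_t mpwqe_strides_for_packet(size_t len)
{
	assert(len > 0);
	return (len + MPWQE_STRIDE_SIZE - 1) / MPWQE_STRIDE_SIZE;
}
```

Under these assumptions a ring holds 32768 strides in 2 MiB, so a burst of 64-byte packets can fill up to 32768 slots, compared with the 1024-entry ring mentioned for the regular approach.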
2016-04-21	net/mlx5: Introduce device queue counters	Tariq Toukan
	A queue counter can collect several statistics for one or more hardware queues (QPs, RQs, etc.) that the counter is attached to. For Ethernet it will provide an "out of buffer" counter which collects the number of all packets that are dropped due to lack of software buffers. Here we add device commands to alloc/query/dealloc queue counters. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Rana Shahout <ranas@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21	net/mlx4_core: Avoid repeated calls to pci enable/disable	Daniel Jurgens
	Maintain the PCI status and provide wrappers for enabling and disabling the PCI device. Performing the actions more than once without doing its opposite results in warning logs. This occurred when EEH hotplugged the device causing a warning for disabling an already disabled device. Fixes: 2ba5fbd62b25 ('net/mlx4_core: Handle AER flow properly') Signed-off-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21	netdev_features: Fold NETIF_F_ALL_TSO into NETIF_F_GSO_SOFTWARE	Alexander Duyck
	This patch folds NETIF_F_ALL_TSO into the bitmask for NETIF_F_GSO_SOFTWARE. The idea is to avoid duplication of defines since the only difference between the two was the GSO_UDP bit. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
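The folding described above is ordinary bitmask composition. In the sketch below, the flag names, bit positions, and the set of TSO bits are all assumptions chosen for illustration; they are not the kernel's actual `NETIF_F_*` values. It shows one mask being defined in terms of the other plus the single bit that differs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define F_TSO      (1u << 0)
#define F_TSO_ECN  (1u << 1)
#define F_TSO6     (1u << 2)
#define F_GSO_UDP  (1u << 3)   /* stands in for the GSO_UDP bit */

#define F_ALL_TSO  (F_TSO | F_TSO_ECN | F_TSO6)

/* Before: the software GSO mask repeats every TSO bit. */
#define F_GSO_SOFTWARE_OLD  (F_TSO | F_TSO_ECN | F_TSO6 | F_GSO_UDP)

/* After: reuse F_ALL_TSO and add only the bit that differs. */
#define F_GSO_SOFTWARE_NEW  (F_ALL_TSO | F_GSO_UDP)

static uint32_t gso_software_old(void) { return F_GSO_SOFTWARE_OLD; }
static uint32_t gso_software_new(void) { return F_GSO_SOFTWARE_NEW; }

/* Bits present in the software GSO mask but not in the TSO mask. */
static uint32_t gso_only_bits(void)
{
	return gso_software_new() & ~(uint32_t)F_ALL_TSO;
}
```

With the mask written this way, a TSO bit added to `F_ALL_TSO` later is picked up by the software GSO mask without a second edit.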
2016-04-21	perf, bpf: minimize the size of perf_trace_*() tracepoint handler	Alexei Starovoitov
	Move trace_call_bpf() into a helper function to minimize the size of perf_trace_*() tracepoint handlers.
	text      data     bss      dec       hex      filename
	10541679  5526646  2945024  19013349  1221ee5  vmlinux_before
	10509422  5526646  2945024  18981092  121a0e4  vmlinux_after
	It may seem that perf_fetch_caller_regs() can also be moved, but that is incorrect, since ip/sp will be wrong. bpf+tracepoint performance is not affected, since perf_swevent_put_recursion_context() is now inlined. export_symbol_gpl can also be dropped. No measurable change in normal perf tracepoints. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20	thermal: consistently use int for trip temp	Wei Ni
	Commit 17e8351a7739 consistently uses int for temperature; however, it missed a few places in the trip temperature and thermal_core code. In the current code, trip->temperature uses "unsigned long" and zone->temperature uses "int"; if the temperature is negative, comparing it with the trip temperature gives the wrong result. This patch fixes that. Signed-off-by: Wei Ni <wni@nvidia.com> Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
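The wrong result mentioned above follows from C's usual arithmetic conversions: when an int is compared with an unsigned long, the int is converted to unsigned long first, so a negative temperature becomes a very large positive value. A minimal sketch, with hypothetical helper names and the conversion written out as an explicit cast (this is not the thermal core's code):

```c
#include <assert.h>
#include <stdbool.h>

/* Models the pre-patch mix of types: the trip temperature is unsigned
 * long. The cast spells out what C does implicitly when an int is
 * compared with an unsigned long. */
static bool trip_crossed_mixed(int zone_temp, unsigned long trip_temp)
{
	return (unsigned long)zone_temp >= trip_temp;
}

/* Models the post-patch situation: both sides are int, so negative
 * temperatures compare as expected. */
static bool trip_crossed_int(int zone_temp, int trip_temp)
{
	return zone_temp >= trip_temp;
}
```

For a zone at -5000 (millidegrees, say) and a trip point at 40000, the mixed-type comparison reports the trip as crossed, while the int comparison does not.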
2016-04-21	LSM: LoadPin for kernel file loading restrictions	Kees Cook
	This LSM enforces that kernel-loaded files (modules, firmware, etc) must all come from the same filesystem, with the expectation that such a filesystem is backed by a read-only device such as dm-verity or CDROM. This allows systems that have a verified and/or unchangeable filesystem to enforce module and firmware loading restrictions without needing to sign the files individually. Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: James Morris <james.l.morris@oracle.com>