Age    Commit message    Author
2016-04-28  rbd: fix rbd map vs notify races  (Ilya Dryomov)
A while ago, commit 9875201e1049 ("rbd: fix use-after free of rbd_dev->disk") fixed rbd unmap vs notify race by introducing an exported wrapper for flushing notifies and sticking it into do_rbd_remove(). A similar problem exists on the rbd map path, though: the watch is registered in rbd_dev_image_probe(), while the disk is set up quite a few steps later, in rbd_dev_device_setup(). Nothing prevents a notify from coming in and crashing on a NULL rbd_dev->disk:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
    Call Trace:
     [<ffffffffa0508344>] rbd_watch_cb+0x34/0x180 [rbd]
     [<ffffffffa04bd290>] do_event_work+0x40/0xb0 [libceph]
     [<ffffffff8109d5db>] process_one_work+0x17b/0x470
     [<ffffffff8109e3ab>] worker_thread+0x11b/0x400
     [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
     [<ffffffff810a5acf>] kthread+0xcf/0xe0
     [<ffffffff810b41b3>] ? finish_task_switch+0x53/0x170
     [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
     [<ffffffff81645dd8>] ret_from_fork+0x58/0x90
     [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
    RIP [<ffffffffa050828a>] rbd_dev_refresh+0xfa/0x180 [rbd]

If an error occurs during rbd map, we have to error out, potentially tearing down a watch. Just like on rbd unmap, notifies have to be flushed, otherwise rbd_watch_cb() may end up trying to read in the image header after rbd_dev_image_release() has run:

    Assertion failure in rbd_dev_header_info() at line 4722:
      rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
    Call Trace:
     [<ffffffff81cccee0>] ? rbd_parent_request_create+0x150/0x150
     [<ffffffff81cd4e59>] rbd_dev_refresh+0x59/0x390
     [<ffffffff81cd5229>] rbd_watch_cb+0x69/0x290
     [<ffffffff81fde9bf>] do_event_work+0x10f/0x1c0
     [<ffffffff81107799>] process_one_work+0x689/0x1a80
     [<ffffffff811076f7>] ? process_one_work+0x5e7/0x1a80
     [<ffffffff81132065>] ? finish_task_switch+0x225/0x640
     [<ffffffff81107110>] ? pwq_dec_nr_in_flight+0x2b0/0x2b0
     [<ffffffff81108c69>] worker_thread+0xd9/0x1320
     [<ffffffff81108b90>] ? process_one_work+0x1a80/0x1a80
     [<ffffffff8111b02d>] kthread+0x21d/0x2e0
     [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550
     [<ffffffff82022802>] ret_from_fork+0x22/0x40
     [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550
    RIP [<ffffffff81ccd8f9>] rbd_dev_header_info+0xa19/0x1e30

To fix this, a) check if RBD_DEV_FLAG_EXISTS is set before calling revalidate_disk(), b) move ceph_osdc_flush_notifies() call into rbd_dev_header_unwatch_sync() to cover rbd map error paths and c) turn header read-in into a critical section. The latter also happens to take care of rbd map foo@bar vs rbd snap rm foo@bar race.

Fixes: http://tracker.ceph.com/issues/15490
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
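A minimal sketch of fix (a) above, assuming the flag test lives in the size-update path; the body is illustrative only, not the actual patch:

    static void rbd_dev_update_size(struct rbd_device *rbd_dev)
    {
            /*
             * A notify can arrive between watch registration in
             * rbd_dev_image_probe() and disk setup in rbd_dev_device_setup();
             * only poke the block layer once the disk actually exists.
             */
            if (test_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags))
                    revalidate_disk(rbd_dev->disk);
    }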
2016-04-28  x86/apic: Handle zero vector gracefully in clear_vector_irq()  (Keith Busch)
If x86_vector_alloc_irq() fails, x86_vector_free_irqs() is invoked to clean up the already allocated vectors. This subsequently calls clear_vector_irq(). The failed irq has no vector assigned, which triggers the BUG_ON(!vector) in clear_vector_irq(). We cannot suppress the call to x86_vector_free_irqs() for the failed interrupt, because the other data related to this irq must be cleaned up as well. So calling clear_vector_irq() with vector == 0 is legitimate. Remove the BUG_ON and return if the vector is zero. [ tglx: Massaged changelog ] Fixes: b5dc8e6c21e7 "x86/irq: Use hierarchical irqdomain to manage CPU interrupt vectors" Signed-off-by: Keith Busch <keith.busch@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
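A minimal sketch of the described change, with the signature simplified so the vector is passed directly; the surrounding teardown code is elided:

    static void clear_vector_irq(int irq, int vector)
    {
            /*
             * An irq whose allocation failed never got a vector, yet it still
             * goes through this cleanup path so its other state can be
             * released.  Returning here replaces the old BUG_ON(!vector).
             */
            if (!vector)
                    return;

            /* ... the usual per-cpu vector teardown continues here ... */
    }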
2016-04-27  Merge branch 'socket-space-optimizations'  (David S. Miller)
Eric Dumazet says:

====================
net: avoid some atomic ops when FASYNC is not used

We can avoid some atomic operations on sockets not using FASYNC.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: SOCKWQ_ASYNC_WAITDATA optimizations  (Eric Dumazet)
SOCKWQ_ASYNC_WAITDATA is set/cleared in sk_wait_data() and equivalent functions, so that sock_wake_async() can send a SIGIO only when necessary. Since these atomic operations are really not needed unless the socket expressed interest in FASYNC, we can omit them in most cases. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: SOCKWQ_ASYNC_NOSPACE optimizations  (Eric Dumazet)
SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async() so that a SIGIO signal is sent when needed. tcp_sendmsg() clears the bit. tcp_poll() sets the bit when the stream is not writeable. We can avoid two atomic operations by first checking if the socket is actually interested in the FASYNC business (most sockets in real applications do not use AIO, but select()/poll()/epoll()). This also removes one cache line miss to access sk->sk_wq->flags in tcp_sendmsg(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
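A rough sketch of the idea behind both SOCKWQ patches; the helper below is hypothetical and the flag/field access details of the real code differ:

    /* Hypothetical helper illustrating the optimization. */
    static inline void sk_clear_async_nospace(struct sock *sk)
    {
            /* Most sockets never request FASYNC, so skip the atomic op. */
            if (!sock_flag(sk, SOCK_FASYNC))
                    return;

            clear_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->wq->flags);
    }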
2016-04-27  Merge branch 'snmp-stats-update'  (David S. Miller)
Eric Dumazet says:

====================
net: snmp: update SNMP methods

In the old days (before linux-3.0), SNMP counters were duplicated, one set for user context and another one for BH context. After commit 8f0ea0fe3a03 ("snmp: reduce percpu needs by 50%") we have a single copy, and what really matters is preemption being enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc() respectively.

This patch series kills the obsolete STATS_USER() helpers, and renames all XXX_BH() helpers to __XXX() ones, to more closely match the conventions used to update per cpu variables. This is probably going to hurt maintainers' job for a while, since cherry-picks will not be clean, but this had to be cleaned up at one point. I am so sorry guys.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
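As a reference for the convention described above, a sketch of the two macro flavors (close to, but not verbatim, the definitions in include/net/snmp.h):

    /* usable in preemptible context */
    #define SNMP_INC_STATS(mib, field)      this_cpu_inc(mib->mibs[field])

    /* caller guarantees preemption is disabled (formerly the _BH flavor) */
    #define __SNMP_INC_STATS(mib, field)    __this_cpu_inc(mib->mibs[field])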
2016-04-27  net: snmp: kill STATS_BH macros  (Eric Dumazet)
There is nothing related to BH in SNMP counters anymore, since linux-3.0. Rename helpers to use the __ prefix instead of the _BH prefix, for contexts where preemption is disabled. This more closely matches the convention used to update percpu variables. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  ipv6: kill ICMP6MSGIN_INC_STATS_BH()  (Eric Dumazet)
IPv6 ICMP stats are atomics anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  ipv6: rename IP6_UPD_PO_STATS_BH()  (Eric Dumazet)
Rename IP6_UPD_PO_STATS_BH() to __IP6_UPD_PO_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  ipv6: rename IP6_INC_STATS_BH()  (Eric Dumazet)
Rename IP6_INC_STATS_BH() to __IP6_INC_STATS() and IP6_ADD_STATS_BH() to __IP6_ADD_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: rename NET_{ADD|INC}_STATS_BH()  (Eric Dumazet)
Rename NET_INC_STATS_BH() to __NET_INC_STATS() and NET_ADD_STATS_BH() to __NET_ADD_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: rename IP_UPD_PO_STATS_BH()  (Eric Dumazet)
Rename IP_UPD_PO_STATS_BH() to __IP_UPD_PO_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: rename IP_ADD_STATS_BH()  (Eric Dumazet)
Rename IP_ADD_STATS_BH() to __IP_ADD_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: rename ICMP6_INC_STATS_BH()  (Eric Dumazet)
Rename ICMP6_INC_STATS_BH() to __ICMP6_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: rename IP_INC_STATS_BH()  (Eric Dumazet)
Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to better express that this is used in a non-preemptible context. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: sctp: rename SCTP_INC_STATS_BH()  (Eric Dumazet)
Rename SCTP_INC_STATS_BH() to __SCTP_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: icmp: rename ICMPMSGIN_INC_STATS_BH()  (Eric Dumazet)
Remove misleading _BH suffix. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: tcp: rename TCP_INC_STATS_BH  (Eric Dumazet)
Rename TCP_INC_STATS_BH() to __TCP_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: xfrm: kill XFRM_INC_STATS_BH()  (Eric Dumazet)
Not used anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: udp: rename UDP_INC_STATS_BH()  (Eric Dumazet)
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(), and UDP6_INC_STATS_BH() to __UDP6_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: rename ICMP_INC_STATS_BH()  (Eric Dumazet)
Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  dccp: rename DCCP_INC_STATS_BH()  (Eric Dumazet)
Rename DCCP_INC_STATS_BH() to __DCCP_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net: snmp: kill various STATS_USER() helpers  (Eric Dumazet)
In the old days (before linux-3.0), SNMP counters were duplicated, one for user context, and one for BH context. After commit 8f0ea0fe3a03 ("snmp: reduce percpu needs by 50%") we have a single copy, and what really matters is preemption being enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc() respectively. We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(), NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(), SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(), UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER(). Following patches will rename the _BH helpers to make clear their usage is not tied to BH being disabled. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue  (David S. Miller)
Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2016-04-27

This series contains updates to i40e and i40evf.

Alex Duyck cleans up the feature flags since they are becoming pretty "massive", the primary change being that we now build our features list around hw_encap_features. Added support for IPIP and SIT offloads, which should improve throughput for IPIP and SIT tunnels with the offload enabled.

Mitch adds support for configuring RSS on behalf of the VFs, which removes the burden of dealing with different hardware interfaces from the VF drivers and improves future compatibility. Fix to ensure that we do not panic by checking that the vsi_res pointer is valid before dereferencing it, after which we can drink beer and eat peanuts.

Shannon does some housekeeping in i40e_add_fdir_ethtool() in preparation for more cloud filter work. Added flexibility to the nvmupdate facility by adding the ability to specify an AQ event opcode to wait on after an Exec_AQ request.

Michal adds a device capability which defines if an update is available and if a security check is needed during the update process.

Kamil just adds a device id to support the X722 QSFP+ device.

Greg fixes an issue where a mirror rule ID may be zero, so do not return invalid parameter when the user passes in a zero for a rule ID. Adds support to steer packets to VSIs by VLAN tag alone while being in promiscuous mode for multicast and unicast MAC addresses.

Jesse changes the driver from offloading the VLAN tag into the skb any time there was a VLAN tag and hardware stripping was enabled, to making sure stripping is enabled before put_tag.

v2: Dropped patch 8 ("i40e: Allow user to change input set mask for flow director") while Kiran reworks a more generalized solution based on feedback from David Miller.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net-rfs: fix false sharing accessing sd->input_queue_head  (Eric Dumazet)
sd->input_queue_head is incremented for each processed packet in process_backlog(), and read by other cpus performing Out Of Order avoidance in get_rps_cpu(). Moving this field into a separate cache line keeps it mostly hot for the cpu running process_backlog(), as other cpus will only read it. In a stress test, process_backlog() was consuming 6.80 % of cpu cycles, and the patch reduced the cost to 0.65 %. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
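A minimal sketch of the layout idea, under the assumption that a dedicated cache line suffices; the real struct softnet_data has many more fields:

    struct softnet_data_sketch {
            /* fields the owning cpu reads and writes on every packet ... */

            /*
             * Written only by the local cpu in process_backlog(), read by
             * remote cpus in get_rps_cpu(); isolating it on its own cache
             * line avoids false sharing with the hot fields above.
             */
            unsigned int    input_queue_head ____cacheline_aligned_in_smp;
    };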
2016-04-27  net: w5100: support W5500  (Akinobu Mita)
This adds support for the W5500 chip. W5500 has a similar register and memory organization to W5100 and W5200. There are a few important differences, listed below, but it is still possible to share common code with W5100 and W5200.

* W5500 register and memory are organized in multiple blocks. Each one is selected by a 16-bit offset address and 5-bit block select bits. But the existing register access operations take a u16 address. This change extends the address to u32 and puts the offset address in the lower 16 bits and the block select bits in the upper 16 bits. This change also adds the offset addresses for the socket register and TX/RX memory blocks to the driver private data structure in order to reduce conditional switches for each chip.

* W5500 has a different register offset for the socket interrupt mask register. Newly added internal functions w5100_enable_intr() and w5100_disable_intr() take care of the difference.

* W5500 has a different register offset for the retry time-value register. But this register is only used to verify that the reset value is correctly read at initialization. So move the verification to w5100_hw_reset(), which already does different things for different chips.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mike Sinkovsky <msink@permonline.ru>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
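A small sketch of the address encoding described in the first bullet; the macro and helper names below are invented for illustration:

    #define W5500_BSB_SHIFT         16      /* block select bits sit above the offset */

    static inline u32 w5500_mem_addr(u32 block_select, u16 offset)
    {
            /* lower 16 bits: in-block offset, upper bits: block select */
            return (block_select << W5500_BSB_SHIFT) | offset;
    }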
2016-04-28  crypto: s5p-sss - fix incorrect usage of scatterlists api  (Marek Szyprowski)
The sg_dma_len() macro can be used only on scatterlists which are mapped, so all calls to it before dma_map_sg() are invalid. Replace them with a direct read of the sg segment length. Fixes: a49e490c7a8a ("crypto: s5p-sss - add S5PV210 advanced crypto engine support") Fixes: 9e4a1100a445 ("crypto: s5p-sss - Handle unaligned buffers") Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Acked-by: Vladimir Zapolskiy <vz@mleia.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
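A brief sketch of the distinction, assuming an alignment check performed before dma_map_sg(); the helper name is illustrative:

    #include <linux/kernel.h>
    #include <linux/scatterlist.h>
    #include <crypto/aes.h>

    static bool s5p_sg_aligned(struct scatterlist *sg)
    {
            /*
             * Before dma_map_sg() only sg->length is valid;
             * sg_dma_len(sg) may only be used on a mapped scatterlist.
             */
            return IS_ALIGNED(sg->length, AES_BLOCK_SIZE);
    }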
2016-04-28  Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6  (Herbert Xu)
Merge the crypto tree to pull in the qat adf_init_pf_wq change.
2016-04-28  crypto: qat - fix invalid pf2vf_resp_wq logic  (Tadeusz Struk)
The pf2vf_resp_wq is a global so it has to be created at init and destroyed at exit, instead of per device. Cc: <stable@vger.kernel.org> Tested-by: Suresh Marikkannu <sureshx.marikkannu@intel.com> Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
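A rough sketch of the described lifecycle, assuming module init/exit hooks; the function names are placeholders, not the driver's actual entry points:

    static struct workqueue_struct *pf2vf_resp_wq;

    static int __init adf_example_init(void)
    {
            /* created once for the whole module, not per device */
            pf2vf_resp_wq = alloc_workqueue("qat_pf2vf_resp_wq", WQ_MEM_RECLAIM, 0);
            return pf2vf_resp_wq ? 0 : -ENOMEM;
    }

    static void __exit adf_example_exit(void)
    {
            destroy_workqueue(pf2vf_resp_wq);
    }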
2016-04-28  cpufreq: intel_pstate: Enable PPC enforcement for servers  (Srinivas Pandruvada)
For platforms which are controlled via a remote node manager, enable _PPC by default. These platforms are mostly categorized as enterprise or performance servers. These platforms need to go through certification tests, which test control via _PPC. The relative risk of enabling this by default is low, as it is less likely that these systems have a broken _PSS table. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-04-28  cpufreq: intel_pstate: Adjust policy->max  (Srinivas Pandruvada)
When policy->max is changed via _PPC or sysfs and is more than the max non turbo frequency, it does not really change the resulting performance in some processors. When policy->max results in a P-State ratio more than the turbo activation ratio, then the processor can choose any P-State up to max turbo. So the user or _PPC setting has no value, but this can cause undesirable side effects like:

- Showing a reduced max percentage in Intel P-State sysfs
- Reduced max performance under certain boundary conditions: the requested max scaling frequency, either via _PPC or via cpufreq-sysfs, will be converted into a fixed floating point max percent scale. In the majority of cases this will result in the correct max. But not 100% of the time. If the _PPC is requested at a point where the calculation leads to a lower max, this can result in a lower P-State than expected and it will impact performance.

Example of this condition using a Broadwell laptop with config TDP.

ACPI _PSS table from a Broadwell laptop:
2301000 2300000 2200000 2000000 1900000 1800000 1700000 1500000 1400000 1300000 1100000 1000000 900000 800000 600000 500000

The actual results, with config TDP disabled so that we can get what is requested at or below 2300000 KHz:

scaling_max_freq   Max Requested P-State   Resultant scaling max
----------------------------------------------------------------
2400000            18                      2900000 (max turbo)
2300000            17                      2300000 (max physical non turbo)
2200000            15                      2100000
2100000            15                      2100000
2000000            13                      1900000
1900000            13                      1900000
1800000            12                      1800000
1700000            11                      1700000
1600000            10                      1600000
1500000            f                       1500000
1400000            e                       1400000
1300000            d                       1300000
1200000            c                       1200000
1100000            a                       1000000
1000000            a                       1000000
900000             9                       900000
800000             8                       800000
700000             7                       700000
600000             6                       600000
500000             5                       500000
----------------------------------------------------------------

Now set the config TDP level 1 ratio to 0x0b (equivalent to 1100000 KHz) in BIOS (not every system will let you adjust this). The turbo activation ratio will be set to one less than that, which will be 0x0a (so any request above 1000000 KHz should result in the turbo region, assuming no thermal limits). Here _PPC will request a max of 1100000 KHz (which basically should still result in turbo, as this is more than the turbo activation ratio, up to the max allowable turbo frequency), but the actual calculation resulted in a max ceiling P-State of 0x0a. So under any load condition, the driver will not request turbo P-States. This is a huge performance hit.

When the config TDP feature is ON, if the _PPC points to a frequency above the turbo activation ratio, the performance can still reach max turbo. In this case we don't need to treat this as a reduced frequency in the set_policy callback. With this change, when config TDP is active (detected by checking whether the physical max non turbo ratio is more than the current max non turbo ratio), any request above the current max non turbo frequency is treated as full performance.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Minor cleanups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
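A condensed sketch of the final rule stated above, with simplified parameter names (this is not the driver's actual code):

    /*
     * Treat the request as full performance when config TDP is active and
     * policy->max lies above the current max non-turbo frequency.
     */
    static bool ppc_request_is_full_perf(int max_ratio_physical, int max_ratio_current,
                                         unsigned int policy_max_khz, unsigned int scaling_khz)
    {
            bool config_tdp_active = max_ratio_physical > max_ratio_current;

            return config_tdp_active &&
                   policy_max_khz >= max_ratio_current * scaling_khz;
    }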
2016-04-28  cpufreq: intel_pstate: Enforce _PPC limits  (Srinivas Pandruvada)
Use ACPI _PPC notification to limit the max P-State the driver will request. An ACPI _PPC change notification is sent by the BIOS to limit the max P-State in several cases:

- Reduce the impact of a platform thermal condition
- When the Config TDP feature is used, a changed _PPC is sent to follow the TDP change
- Remote node managers in servers want to control platform power via the baseboard management controller (BMC)

This change registers with the ACPI processor performance lib so that _PPC changes are notified to the cpufreq core, which in turn will result in a call to the .setpolicy() callback. Also, the way the _PSS table identifies a turbo frequency is not compatible with the max turbo frequency in intel_pstate, so the very first entry in _PSS needs to be adjusted.

This feature can be turned on by using the kernel parameter: intel_pstate=support_acpi_ppc

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Minor cleanups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-04-27  thermal: use %d to print S32 parameters  (Leo Yan)
Power allocator's parameters are S32 type, so use %d to print them. Acked-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
2016-04-27  thermal: hisilicon: increase temperature resolution  (Leo Yan)
When calculating the temperature, the old code first does the division and then converts to the "millicelsius" unit. This loses resolution, and the temperature can only be read back with "Celsius" granularity. So first scale the step value to "millicelsius" and then do the division, so that we increase the resolution of the temperature value. Also refine the calculation from temperature value to step value. Signed-off-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
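A tiny sketch of the arithmetic change, with a made-up step constant; the driver's real conversion also involves an offset term:

    #define STEPS_PER_DEGREE        784     /* placeholder, not the real sensor constant */

    /* before: (step / STEPS_PER_DEGREE) * 1000  -> whole degrees only        */
    /* after:  scale first, divide last          -> millicelsius resolution   */
    static long hisi_step_to_millicelsius(long step)
    {
            return step * 1000 / STEPS_PER_DEGREE;
    }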
2016-04-27  misc: mic: Fix for double fetch security bug in VOP driver  (Ashutosh Dixit)
The MIC VOP driver does two successive reads from user space to read a variable length data structure. Kernel memory corruption can result if the data structure changes between the two reads. This patch removes the chance of that happening. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=116651 Reported-by: Pengfei Wang <wpengfeinudt@gmail.com> Reviewed-by: Sudeep Dutt <sudeep.dutt@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
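A generic sketch of the single-fetch pattern such fixes rely on; the structure and names are invented and this is not the VOP driver's code. It assumes the user buffer is at least maxlen bytes long:

    struct cfg_hdr {
            u32 len;        /* length of the variable part that follows */
    };

    static int fetch_cfg_once(void __user *uptr, void *kbuf, size_t maxlen)
    {
            const struct cfg_hdr *hdr = kbuf;

            /* one bounded copy; everything below works on the kernel copy */
            if (maxlen < sizeof(*hdr) || copy_from_user(kbuf, uptr, maxlen))
                    return -EFAULT;

            /* validate the copied header, never re-read it from user space */
            if (hdr->len > maxlen - sizeof(*hdr))
                    return -EINVAL;

            return 0;
    }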
2016-04-27  ARM: SoCFPGA: Fix secondary CPU startup in thumb2 kernel  (Sascha Hauer)
The secondary CPU starts up in ARM mode. When the kernel is compiled in thumb2 mode we have to explicitly compile the secondary startup trampoline in ARM mode, otherwise the CPU will go to Nirvana. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Reported-by: Steffen Trumtrar <s.trumtrar@pengutronix.de> Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: stable@vger.kernel.org Signed-off-by: Dinh Nguyen <dinguyen@opensource.altera.com> Signed-off-by: Kevin Hilman <khilman@baylibre.com>
2016-04-27  cpufreq: powernv: Ramp-down global pstate slower than local-pstate  (Akshay Adiga)
The frequency transition latency from pmin to pmax is observed to be of few-millisecond granularity, and it usually incurs a performance penalty during sudden frequency ramp-up requests.

This patch set solves this problem by using an entity called "global pstates". The global pstate is a chip-level entity, so the global entity (voltage) is managed across the cores. The local pstate is a core-level entity, so the local entity (frequency) is managed across threads.

This patch brings down the global pstate at a slower rate than the local pstate. Hence, by holding global pstates higher than the local pstate, subsequent ramp-ups become faster.

A per-policy structure is maintained to keep track of the global and local pstate changes. The global pstate is brought down using a parabolic equation. The ramp-down time to pmin is set to ~5 seconds. To make sure that the global pstates are dropped at regular intervals, a timer is queued every 2 seconds during the ramp-down phase, which eventually brings the pstate down to the local pstate.

Iozone results show a fairly consistent performance boost. YCSB on redis shows improved max latencies in most cases.

Iozone write/rewrite tests were made with file sizes 200704 kB and 401408 kB and different record sizes. The following table shows IO operations/sec with and without the patch.

Iozone Results (in op/sec) (mean over 3 iterations)
---------------------------------------------------------------------
file size-               with        without     %
recordsize-IOtype        patch       patch       change
---------------------------------------------------------------------
200704-1-SeqWrite        1616532     1615425     0.06
200704-1-Rewrite         2423195     2303130     5.21
200704-2-SeqWrite        1628577     1602620     1.61
200704-2-Rewrite         2428264     2312154     5.02
200704-4-SeqWrite        1617605     1617182     0.02
200704-4-Rewrite         2430524     2351238     3.37
200704-8-SeqWrite        1629478     1600436     1.81
200704-8-Rewrite         2415308     2298136     5.09
200704-16-SeqWrite       1619632     1618250     0.08
200704-16-Rewrite        2396650     2352591     1.87
200704-32-SeqWrite       1632544     1598083     2.15
200704-32-Rewrite        2425119     2329743     4.09
200704-64-SeqWrite       1617812     1617235     0.03
200704-64-Rewrite        2402021     2321080     3.48
200704-128-SeqWrite      1631998     1600256     1.98
200704-128-Rewrite       2422389     2304954     5.09
200704-256-SeqWrite      1617065     1616962     0.00
200704-256-Rewrite       2432539     2301980     5.67
200704-512-SeqWrite      1632599     1598656     2.12
200704-512-Rewrite       2429270     2323676     4.54
200704-1024-SeqWrite     1618758     1616156     0.16
200704-1024-Rewrite      2431631     2315889     4.99
401408-1-SeqWrite        1631479     1608132     1.45
401408-1-Rewrite         2501550     2459409     1.71
401408-2-SeqWrite        1617095     1626069     -0.55
401408-2-Rewrite         2507557     2443621     2.61
401408-4-SeqWrite        1629601     1611869     1.10
401408-4-Rewrite         2505909     2462098     1.77
401408-8-SeqWrite        1617110     1626968     -0.60
401408-8-Rewrite         2512244     2456827     2.25
401408-16-SeqWrite       1632609     1609603     1.42
401408-16-Rewrite        2500792     2451405     2.01
401408-32-SeqWrite       1619294     1628167     -0.54
401408-32-Rewrite        2510115     2451292     2.39
401408-64-SeqWrite       1632709     1603746     1.80
401408-64-Rewrite        2506692     2433186     3.02
401408-128-SeqWrite      1619284     1627461     -0.50
401408-128-Rewrite       2518698     2453361     2.66
401408-256-SeqWrite      1634022     1610681     1.44
401408-256-Rewrite       2509987     2446328     2.60
401408-512-SeqWrite      1617524     1628016     -0.64
401408-512-Rewrite       2504409     2442899     2.51
401408-1024-SeqWrite     1629812     1611566     1.13
401408-1024-Rewrite      2507620     2442968     2.64

Tested with a YCSB workload (50% update + 50% read) over redis for 1 million records and 1 million operations. Each test was carried out with target operations per second and persistence disabled.

Max-latency (in us) (mean over 5 iterations)
---------------------------------------------------------------
op/s     Operation   with patch   without patch   %change
---------------------------------------------------------------
15000    Read        61480.6      50261.4         22.32
15000    cleanup     215.2        293.6           -26.70
15000    update      25666.2      25163.8         2.00
25000    Read        32626.2      89525.4         -63.56
25000    cleanup     292.2        263.0           11.10
25000    update      32293.4      90255.0         -64.22
35000    Read        34783.0      33119.0         5.02
35000    cleanup     321.2        395.8           -18.8
35000    update      36047.0      38747.8         -6.97
40000    Read        38562.2      42357.4         -8.96
40000    cleanup     371.8        384.6           -3.33
40000    update      27861.4      41547.8         -32.94
45000    Read        42271.0      88120.6         -52.03
45000    cleanup     263.6        383.0           -31.17
45000    update      29755.8      81359.0         -63.43
(test without target op/s)
47659    Read        83061.4      136440.6        -39.12
47659    cleanup     195.8        193.8           1.03
47659    update      73429.4      124971.8        -41.24

Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
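A simplified sketch of the ramp-down interpolation described above, assuming time is measured in milliseconds since the last ramp-up and that a numerically larger pstate means higher performance; the driver's actual equation and data structures differ:

    #define MAX_RAMP_DOWN_TIME_MS   5120    /* ~5 s to fall back to the local pstate */

    static int calc_global_pstate_sketch(unsigned int elapsed_ms,
                                         int highest_lpstate, int local_pstate)
    {
            int span = highest_lpstate - local_pstate;
            int drop;

            if (elapsed_ms >= MAX_RAMP_DOWN_TIME_MS)
                    return local_pstate;

            /* parabolic ramp: the drop grows with (elapsed / total)^2 */
            drop = (int)(elapsed_ms * elapsed_ms / MAX_RAMP_DOWN_TIME_MS) * span
                   / MAX_RAMP_DOWN_TIME_MS;

            return highest_lpstate - drop;
    }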
2016-04-27  cpufreq: powernv: Remove flag use-case of policy->driver_data  (Shilpasri G Bhat)
commit 1b0289848d5d ("cpufreq: powernv: Add sysfs attributes to show throttle stats") used policy->driver_data as a flag for one-time creation of throttle sysfs files. Instead of this, use kernfs_find_and_get() to check if the attribute already exists. This is required as policy->driver_data is used for other purposes in a later patch. Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
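A sketch of the check, assuming the attribute group appears under the policy kobject as "throttle_stats" (the name is an assumption, not taken from the commit text):

    #include <linux/cpufreq.h>
    #include <linux/kernfs.h>

    static bool throttle_attrs_created(struct cpufreq_policy *policy)
    {
            struct kernfs_node *kn;

            kn = kernfs_find_and_get(policy->kobj.sd, "throttle_stats");
            if (!kn)
                    return false;

            kernfs_put(kn);
            return true;
    }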
2016-04-27  ACPI / amba: Remove CLK_IS_ROOT  (Stephen Boyd)
This flag is a no-op now (see commit 47b0eeb3dc8a "clk: Deprecate CLK_IS_ROOT", 2016-02-02) so remove it. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Graeme Gregory <graeme.gregory@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-04-27  ACPI / APD: Remove CLK_IS_ROOT  (Stephen Boyd)
This flag is a no-op now (see commit 47b0eeb3dc8a "clk: Deprecate CLK_IS_ROOT", 2016-02-02) so remove it. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-04-27  device property: Avoid potential dereferences of invalid pointers  (Heikki Krogerus)
Since fwnode may hold ERR_PTR(-ENODEV) or it may be NULL, the fwnode type checks is_of_node(), is_acpi_node() and is_pset_node() need to take that into account. Use IS_ERR_OR_NULL() to check for it. Fixes: 0d67e0fa1664 (device property: fix for a case of use-after-free) Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> [ rjw: Subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
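A sketch of one of the adjusted type checks, based on the fwnode layout of that era (an enum type field with a FWNODE_PDATA value); treat the details as illustrative:

    static bool is_pset_node(struct fwnode_handle *fwnode)
    {
            /* fwnode may legitimately be NULL or carry ERR_PTR(-ENODEV) */
            return !IS_ERR_OR_NULL(fwnode) && fwnode->type == FWNODE_PDATA;
    }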
2016-04-27  sparc64: Fix bootup regressions on some Kconfig combinations.  (David S. Miller)
The system call tracing bug fix mentioned in the Fixes tag below increased the amount of assembler code in the sequence of assembler files included by head_64.S. This caused the total set of code to exceed 0x4000 bytes in size, which overflows the expression in head_64.S that works to place swapper_tsb at address 0x408000. When this is violated, the TSB is not properly aligned, and the trap table is not aligned properly either. All of this together results in failed boots. So, do two things:

1) Simplify some code by using ba,a instead of ba/nop to get those bytes back.

2) Add a linker script assertion to make sure that if this happens again the build will fail.

Fixes: 1a40b95374f6 ("sparc: Fix system call tracing register handling.")
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Joerg Abraham <joerg.abraham@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  cpufreq: e_powersaver: Use IS_ENABLED() instead of checking for built-in or module  (Javier Martinez Canillas)
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either built-in or as a module; use that macro instead of open coding the same. Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
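A short sketch of the idiom, using CONFIG_ACPI_PROCESSOR purely as an example symbol:

    static bool eps_acpi_enabled(void)
    {
            /* true when the symbol is =y or =m, false when it is unset;
             * replaces open-coded defined(CONFIG_FOO) || defined(CONFIG_FOO_MODULE) */
            return IS_ENABLED(CONFIG_ACPI_PROCESSOR);
    }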
2016-04-27  Merge branch 'bnxt_en-fixes'  (David S. Miller)
Michael Chan says:

====================
bnxt_en: Bug fixes for net.

Only use MSIX on VF, and fix rx page buffers on architectures with PAGE_SIZE >= 64K.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  bnxt_en: Divide a page into 32K buffers for the aggregation ring if necessary.  (Michael Chan)
If PAGE_SIZE is bigger than BNXT_RX_PAGE_SIZE, that means the native CPU page is bigger than the maximum length of the RX BD. Divide the page into multiple 32K buffers for the aggregation ring. Add an offset field in the bnxt_sw_rx_agg_bd struct to keep track of the page offset of each buffer. Since each page can be referenced by multiple buffer entries, call get_page() as needed to get the proper reference count. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  bnxt_en: Limit RX BD pages to be no bigger than 32K.  (Michael Chan)
The RX BD length field of this device is 16-bit, so the largest buffer size is 65535. For LRO and GRO, we allocate native CPU pages for the aggregation ring buffers. It won't work if the native CPU page size is 64K or bigger. We fix this by defining BNXT_RX_PAGE_SIZE to be the native CPU page size up to 32K. Replace PAGE_SIZE with BNXT_RX_PAGE_SIZE in all appropriate places related to the rx aggregation ring logic. The next patch will add additional logic to divide the page into 32K chunks for aggregation ring buffers if PAGE_SIZE is bigger than BNXT_RX_PAGE_SIZE. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
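A sketch of the cap described above, written here from the commit text rather than copied from the driver:

    /* RX BD length is a 16-bit field, so never hand the hardware more than 32K */
    #define BNXT_RX_PAGE_SHIFT      (PAGE_SHIFT > 15 ? 15 : PAGE_SHIFT)
    #define BNXT_RX_PAGE_SIZE       (1UL << BNXT_RX_PAGE_SHIFT)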
2016-04-27  bnxt_en: Don't fallback to INTA on VF.  (Michael Chan)
Only MSI-X can be used on a VF. The driver should fail initialization if it cannot successfully enable MSI-X. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  i40evf: Add driver support for promiscuous mode  (Anjali Singhai Jain)
Add the necessary Linux Ethernet driver support for promiscuous mode operation. Add a flag so the VF knows it is in promiscuous mode, and two state flags to discretely track the multicast and unicast promiscuous states. Change-Id: Ib2f2dc7a7582304fec90fc917ebb7ded21ba1de4 Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-04-27  i40e: Add VF promiscuous mode driver support  (Anjali Singhai Jain)
Add infrastructure for the Network Function Virtualization VLAN tagged packet steering feature. Change-Id: I9b873d8fcc253858e6baba65ac68ec5b9363944e Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-04-27  i40e: Add promiscuous on VLAN support  (Greg Rose)
NFV use cases require the ability to steer packets to VSIs by VLAN tag alone while being in promiscuous mode for multicast and unicast MAC addresses. These two new functions support that ability. Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>