path: root/include/linux
Age | Commit message | Author
2011-10-05 | drivers/video: fsl-diu-fb: move some definitions out of the header file | Timur Tabi
Move several macros and structures from the Freescale DIU driver's header file into the source file, because they're only used by that file. Also delete a few unused macros. The diu and diu_ad structures cannot be moved because they're being used by the MPC5121 platform file. A future patch will eliminate the need for the platform file to access these structs, so they'll be moved also. Signed-off-by: Timur Tabi <timur@freescale.com> Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
2011-10-05 | drivers/video: fsl-diu-fb: fix some ioctls | Timur Tabi
Use the _IOx macros to define the ioctl commands, instead of hard-coded numbers. Unfortunately, the original definitions of MFB_SET_PIXFMT and MFB_GET_PIXFMT used the wrong value for the size, so these macros have new values now. To avoid breaking binary compatibility with older applications, we retain support for the original values, but the driver displays a warning message if they're used. Also remove the FBIOGET_GWINFO and FBIOPUT_GWINFO ioctls. FBIOPUT_GWINFO was never implemented, and FBIOGET_GWINFO was never used by any application. Signed-off-by: Timur Tabi <timur@freescale.com> Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
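As a rough illustration of the first point, the sketch below defines two commands with the _IOW/_IOR macros from <linux/ioctl.h> and reads back the argument size that the macros encode into the command value. DEMO_MAGIC, the command number 8 and the demo_* names are hypothetical; they are not the fsl-diu-fb driver's actual definitions.

```c
#include <stdint.h>
#include <linux/ioctl.h>   /* _IOR, _IOW, _IOC_SIZE, _IOC_NR */

/* Hypothetical magic number and commands, for illustration only. */
#define DEMO_MAGIC       'D'
#define DEMO_SET_PIXFMT  _IOW(DEMO_MAGIC, 8, uint32_t)
#define DEMO_GET_PIXFMT  _IOR(DEMO_MAGIC, 8, uint32_t)

/* Return the argument size that the _IOx macro encoded into the command. */
static unsigned int demo_ioctl_arg_size(unsigned long cmd)
{
    return _IOC_SIZE(cmd);
}
```

Because the size is part of the command number, correcting a wrong size changes the command's value, which is why a driver that wants to stay binary compatible has to keep accepting the old numbers as well.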
2011-10-04 | dt: add helper to read 64-bit integers | Jamie Iles
Add a helper similar to of_property_read_u32() that handles 64-bit integers. v2/v3: constify device node and property name parameters. Cc: Grant Likely <grant.likely@secretlab.ca> Reviewed-by: Rob Herring <rob.herring@calxeda.com> Signed-off-by: Jamie Iles <jamie@jamieiles.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
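For orientation, a 64-bit device tree value is stored as two consecutive big-endian 32-bit cells. The userspace sketch below shows that decoding step only; demo_read_u64() and its error convention are assumptions made for illustration and are not the kernel helper's implementation or signature.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode a 64-bit value from a raw property buffer holding two
 * big-endian 32-bit cells. Returns 0 on success, -1 if the arguments
 * are invalid or the buffer is too short (hypothetical convention).
 */
static int demo_read_u64(const uint8_t *prop, size_t len, uint64_t *out)
{
    uint64_t v = 0;
    size_t i;

    if (!prop || !out || len < 8)
        return -1;

    /* Most significant byte first, independent of host endianness. */
    for (i = 0; i < 8; i++)
        v = (v << 8) | prop[i];

    *out = v;
    return 0;
}
```

Taking the buffer and output through const and non-const pointers respectively mirrors the constification mentioned in the v2/v3 note.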
2011-10-04 | dw_apb_timer: constify clocksource name | Jamie Iles
The clocksource name should be const for correctness. Cc: John Stultz <johnstul@us.ibm.com> Signed-off-by: Jamie Iles <jamie@jamieiles.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-10-04 | PM / QoS: Add function dev_pm_qos_read_value() (v3) | Rafael J. Wysocki
To read the current PM QoS value for a given device we need to make sure that the device's power.constraints object won't be removed while we're doing that. For this reason, put the operation under dev->power.lock and acquire the lock around the initialization and removal of power.constraints. Moreover, since we're using the value of power.constraints to determine whether or not the object is present, the power.constraints_state field isn't necessary any more and may be removed. However, dev_pm_qos_add_request() needs to check if the device is being removed from the system before allocating a new PM QoS constraints object for it, so make it use the power.power_state field of struct device for this purpose. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-10-04 | Merge git://github.com/davem330/net | Linus Torvalds
* git://github.com/davem330/net:
  pch_gbe: Fixed the issue on which a network freezes
  pch_gbe: Fixed the issue on which PC was frozen when link was downed.
  make PACKET_STATISTICS getsockopt report consistently between ring and non-ring
  net: xen-netback: correctly restart Tx after a VM restore/migrate
  bonding: properly stop queuing work when requested
  can bcm: fix incomplete tx_setup fix
  RDSRDMA: Fix cleanup of rds_iw_mr_pool
  net: Documentation: Fix type of variables
  ibmveth: Fix oops on request_irq failure
  ipv6: nullify ipv6_ac_list and ipv6_fl_list when creating new socket
  cxgb4: Fix EEH on IBM P7IOC
  can bcm: fix tx_setup off-by-one errors
  MAINTAINERS: tehuti: Alexander Indenbaum's address bounces
  dp83640: reduce driver noise
  ptp: fix L2 event message recognition
2011-10-04 | PCI: Disable MPS configuration by default | Jon Mason
Add the ability to disable PCI-E MPS tuning and use the BIOS-configured MPS defaults. Due to the number of issues recently discovered on some x86 chipsets, make this the default behavior. Also, add the option for peer to peer DMA MPS configuration. Peer to peer DMA is outside the scope of this patch, but MPS configuration could prevent it from working by having the MPS on one root port different than the MPS on another. To work around this, simply make the system wide MPS the smallest possible value (128B). Signed-off-by: Jon Mason <mason@myri.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-04 | of: Add helpers to get one string in multiple strings property | Benoit Cousson
Add of_property_read_string_index and of_property_count_strings to retrieve one string inside a property that contains several strings. Signed-off-by: Benoit Cousson <b-cousson@ti.com> Acked-by: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Kevin Hilman <khilman@ti.com>
2011-10-04 | ASoC: Add WM1811 support | Mark Brown
The WM1811 is mostly register compatible with the WM8994 and WM8958, providing a high performance audio hub CODEC in a small form factor suitable for ultra compact system designs. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2011-10-04 | mfd: Add WM1811 support | Mark Brown
The WM1811 is mostly register compatible with the WM8994 and WM8958, providing a high performance audio hub CODEC in a small form factor suitable for ultra compact system designs. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Acked-by: Samuel Ortiz <sameo@linux.intel.com>
2011-10-04 | llist: Remove cpu_relax() usage in cmpxchg loops | Peter Zijlstra
Initial benchmarks show they're a net loss:

  $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i; done
  $ echo 4096 32000 64 128 > /proc/sys/kernel/sem
  $ ./sembench -t 2048 -w 1900 -o 0

  Pre:
    run time 30 seconds 778936 worker burns per second
    run time 30 seconds 912190 worker burns per second
    run time 30 seconds 817506 worker burns per second
    run time 30 seconds 830870 worker burns per second
    run time 30 seconds 845056 worker burns per second

  Post:
    run time 30 seconds 905920 worker burns per second
    run time 30 seconds 849046 worker burns per second
    run time 30 seconds 886286 worker burns per second
    run time 30 seconds 822320 worker burns per second
    run time 30 seconds 900283 worker burns per second

So about 4% faster. (!)

cpu_relax() stalls the pipeline, therefore, when used in a tight loop it has the following benefits:

  - allows SMT siblings to have a go;
  - reduces pressure on the CPU interconnect.

However, cmpxchg loops are unfair and thus have unbounded completion time, therefore we should avoid getting in such heavily contended situations where the above benefits make any difference.

A typical cmpxchg loop should not go round more than a handful of times at worst, therefore adding extra delays just slows things down.

Since the llist primitives are new, there aren't any bad users yet, and we should avoid growing them. Heavily contended sites should generally be better off using the ticket locks for serialization since they provide bounded completion times (fifo-fair over the cpus).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1315836358.26517.43.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-04 | sched: Convert to struct llist | Peter Zijlstra
Use the generic llist primitives. We had a private lockless list implementation in the scheduler in the wake-list code, now that we have a generic llist implementation that provides all required operations, switch to it. This patch is not expected to change any behavior. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Huang Ying <ying.huang@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1315836353.26517.42.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-04 | llist: Add llist_next() | Peter Zijlstra
So we don't have to expose the struct list_node member. Cc: Huang Ying <ying.huang@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1315836348.26517.41.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-04 | irq_work: Use llist in the struct irq_work logic | Huang Ying
Use llist in irq_work instead of the lock-less linked list implementation in irq_work to avoid the code duplication. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1315461646-1379-6-git-send-email-ying.huang@intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-04 | llist: Return whether list is empty before adding in llist_add() | Huang Ying
Extend the llist_add*() functions to return a success indicator, this allows us in the scheduler code to send an IPI if the queue was empty. ( There's no effect on existing users, because the llist_add_xxx() functions are inline, thus this will be optimized out by the compiler if not used by callers. ) Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1315461646-1379-5-git-send-email-ying.huang@intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
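The idea can be sketched in userspace with C11 atomics: a compare-and-exchange retry loop pushes a node onto the head of a singly linked list and reports whether the list was empty beforehand. The demo_* types and functions are hypothetical stand-ins and not the kernel's llist code; the loop carries no pause instruction, in line with the cpu_relax() removal described in this log.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_node {
    struct demo_node *next;
};

struct demo_head {
    _Atomic(struct demo_node *) first;
};

/* Push n onto the list; return true if the list was empty before the add. */
static bool demo_llist_add(struct demo_node *n, struct demo_head *h)
{
    struct demo_node *first = atomic_load(&h->first);

    do {
        n->next = first;
        /* On failure, 'first' is refreshed with the current head. */
    } while (!atomic_compare_exchange_weak(&h->first, &first, n));

    return first == NULL;
}

/* Detach the whole list in one atomic exchange and return its old head. */
static struct demo_node *demo_llist_del_all(struct demo_head *h)
{
    return atomic_exchange(&h->first, (struct demo_node *)NULL);
}
```

A producer could use the return value to decide whether the consumer needs a wakeup: only the add that turns an empty list into a non-empty one has to send the notification.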
2011-10-04 | llist: Move cpu_relax() to after the cmpxchg() | Huang Ying
If in llist_add()/etc. functions the first cmpxchg() call succeeds, it is not necessary to use cpu_relax() before the cmpxchg(). So cpu_relax() in a busy loop involving cmpxchg() should go after cmpxchg() instead of before that. This patch fixes this for all involved llist functions. Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1315461646-1379-4-git-send-email-ying.huang@intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-04 | llist: Remove the platform-dependent NMI checks | Ingo Molnar
Remove the nmi() checks spread around the code. in_nmi() is not available on every architecture and it's a pretty obscure and ugly check in any case. Cc: Huang Ying <ying.huang@intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1315461646-1379-3-git-send-email-ying.huang@intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-04 | llist: Make some llist functions inline | Huang Ying
Because llist code will be used in performance critical scheduler code path, make llist_add() and llist_del_all() inline to avoid function calling overhead and related 'glue' overhead. Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1315461646-1379-2-git-send-email-ying.huang@intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-04 | Merge branch 'linus' into sched/core | Ingo Molnar
Merge reason: pick up the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-03 | net:rfkill: add a gpio setup function into GPIO rfkill | Sangwook Lee
Add a gpio setup function which gives a chance to set up platform specific configuration such as pin multiplexing, input/output direction at the runtime or booting time. Signed-off-by: Sangwook Lee <sangwook.lee@linaro.org> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-10-03 | ipv4: NET_IPV4_ROUTE_GC_INTERVAL removal | Vasily Averin
Remove the obsolete sysctl; the ip_rt_gc_interval variable has not been used since 2.6.38. Signed-off-by: Vasily Averin <vvs@sw.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-03 | Repair wrong named definition aligned_u64 | Jiří Župka
This repairs a problem when compiling a userspace library (libnl). Signed-off-by: Jiří Župka <jzupka@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-03 | tcp: report ECN_SEEN in tcp_info | Eric Dumazet
Allows ss command (iproute2) to display "ecnseen" if at least one packet with ECT(0) or ECT(1) or ECN was received by this socket.

  "ecn" means ECN was negotiated at session establishment (TCP level)
  "ecnseen" means we received at least one packet with ECT fields set (IP level)

  ss -i
  ...
  ESTAB 0 0 192.168.20.110:22 192.168.20.144:38016 ino:5950 sk:f178e400 mem:(r0,w0,f0,t0) ts sack ecn ecnseen bic wscale:7,8 rto:210 rtt:12.5/7.5 cwnd:10 send 9.3Mbps rcv_space:14480

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
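A tool other than ss could make the same distinction by testing the option bits in struct tcp_info. The sketch below assumes the bits are exposed as TCPI_OPT_ECN and TCPI_OPT_ECN_SEEN in <linux/tcp.h>; the demo_* helper names are hypothetical.

```c
#include <stdbool.h>
#include <linux/tcp.h>   /* struct tcp_info, TCPI_OPT_ECN, TCPI_OPT_ECN_SEEN */

/* ECN was negotiated when the session was established (TCP level). */
static bool demo_ecn_negotiated(const struct tcp_info *ti)
{
    return (ti->tcpi_options & TCPI_OPT_ECN) != 0;
}

/* At least one packet with the ECT fields set was received (IP level). */
static bool demo_ecn_seen(const struct tcp_info *ti)
{
    return (ti->tcpi_options & TCPI_OPT_ECN_SEEN) != 0;
}
```

In a real program the structure would typically be filled with getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) on a connected socket before calling these helpers.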
2011-10-03 | genirq: percpu: allow interrupt type to be set at enable time | Marc Zyngier
As request_percpu_irq() doesn't allow for a percpu interrupt to have its type configured (it is generally impossible to configure it on all CPUs at once), add a 'type' argument to enable_percpu_irq(). This allows some low-level, board specific init code to be switched to a generic API. [ tglx: Added WARN_ON argument ] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Abhijeet Dharmapurikar <adharmap@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-10-03 | genirq: Add support for per-cpu dev_id interrupts | Marc Zyngier
The ARM GIC interrupt controller offers per CPU interrupts (PPIs), which are usually used to connect local timers to each core. Each CPU has its own private interface to the GIC, and only sees the PPIs that are directly connected to it.

While these timers are separate devices and have a separate interrupt line to a core, they all use the same IRQ number.

For these devices, request_irq() is not the right API as it assumes that an IRQ number is visible by a number of CPUs (through the affinity setting), but makes it very awkward to express that an IRQ number can be handled by all CPUs, and yet be a different interrupt line on each CPU, requiring a different dev_id cookie to be passed back to the handler.

The *_percpu_irq() functions are designed to overcome these limitations, by providing a per-cpu dev_id vector:

  int request_percpu_irq(unsigned int irq, irq_handler_t handler,
                         const char *devname, void __percpu *percpu_dev_id);
  void free_percpu_irq(unsigned int, void __percpu *);
  int setup_percpu_irq(unsigned int irq, struct irqaction *new);
  void remove_percpu_irq(unsigned int irq, struct irqaction *act);
  void enable_percpu_irq(unsigned int irq);
  void disable_percpu_irq(unsigned int irq);

The API has a number of limitations:

  - no interrupt sharing
  - no threading
  - common handler across all the CPUs

Once the interrupt is requested using setup_percpu_irq() or request_percpu_irq(), it must be enabled by each core that wishes its local interrupt to be delivered.

Based on an initial patch by Thomas Gleixner.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1316793788-14500-2-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-10-03 | writeback: per task dirty rate limit | Wu Fengguang
Add two fields to task_struct.

  1) account dirtied pages in the individual tasks, for accuracy
  2) per-task balance_dirty_pages() call intervals, for flexibility

The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will scale near-sqrt to the safety gap between dirty pages and threshold.

The main problem of per-task nr_dirtied is, if 1k+ tasks start dirtying pages at exactly the same time, each task will be assigned a large initial nr_dirtied_pause, so that the dirty threshold will be exceeded long before each task reached its nr_dirtied_pause and hence call balance_dirty_pages().

The solution is to watch for the number of pages dirtied on each CPU in between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages (3% dirty threshold), force call balance_dirty_pages() for a chance to set bdi->dirty_exceeded. In normal situations, this safeguarding condition is not expected to trigger at all.

On the sqrt in dirty_poll_interval():

It will serve as an initial guess when dirty pages are still in the freerun area.

When dirty pages are floating inside the dirty control scope [freerun, limit], a followup patch will use some refined dirty poll interval to get the desired pause time.

  thresh-dirty (MB)    sqrt
                  1      16
                  2      22
                  4      32
                  8      45
                 16      64
                 32      90
                 64     128
                128     181
                256     256
                512     362
               1024     512

The above table means, given 1MB (or 1GB) gap and the dd tasks polling balance_dirty_pages() on every 16 (or 512) pages, the dirty limit won't be exceeded as long as there are less than 16 (or 512) concurrent dd's.

So sqrt naturally leads to less overheads and more safe concurrent tasks for large memory servers, which have large (thresh-freerun) gaps.

peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
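The sqrt column of the table can be reproduced with an integer square root of the gap expressed in pages. The sketch below assumes 4 KiB pages (256 pages per MB), which is what the first table row implies; demo_isqrt() and demo_poll_interval() are hypothetical names, and the kernel's dirty_poll_interval() may well use a cheaper approximation than an exact square root.

```c
/* Assumption: 4 KiB pages, i.e. 256 pages per MB. */
#define DEMO_PAGE_SIZE     4096UL
#define DEMO_PAGES_PER_MB  (1024UL * 1024UL / DEMO_PAGE_SIZE)

/* Integer square root: largest r such that r * r <= x. */
static unsigned long demo_isqrt(unsigned long x)
{
    unsigned long r = 0;

    while ((r + 1) * (r + 1) <= x)
        r++;
    return r;
}

/* Poll interval in pages for a (thresh - dirty) gap given in MB. */
static unsigned long demo_poll_interval(unsigned long gap_mb)
{
    return demo_isqrt(gap_mb * DEMO_PAGES_PER_MB);
}
```

With N tasks each dirtying up to sqrt(gap) pages between checks, the combined overshoot stays within the gap as long as N is below sqrt(gap), which is the safety argument the table expresses.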
2011-10-03 | writeback: stabilize bdi->dirty_ratelimit | Wu Fengguang
There are some imperfections in balanced_dirty_ratelimit.

1) large fluctuations

The dirty_rate used for computing balanced_dirty_ratelimit is merely averaged in the past 200ms (very small comparing to the 3s estimation period for write_bw), which makes rather dispersed distribution of balanced_dirty_ratelimit.

It's pretty hard to average out the singular points by increasing the estimation period. Considering that the averaging technique will introduce very undesirable time lags, I give it up totally. (btw, the 3s write_bw averaging time lag is much more acceptable because its impact is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular balanced_dirty_ratelimit points can be filtered out by remembering some prev_balanced_rate and prev_prev_balanced_rate. However the more reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate) match could become unbalanced, which may lead to large systematical errors in balanced_dirty_ratelimit. The truncates, due to its possibly bumpy nature, can hardly be compensated smoothly. So let's face it. When some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit high, dirty pages will go higher than the setpoint. task_ratelimit will in turn become lower than dirty_ratelimit. So if we consider both balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit only when they are on the same side of dirty_ratelimit, the systematical errors in balanced_dirty_ratelimit won't be able to bring dirty_ratelimit far away.

The balanced_dirty_ratelimit estimation may also be inaccurate near @limit or @freerun, however is less an issue.

3) since we ultimately want to

  - keep the fluctuations of task ratelimit as small as possible
  - keep the dirty pages around the setpoint as long time as possible

the update policy used for (2) also serves the above goals nicely: if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit), and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit), there is no point to bring up dirty_ratelimit in a hurry only to hurt both the above two goals.

So, we make use of task_ratelimit to limit the update of dirty_ratelimit in two ways:

1) avoid changing dirty rate when it's against the position control target (the adjusted rate will slow down the progress of dirty pages going back to setpoint).

2) limit the step size. task_ratelimit is changing values step by step, leaving a consistent trace comparing to the randomly jumping balanced_dirty_ratelimit. task_ratelimit also has the nice smaller errors in stable state and typically larger errors when there are big errors in rate. So it's a pretty good limiting factor for the step size of dirty_ratelimit.

Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit. task_ratelimit is merely used as a limiting factor.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03 | writeback: dirty rate control | Wu Fengguang
It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N) when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / task_ratelimit;
        sleep(pause);
    }

On every 200ms, update bdi->dirty_ratelimit
===========================================

    bdi_update_dirty_ratelimit()
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
        bdi->dirty_ratelimit = balanced_dirty_ratelimit
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

balanced task_ratelimit
-----------------------

balance_dirty_pages() needs to throttle tasks dirtying pages such that the total amount of dirty pages stays below the specified dirty limit in order to avoid memory deadlocks. Furthermore we desire fairness in that tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the writeout bandwidth, this yields a stable amount of dirty pages:

    dirty_rate == write_bw                                          (1)

The fairness requirement gives us:

    task_ratelimit = balanced_dirty_ratelimit == write_bw / N       (2)

where N is the number of dd tasks. We don't know N beforehand, but still can estimate balanced_dirty_ratelimit within 200ms.

Start by throttling each dd task at rate

    task_ratelimit = task_ratelimit_0                               (3)
    (any non-zero initial value is OK)

After 200ms, we measured

    dirty_rate = # of pages dirtied by all dd's / 200ms
    write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

    dirty_rate == N * task_rate == N * task_ratelimit_0             (4)

Or

    task_ratelimit_0 == dirty_rate / N                              (5)

Now we conclude that the balanced task ratelimit can be estimated by

    balanced_dirty_ratelimit = task_ratelimit_0 * write_bw / dirty_rate    (6)

Because with (4) and (5) we can get the desired equality (1):

    balanced_dirty_ratelimit == (dirty_rate / N) * write_bw / dirty_rate
                             == write_bw / N

Then using the balanced task ratelimit we can compute task pause times like:

    task_pause = task->nr_dirtied / task_ratelimit

task_ratelimit with position control
------------------------------------

However, while the above gives us means of matching the dirty rate to the writeout bandwidth, it at best provides us with a stable dirty page count (assuming a static system). In order to control the dirty page count such that it is high enough to provide performance, but does not exceed the specified limit we need another control.

The dirty position control works by extending (2) to

    task_ratelimit = balanced_dirty_ratelimit * pos_ratio           (7)

where pos_ratio is a negative feedback function that subjects to

    1) f(setpoint) = 1.0
    2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each task a bit more HEAVY than balanced_dirty_ratelimit, so that the dirty pages are created less fast than they are cleaned, thus DROP to the setpoints (and the reverse).

Based on (7) and the assumption that both dirty_ratelimit and pos_ratio remains CONSTANT for the past 200ms, we get

    task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio         (8)

Putting (8) into (6), we get the formula used in bdi_update_dirty_ratelimit():

    balanced_dirty_ratelimit *= pos_ratio * write_bw / dirty_rate   (9)

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
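The estimation in (6), the position-controlled update in (9) and the pause computation can be tried out numerically with a few lines of C. This is only a floating-point sketch under the message's assumptions; the demo_* functions are hypothetical, rates are taken to be in pages per second, and the real kernel code works with integers and additional filtering.

```c
/* Equation (6): balanced rate estimated from one measurement period. */
static double demo_balanced_ratelimit(double task_ratelimit_0,
                                      double write_bw, double dirty_rate)
{
    return task_ratelimit_0 * write_bw / dirty_rate;
}

/* Equation (9): the same update when position control was active. */
static double demo_update_ratelimit(double dirty_ratelimit, double pos_ratio,
                                    double write_bw, double dirty_rate)
{
    return dirty_ratelimit * pos_ratio * write_bw / dirty_rate;
}

/* Pause time: pages the task dirtied divided by its rate limit. */
static double demo_task_pause(double nr_dirtied, double task_ratelimit)
{
    return nr_dirtied / task_ratelimit;
}
```

For example, with N = 4 tasks each throttled at 100 pages/s, dirty_rate is 400 pages/s; if write_bw is 200 pages/s, the estimate is 100 * 200 / 400 = 50 pages/s, which equals write_bw / N as the derivation predicts.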
2011-10-03 | writeback: add bg_threshold parameter to __bdi_update_bandwidth() | Wu Fengguang
No behavior change. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03 | writeback: account per-bdi accumulated dirtied pages | Wu Fengguang
Introduce the BDI_DIRTIED counter. It will be used for estimating the bdi's dirty bandwidth. CC: Jan Kara <jack@suse.cz> CC: Michael Rubin <mrubin@google.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03 | clocksource: fixup ux500 build problems | Linus Walleij
Based on a patch from Arnd Bergmann this fixes up the build problem of assigning a non-existing global when the ux500 PRCMU timer is not linked in by passing its base address to the init function. We also add a missing <linux/errno.h> inclusion and staticize the dummy function. Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2011-10-02 | [SCSI] libsas,libata: fix ->change_queue_{depth|type} for sata devices | Dan Williams
Pass queue_depth change requests to libata, and prevent queue_type changes for ATA devices. Otherwise: 1/ we do not honor the libata specific restrictions on the queue depth 2/ libsas drivers that do not set sdev->tagged_supported are unable to change the queue_depth of ata devices via sysfs Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2011-10-02 | PM / devfreq: Add basic governors | MyungJoo Ham
Four cpufreq-like governors are provided as examples.

  powersave: use the lowest frequency possible. The user (device) should set the polling_ms as 0 because polling is useless for this governor.

  performance: use the highest frequency possible. The user (device) should set the polling_ms as 0 because polling is useless for this governor.

  userspace: use the user specified frequency stored at devfreq.user_set_freq. With sysfs support in the following patch, a user may set the value with the sysfs interface.

  simple_ondemand: simplified version of cpufreq's ondemand governor.

When a user updates OPP entries (enable/disable/add), OPP framework automatically notifies devfreq to update operating frequency accordingly. Thus, devfreq users (device drivers) do not need to update devfreq manually with OPP entry updates or set polling_ms for powersave, performance, userspace, or any other "static" governors.

Note that these are given only as basic examples for governors and any devices with devfreq may implement their own governors with the drivers and use them.

Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Reviewed-by: Mike Turquette <mturquette@ti.com>
Acked-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-10-02 | PM: Introduce devfreq: generic DVFS framework with device-specific OPPs | MyungJoo Ham
With OPPs, a device may have multiple operable frequency and voltage sets. However, there can be multiple possible operable sets and a system will need to choose one from them. In order to reduce the power consumption (by reducing frequency and voltage) without affecting the performance too much, a Dynamic Voltage and Frequency Scaling (DVFS) scheme may be used. This patch introduces the DVFS capability to non-CPU devices with OPPs. DVFS is a technique whereby the frequency and supplied voltage of a device are adjusted on-the-fly. DVFS usually sets the frequency as low as possible with given conditions (such as QoS assurance) and adjusts voltage according to the chosen frequency in order to reduce power consumption and heat dissipation. The generic DVFS for devices, devfreq, may appear quite similar to /drivers/cpufreq. However, cpufreq does not allow multiple devices to be registered and is not suitable for multiple heterogeneous devices with different (but simple) governors. Normally, a DVFS mechanism controls frequency based on the demand for the device, and then chooses voltage based on the chosen frequency. devfreq also controls the frequency based on the governor's frequency recommendation and lets OPP pick the pair of frequency and voltage based on the recommended frequency. Then, the chosen OPP is passed to the device driver's "target" callback. When PM QoS is going to be used with the devfreq device, the device driver should enable OPPs that are appropriate for the current PM QoS requests. In order to do so, the device driver may call opp_enable and opp_disable at the notifier callback of PM QoS so that PM QoS's update_target() call enables the appropriate OPPs. Note that at least one of the OPPs should be enabled at any time; be careful when there is a transition.
Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Reviewed-by: Mike Turquette <mturquette@ti.com> Acked-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-10-01 | Merge branches 'irq-urgent-for-linus', 'x86-urgent-for-linus' and 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip | Linus Torvalds
* 'irq-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  irq: Fix check for already initialized irq_domain in irq_domain_add
  irq: Add declaration of irq_domain_simple_ops to irqdomain.h
* 'x86-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  x86/rtc: Don't recursively acquire rtc_lock
* 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  posix-cpu-timers: Cure SMP wobbles
  sched: Fix up wchan borkage
  sched/rt: Migrate equal priority tasks to available CPUs
2011-09-30 | PM / OPP: Add OPP availability change notifier. | MyungJoo Ham
This patch makes it possible to register a notifier_block for an OPP device in order to be notified of any changes in the availability of the device's OPPs. For example, if a new OPP is inserted or the enable/disable status of an OPP is changed, the notifier is executed. This allows opp_add, opp_enable, and opp_disable to take effect directly on any connected entities such as cpufreq or devfreq. Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Reviewed-by: Mike Turquette <mturquette@ti.com> Reviewed-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-09-30 | nl80211/mac80211: allow adding TDLS peers as stations | Arik Nemtsov
When adding a TDLS peer STA, mark it with a new flag in both nl80211 and mac80211. Before adding a peer, make sure the wiphy supports TDLS and our operating mode is appropriate (managed). In addition, make sure all peers are removed on disassociation. A TDLS peer is first added just before link setup is initiated. In later setup stages we have more info about peer supported rates, capabilities, etc. This info is reported via nl80211_set_station(). Signed-off-by: Arik Nemtsov <arik@wizery.com> Cc: Kalyan C Gaddam <chakkal@iit.edu> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30 | mac80211: handle TDLS high-level commands and frames | Arik Nemtsov
Register and implement the TDLS cfg80211 callback functions. Internally prepare and send TDLS management frames. We incorporate local STA capabilities and supported rates with extra IEs given by usermode. The resulting packet is either encapsulated in a data frame, or assembled as an action frame. It is transmitted either directly or through the AP, as mandated by the TDLS specification. Declare support for the TDLS external setup wiphy capability. This tells usermode to handle link setup and discovery on its own, and use the kernel driver for sending TDLS mgmt packets. Signed-off-by: Arik Nemtsov <arik@wizery.com> Cc: Kalyan C Gaddam <chakkal@iit.edu> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30 | nl80211: support sending TDLS commands/frames | Arik Nemtsov
Add support for sending high-level TDLS commands and TDLS frames via NL80211_CMD_TDLS_OPER and NL80211_CMD_TDLS_MGMT, respectively. Add appropriate cfg80211 callbacks for lower level drivers. Add wiphy capability flags for TDLS support and advertise them via nl80211. Signed-off-by: Arik Nemtsov <arik@wizery.com> Cc: Kalyan C Gaddam <chakkal@iit.edu> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30 | Merge branch 'master' of git://git.infradead.org/users/linville/wireless-next into for-davem | John W. Linville
Conflicts:
  drivers/net/wireless/iwlwifi/iwl-pci.c
  drivers/net/wireless/wl12xx/main.c
2011-09-30iommu/core: let drivers know if an iommu fault handler isn't installedOhad Ben-Cohen
Make report_iommu_fault() return -ENOSYS whenever an iommu fault handler isn't installed, so IOMMU drivers can then perform their own platform-specific default behavior if they want to. Fault handlers can still return -ENOSYS in case they want to elicit the default behavior of the IOMMU drivers. Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2011-09-30regmap: Implement regcache_cache_bypass helper functionDimitris Papastamos
Ensure we've got a function so users can enable/disable the cache bypass option. Signed-off-by: Dimitris Papastamos <dp@opensource.wolfsonmicro.com> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2011-09-30posix-cpu-timers: Cure SMP wobblesPeter Zijlstra
David reported:

Attached below is a watered-down version of rt/tst-cpuclock2.c from GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or similar.

Run it several times, and you will see cases where the main thread will measure a process clock difference before and after the nanosleep which is smaller than the cpu-burner thread's individual thread clock difference. This doesn't make any sense since the cpu-burner thread is part of the top-level process's thread group.

I've reproduced this on both x86-64 and sparc64 (using both 32-bit and 64-bit binaries). For example:

[davem@boricha build-x86_64-linux]$ ./test
process: before(0.001221967) after(0.498624371) diff(497402404)
thread: before(0.000081692) after(0.498316431) diff(498234739)
self: before(0.001223521) after(0.001240219) diff(16698)
[davem@boricha build-x86_64-linux]$

The diff of 'process' should always be >= the diff of 'thread'. I make sure to wrap the 'thread' clock measurements the most tightly around the nanosleep() call, and that the 'process' clock measurements are the outer-most ones.
---
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>

static pthread_barrier_t barrier;

static void *chew_cpu(void *arg)
{
    pthread_barrier_wait(&barrier);
    while (1)
        __asm__ __volatile__("" : : : "memory");
    return NULL;
}

int main(void)
{
    clockid_t process_clock, my_thread_clock, th_clock;
    struct timespec process_before, process_after;
    struct timespec me_before, me_after;
    struct timespec th_before, th_after;
    struct timespec sleeptime;
    unsigned long diff;
    pthread_t th;
    int err;

    err = clock_getcpuclockid(0, &process_clock);
    if (err)
        return 1;

    err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
    if (err)
        return 1;

    pthread_barrier_init(&barrier, NULL, 2);
    err = pthread_create(&th, NULL, chew_cpu, NULL);
    if (err)
        return 1;

    err = pthread_getcpuclockid(th, &th_clock);
    if (err)
        return 1;

    pthread_barrier_wait(&barrier);

    err = clock_gettime(process_clock, &process_before);
    if (err)
        return 1;

    err = clock_gettime(my_thread_clock, &me_before);
    if (err)
        return 1;

    err = clock_gettime(th_clock, &th_before);
    if (err)
        return 1;

    sleeptime.tv_sec = 0;
    sleeptime.tv_nsec = 500000000;
    nanosleep(&sleeptime, NULL);

    err = clock_gettime(th_clock, &th_after);
    if (err)
        return 1;

    err = clock_gettime(my_thread_clock, &me_after);
    if (err)
        return 1;

    err = clock_gettime(process_clock, &process_after);
    if (err)
        return 1;

    diff = process_after.tv_nsec - process_before.tv_nsec;
    printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
           process_before.tv_sec, process_before.tv_nsec,
           process_after.tv_sec, process_after.tv_nsec, diff);

    diff = th_after.tv_nsec - th_before.tv_nsec;
    printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
           th_before.tv_sec, th_before.tv_nsec,
           th_after.tv_sec, th_after.tv_nsec, diff);

    diff = me_after.tv_nsec - me_before.tv_nsec;
    printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
           me_before.tv_sec, me_before.tv_nsec,
           me_after.tv_sec, me_after.tv_nsec, diff);

    return 0;
}

This is due to us using p->se.sum_exec_runtime in thread_group_cputime() where we iterate the thread group and sum all data. This does not take time since the last schedule operation (tick or otherwise) into account. We can cure this by using task_sched_runtime() at the cost of having to take locks.

This also means we can (and must) do away with thread_group_sched_runtime() since the modified thread_group_cputime() is now more accurate and would deadlock when called from thread_group_sched_runtime().

Aside from that, it makes the function safe on 32-bit systems. The old code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a 64-bit value and could be changed on another CPU at the same time.

Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins Tested-by: David Miller <davem@davemloft.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-29OF: Add of_match_ptr() macroBen Dooks
Add a macro of_match_ptr() that allows the .of_match_table entry in the driver structures to be assigned without needing an #ifdef that falls back to NULL for the case where OF is not enabled. Signed-off-by: Ben Dooks <ben-linux@fluff.org> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2011-09-29user namespace: usb: make usb urbs user namespace aware (v2)Serge Hallyn
Add to the dev_state and alloc_async structures the user namespace corresponding to the uid and euid. Pass these to kill_pid_info_as_uid(), which can then implement a proper, user-namespace-aware uid check. Changelog: Sep 20: Per Oleg's suggestion: Instead of caching and passing user namespace, uid, and euid each separately, pass a struct cred. Sep 26: Address Alan Stern's comments: don't define a struct cred at usbdev_open(), and take and put a cred at async_completed() to ensure it lasts for the duration of kill_pid_info_as_cred(). Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-09-28rcu: Simplify unboosting checksPaul E. McKenney
Commit 7765be (Fix RCU_BOOST race handling current->rcu_read_unlock_special) introduced a new ->rcu_boosted field in the task structure. This is redundant because the existing ->rcu_boost_mutex will be non-NULL at any time that ->rcu_boosted is nonzero. Therefore, this commit removes ->rcu_boosted and tests ->rcu_boost_mutex instead. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28rcu: Move __rcu_read_unlock()'s barrier() within if-statementPaul E. McKenney
We only need to constrain the compiler if we are actually exiting the top-level RCU read-side critical section. This commit therefore moves the first barrier() call in __rcu_read_unlock() to inside the "if" statement, thus avoiding needless register flushes for inner rcu_read_unlock() calls. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28rcu: Improve rcu_assign_pointer() and RCU_INIT_POINTER() documentationPaul E. McKenney
The differences between rcu_assign_pointer() and RCU_INIT_POINTER() are subtle, and it is easy to use the cheaper RCU_INIT_POINTER() when the more-expensive rcu_assign_pointer() should have been used instead. The consequences of this mistake are quite severe. This commit therefore carefully lays out the situations in which it is permissible to use RCU_INIT_POINTER(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28rcu: Make rcu_assign_pointer() unconditionally insert a memory barrierEric Dumazet
Recent changes to gcc give warning messages on rcu_assign_pointer()'s checks that allow it to determine when it is OK to omit the memory barrier. Stephen Hemminger tried a number of gcc tricks to silence this warning, but #pragmas and CPP macros do not work together in the way that would be required to make this work. However, we now have RCU_INIT_POINTER(), which already omits this memory barrier, and which therefore may be used when assigning NULL to an RCU-protected pointer that is accessible to readers. This commit therefore makes rcu_assign_pointer() unconditionally emit the memory barrier. Reported-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28nohz: Remove nohz_cpu_maskShi, Alex
RCU no longer uses this global variable, nor does anyone else. This commit therefore removes this variable. This reduces memory footprint and also removes some atomic instructions and memory barriers from the dyntick-idle path. Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>