linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2011-10-03	Repair wrong named definition aligned_u64	Jiří Župka
	This repairs problem with compile library in userspace (libnl). Signed-off-by: Jiří Župka <jzupka@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-03	tcp: report ECN_SEEN in tcp_info	Eric Dumazet
	Allows ss command (iproute2) to display "ecnseen" if at least one packet with ECT(0) or ECT(1) or ECN was received by this socket. "ecn" means ECN was negotiated at session establishment (TCP level) "ecnseen" means we received at least one packet with ECT fields set (IP level) ss -i ... ESTAB 0 0 192.168.20.110:22 192.168.20.144:38016 ino:5950 sk:f178e400 mem:(r0,w0,f0,t0) ts sack ecn ecnseen bic wscale:7,8 rto:210 rtt:12.5/7.5 cwnd:10 send 9.3Mbps rcv_space:14480 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-03	ore: Make ore_striping_info and ore_calc_stripe_info public	Boaz Harrosh
	The struct ore_striping_info will be used later in other structures. And ore_calc_stripe_info as well. Rename them make struct ore_striping_info public. ore_calc_stripe_info is still static, will be made public on first use. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-03	exofs: Remove unused data_map member from exofs_sb_info	Boaz Harrosh
	The struct pnfs_osd_data_map data_map member of exofs_sb_info was never used after mount. In fact all it's members were duplicated by the ore_layout structure. So just remove the duplicated information. Also removed some stupid, but perfectly supported, restrictions on layout parameters. The case where num_devices is not divisible by mirror_count+1 is perfectly fine since the rotating device view will eventually use all the devices it can get. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com>
2011-10-03	exofs: Rename struct ore_components comps => oc	Boaz Harrosh
	ore_components already has a comps member so this leads to things like comps->comps which is annoying. the name oc was already used in new code. So rename all old usage of ore_components comps => ore_components oc. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-03	genirq: percpu: allow interrupt type to be set at enable time	Marc Zyngier
	As request_percpu_irq() doesn't allow for a percpu interrupt to have its type configured (it is generally impossible to configure it on all CPUs at once), add a 'type' argument to enable_percpu_irq(). This allows some low-level, board specific init code to be switched to a generic API. [ tglx: Added WARN_ON argument ] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Abhijeet Dharmapurikar <adharmap@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-10-03	genirq: Add support for per-cpu dev_id interrupts	Marc Zyngier
	The ARM GIC interrupt controller offers per CPU interrupts (PPIs), which are usually used to connect local timers to each core. Each CPU has its own private interface to the GIC, and only sees the PPIs that are directly connect to it. While these timers are separate devices and have a separate interrupt line to a core, they all use the same IRQ number. For these devices, request_irq() is not the right API as it assumes that an IRQ number is visible by a number of CPUs (through the affinity setting), but makes it very awkward to express that an IRQ number can be handled by all CPUs, and yet be a different interrupt line on each CPU, requiring a different dev_id cookie to be passed back to the handler. The _percpu_irq() functions is designed to overcome these limitations, by providing a per-cpu dev_id vector: int request_percpu_irq(unsigned int irq, irq_handler_t handler, const char devname, void __percpu percpu_dev_id); void free_percpu_irq(unsigned int, void __percpu ); int setup_percpu_irq(unsigned int irq, struct irqaction new); void remove_percpu_irq(unsigned int irq, struct irqaction act); void enable_percpu_irq(unsigned int irq); void disable_percpu_irq(unsigned int irq); The API has a number of limitations: - no interrupt sharing - no threading - common handler across all the CPUs Once the interrupt is requested using setup_percpu_irq() or request_percpu_irq(), it must be enabled by each core that wishes its local interrupt to be delivered. Based on an initial patch by Thomas Gleixner. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/1316793788-14500-2-git-send-email-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-10-02	PM / devfreq: Add basic governors	MyungJoo Ham
	Four cpufreq-like governors are provided as examples. powersave: use the lowest frequency possible. The user (device) should set the polling_ms as 0 because polling is useless for this governor. performance: use the highest freqeuncy possible. The user (device) should set the polling_ms as 0 because polling is useless for this governor. userspace: use the user specified frequency stored at devfreq.user_set_freq. With sysfs support in the following patch, a user may set the value with the sysfs interface. simple_ondemand: simplified version of cpufreq's ondemand governor. When a user updates OPP entries (enable/disable/add), OPP framework automatically notifies devfreq to update operating frequency accordingly. Thus, devfreq users (device drivers) do not need to update devfreq manually with OPP entry updates or set polling_ms for powersave , performance, userspace, or any other "static" governors. Note that these are given only as basic examples for governors and any devices with devfreq may implement their own governors with the drivers and use them. Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Reviewed-by: Mike Turquette <mturquette@ti.com> Acked-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-10-02	PM: Introduce devfreq: generic DVFS framework with device-specific OPPs	MyungJoo Ham
	With OPPs, a device may have multiple operable frequency and voltage sets. However, there can be multiple possible operable sets and a system will need to choose one from them. In order to reduce the power consumption (by reducing frequency and voltage) without affecting the performance too much, a Dynamic Voltage and Frequency Scaling (DVFS) scheme may be used. This patch introduces the DVFS capability to non-CPU devices with OPPs. DVFS is a techique whereby the frequency and supplied voltage of a device is adjusted on-the-fly. DVFS usually sets the frequency as low as possible with given conditions (such as QoS assurance) and adjusts voltage according to the chosen frequency in order to reduce power consumption and heat dissipation. The generic DVFS for devices, devfreq, may appear quite similar with /drivers/cpufreq. However, cpufreq does not allow to have multiple devices registered and is not suitable to have multiple heterogenous devices with different (but simple) governors. Normally, DVFS mechanism controls frequency based on the demand for the device, and then, chooses voltage based on the chosen frequency. devfreq also controls the frequency based on the governor's frequency recommendation and let OPP pick up the pair of frequency and voltage based on the recommended frequency. Then, the chosen OPP is passed to device driver's "target" callback. When PM QoS is going to be used with the devfreq device, the device driver should enable OPPs that are appropriate with the current PM QoS requests. In order to do so, the device driver may call opp_enable and opp_disable at the notifier callback of PM QoS so that PM QoS's update_target() call enables the appropriate OPPs. Note that at least one of OPPs should be enabled at any time; be careful when there is a transition. Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Reviewed-by: Mike Turquette <mturquette@ti.com> Acked-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-10-01	Merge branches 'irq-urgent-for-linus', 'x86-urgent-for-linus' and ↵	Linus Torvalds
	'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip * 'irq-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip: irq: Fix check for already initialized irq_domain in irq_domain_add irq: Add declaration of irq_domain_simple_ops to irqdomain.h * 'x86-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip: x86/rtc: Don't recursively acquire rtc_lock * 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip: posix-cpu-timers: Cure SMP wobbles sched: Fix up wchan borkage sched/rt: Migrate equal priority tasks to available CPUs
2011-10-01	Merge branch 'rcu/next' of git://github.com/paulmckrcu/linux into core/rcu	Ingo Molnar

2011-09-30	PM / OPP: Add OPP availability change notifier.	MyungJoo Ham
	The patch enables to register notifier_block for an OPP-device in order to get notified for any changes in the availability of OPPs of the device. For example, if a new OPP is inserted or enable/disable status of an OPP is changed, the notifier is executed. This enables the usage of opp_add, opp_enable, and opp_disable to directly take effect with any connected entities such as cpufreq or devfreq. Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Reviewed-by: Mike Turquette <mturquette@ti.com> Reviewed-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-09-30	mac80211: document client powersave	Johannes Berg
	With the addition of uAPSD and driver buffering the powersave handling has gotten quite complex. Add a section to the documentation to explain it for anyone wanting to implement it. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30	mac80211: allow out-of-band EOSP notification	Johannes Berg
	iwlwifi has a separate EOSP notification from the device, and to make use of that properly it needs to be passed to mac80211. To be able to mix with tx_status_irqsafe and rx_irqsafe it also needs to be an "_irqsafe" version in the sense that it goes through the tasklet, the actual flag clearing would be IRQ-safe but doing it directly would cause reordering issues. This is needed in the case of a P2P GO going into an absence period without transmitting any frames that should be driver-released as in this case there's no other way to inform mac80211 that the service period ended. Note that for drivers that don't use the _irqsafe functions another version of this function will be required. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30	mac80211: explicitly notify drivers of frame release	Johannes Berg
	iwlwifi needs to know the number of frames that are going to be sent to a station while it is asleep so it can properly handle the uCode blocking of that station. Before uAPSD, we got by by telling the device that a single frame was going to be released whenever we encountered IEEE80211_TX_CTL_POLL_RESPONSE. With uAPSD, however, that is no longer possible since there could be more than a single frame. To support this model, add a new callback to notify drivers when frames are going to be released. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30	mac80211: reply only once to each PS-poll	Johannes Berg
	If a PS-poll frame is retried (but was received) there is no way to detect that since it has no sequence number. As a consequence, the standard asks us to not react to PS-poll frames until the response to one made it out (was ACKed or lost). Implement this by using the WLAN_STA_SP flags to also indicate a PS-Poll "service period" and the IEEE80211_TX_STATUS_EOSP flag for the response packet to indicate the end of the "SP" as usual. We could use separate flags, but that will most likely completely confuse drivers, and while the standard doesn't exclude simultaneously polling using uAPSD and PS-Poll, doing that seems quite problematic. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30	mac80211: implement uAPSD	Johannes Berg
	Add uAPSD support to mac80211. This is probably not possible with all devices, so advertising it with the cfg80211 flag will be left up to drivers that want it. Due to my previous patches it is now a fairly straight-forward extension. Drivers need to have accurate TX status reporting for the EOSP frame. For drivers that buffer themselves, the provided APIs allow releasing the right number of frames, but then drivers need to set EOSP and more-data themselves. This is documented in more detail in the new code itself. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30	mac80211: allow releasing driver-buffered frames	Johannes Berg
	If there are frames for a station buffered in the driver, mac80211 announces those in the TIM IE but there's no way to release them. Add new API to release such frames and use it when the station polls for a frame. Since the API will soon also be used for uAPSD it is easily extensible. Note that before this change drivers announcing driver-buffered frames in the TIM bit actually will respond to a PS-Poll with a potentially lower priority frame (if there are any frames buffered in mac80211), after this patch a driver that hasn't been changed will no longer respond at all. This only affects ath9k, which will need to be fixed to implement the new API. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30	mac80211: split PS buffers into ACs	Johannes Berg
	For uAPSD support we'll need to have per-AC PS buffers. As this is a major undertaking, split the buffers before really adding support for uAPSD. This already makes some reference to the uapsd_queues variable, but for now that will never be non-zero. Since book-keeping is complicated, also change the logic for keeping a maximum of frames only and allow 64 frames per AC (up from 128 for a station). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30	mac80211: let drivers inform it about per TID buffered frames	Johannes Berg
	For uAPSD implementation, it is necessary to know on which ACs frames are buffered. mac80211 obviously knows about the frames it has buffered itself, but with aggregation many drivers buffer frames. Thus, mac80211 needs to be informed about this. For now, since we don't have APSD in any form, this will unconditionally set the TIM bit for the station but later with uAPSD only some ACs might cause the TIM bit to be set. ath9k is the only driver using this API and I only modify it in the most basic way, it won't be able to implement uAPSD with this yet. But it can't do that anyway since there's no way to selectively release frames to the peer yet. Since drivers will buffer frames per TID, let them inform mac80211 on a per TID basis, mac80211 will then sort out the AC mapping itself. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30	nl80211/mac80211: allow adding TDLS peers as stations	Arik Nemtsov
	When adding a TDLS peer STA, mark it with a new flag in both nl80211 and mac80211. Before adding a peer, make sure the wiphy supports TDLS and our operating mode is appropriate (managed). In addition, make sure all peers are removed on disassociation. A TDLS peer is first added just before link setup is initiated. In later setup stages we have more info about peer supported rates, capabilities, etc. This info is reported via nl80211_set_station(). Signed-off-by: Arik Nemtsov <arik@wizery.com> Cc: Kalyan C Gaddam <chakkal@iit.edu> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30	mac80211: handle TDLS high-level commands and frames	Arik Nemtsov
	Register and implement the TDLS cfg80211 callback functions. Internally prepare and send TDLS management frames. We incorporate local STA capabilities and supported rates with extra IEs given by usermode. The resulting packet is either encapsulated in a data frame, or assembled as an action frame. It is transmitted either directly or through the AP, as mandated by the TDLS specification. Declare support for the TDLS external setup wiphy capability. This tells usermode to handle link setup and discovery on its own, and use the kernel driver for sending TDLS mgmt packets. Signed-off-by: Arik Nemtsov <arik@wizery.com> Cc: Kalyan C Gaddam <chakkal@iit.edu> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30	mac80211: standardize adding supported rates IEs	Arik Nemtsov
	Relocate the mesh implementation of adding the (extended) supported rates IE to util.c, anticipating its use by other parts of mac80211. Signed-off-by: Arik Nemtsov <arik@wizery.com> Cc: Kalyan C Gaddam <chakkal@iit.edu> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30	nl80211: support sending TDLS commands/frames	Arik Nemtsov
	Add support for sending high-level TDLS commands and TDLS frames via NL80211_CMD_TDLS_OPER and NL80211_CMD_TDLS_MGMT, respectively. Add appropriate cfg80211 callbacks for lower level drivers. Add wiphy capability flags for TDLS support and advertise them via nl80211. Signed-off-by: Arik Nemtsov <arik@wizery.com> Cc: Kalyan C Gaddam <chakkal@iit.edu> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30	cfg80211/mac80211: apply station uAPSD parameters selectively	Johannes Berg
	Currently, when hostapd sets the station as authorized we also overwrite its uAPSD parameter. This obviously leads to buggy behaviour (later, with my patches that actually add uAPSD support). To fix this, only apply those parameters if they were actually set in nl80211, and to achieve that add a bitmap of things to apply. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-09-30	Merge branch 'master' of ↵	John W. Linville
	git://git.infradead.org/users/linville/wireless-next into for-davem Conflicts: drivers/net/wireless/iwlwifi/iwl-pci.c drivers/net/wireless/wl12xx/main.c
2011-09-30	regmap: Implement regcache_cache_bypass helper function	Dimitris Papastamos
	Ensure we've got a function so users can enable/disable the cache bypass option. Signed-off-by: Dimitris Papastamos <dp@opensource.wolfsonmicro.com> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2011-09-30	posix-cpu-timers: Cure SMP wobbles	Peter Zijlstra
	David reported: Attached below is a watered-down version of rt/tst-cpuclock2.c from GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or similar. Run it several times, and you will see cases where the main thread will measure a process clock difference before and after the nanosleep which is smaller than the cpu-burner thread's individual thread clock difference. This doesn't make any sense since the cpu-burner thread is part of the top-level process's thread group. I've reproduced this on both x86-64 and sparc64 (using both 32-bit and 64-bit binaries). For example: [davem@boricha build-x86_64-linux]$ ./test process: before(0.001221967) after(0.498624371) diff(497402404) thread: before(0.000081692) after(0.498316431) diff(498234739) self: before(0.001223521) after(0.001240219) diff(16698) [davem@boricha build-x86_64-linux]$ The diff of 'process' should always be >= the diff of 'thread'. I make sure to wrap the 'thread' clock measurements the most tightly around the nanosleep() call, and that the 'process' clock measurements are the outer-most ones. --- #include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include <fcntl.h> #include <string.h> #include <errno.h> #include <pthread.h> static pthread_barrier_t barrier; static void chew_cpu(void arg) { pthread_barrier_wait(&barrier); while (1) __asm__ __volatile__("" : : : "memory"); return NULL; } int main(void) { clockid_t process_clock, my_thread_clock, th_clock; struct timespec process_before, process_after; struct timespec me_before, me_after; struct timespec th_before, th_after; struct timespec sleeptime; unsigned long diff; pthread_t th; int err; err = clock_getcpuclockid(0, &process_clock); if (err) return 1; err = pthread_getcpuclockid(pthread_self(), &my_thread_clock); if (err) return 1; pthread_barrier_init(&barrier, NULL, 2); err = pthread_create(&th, NULL, chew_cpu, NULL); if (err) return 1; err = pthread_getcpuclockid(th, &th_clock); if (err) return 1; pthread_barrier_wait(&barrier); err = clock_gettime(process_clock, &process_before); if (err) return 1; err = clock_gettime(my_thread_clock, &me_before); if (err) return 1; err = clock_gettime(th_clock, &th_before); if (err) return 1; sleeptime.tv_sec = 0; sleeptime.tv_nsec = 500000000; nanosleep(&sleeptime, NULL); err = clock_gettime(th_clock, &th_after); if (err) return 1; err = clock_gettime(my_thread_clock, &me_after); if (err) return 1; err = clock_gettime(process_clock, &process_after); if (err) return 1; diff = process_after.tv_nsec - process_before.tv_nsec; printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n", process_before.tv_sec, process_before.tv_nsec, process_after.tv_sec, process_after.tv_nsec, diff); diff = th_after.tv_nsec - th_before.tv_nsec; printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n", th_before.tv_sec, th_before.tv_nsec, th_after.tv_sec, th_after.tv_nsec, diff); diff = me_after.tv_nsec - me_before.tv_nsec; printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n", me_before.tv_sec, me_before.tv_nsec, me_after.tv_sec, me_after.tv_nsec, diff); return 0; } This is due to us using p->se.sum_exec_runtime in thread_group_cputime() where we iterate the thread group and sum all data. This does not take time since the last schedule operation (tick or otherwise) into account. We can cure this by using task_sched_runtime() at the cost of having to take locks. This also means we can (and must) do away with thread_group_sched_runtime() since the modified thread_group_cputime() is now more accurate and would deadlock when called from thread_group_sched_runtime(). Aside of that it makes the function safe on 32 bit systems. The old code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a 64bit value and could be changed on another cpu at the same time. Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins Tested-by: David Miller <davem@davemloft.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-29	OF: Add of_match_ptr() macro	Ben Dooks
	Add a macro of_match_ptr() that allows the .of_match_table entry in the driver structures to be assigned without having an #ifdef xxx NULL for the case that OF is not enabled Signed-off-by: Ben Dooks <ben-linux@fluff.org> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2011-09-29	user namespace: usb: make usb urbs user namespace aware (v2)	Serge Hallyn
	Add to the dev_state and alloc_async structures the user namespace corresponding to the uid and euid. Pass these to kill_pid_info_as_uid(), which can then implement a proper, user-namespace-aware uid check. Changelog: Sep 20: Per Oleg's suggestion: Instead of caching and passing user namespace, uid, and euid each separately, pass a struct cred. Sep 26: Address Alan Stern's comments: don't define a struct cred at usbdev_open(), and take and put a cred at async_completed() to ensure it lasts for the duration of kill_pid_info_as_cred(). Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-09-29	xen: allow balloon driver to use more than one memory region	David Vrabel
	Allow the xen balloon driver to populate its list of extra pages from more than one region of memory. This will allow platforms to provide (for example) a region of low memory and a region of high memory. The maximum possible number of extra regions is 128 (== E820MAX) which is quite large so xen_extra_mem is placed in __initdata. This is safe as both xen_memory_setup() and balloon_init() are in __init. The balloon regions themselves are not altered (i.e., there is still only the one region). Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-09-29	xen/balloon: account for pages released during memory setup	David Vrabel
	In xen_memory_setup() pages that occur in gaps in the memory map are released back to Xen. This reduces the domain's current page count in the hypervisor. The Xen balloon driver does not correctly decrease its initial current_pages count to reflect this. If 'delta' pages are released and the target is adjusted the resulting reservation is always 'delta' less than the requested target. This affects dom0 if the initial allocation of pages overlaps the PCI memory region but won't affect most domU guests that have been setup with pseudo-physical memory maps that don't have gaps. Fix this by accouting for the released pages when starting the balloon driver. If the domain's targets are managed by xapi, the domain may eventually run out of memory and die because xapi currently gets its target calculations wrong and whenever it is restarted it always reduces the target by 'delta'. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-09-29	xen: modify kernel mappings corresponding to granted pages	Stefano Stabellini
	If we want to use granted pages for AIO, changing the mappings of a user vma and the corresponding p2m is not enough, we also need to update the kernel mappings accordingly. Currently this is only needed for pages that are created for user usages through /dev/xen/gntdev. As in, pages that have been in use by the kernel and use the P2M will not need this special mapping. However there are no guarantees that in the future the kernel won't start accessing pages through the 1:1 even for internal usage. In order to avoid the complexity of dealing with highmem, we allocated the pages lowmem. We issue a HYPERVISOR_grant_table_op right away in m2p_add_override and we remove the mappings using another HYPERVISOR_grant_table_op in m2p_remove_override. Considering that m2p_add_override and m2p_remove_override are called once per page we use multicalls and hypercall batching. Use the kmap_op pointer directly as argument to do the mapping as it is guaranteed to be present up until the unmapping is done. Before issuing any unmapping multicalls, we need to make sure that the mapping has already being done, because we need the kmap->handle to be set correctly. Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> [v1: Removed GRANT_FRAME_BIT usage] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-09-29	xen: add an "highmem" parameter to alloc_xenballooned_pages	Stefano Stabellini
	Add an highmem parameter to alloc_xenballooned_pages, to allow callers to request lowmem or highmem pages. Fix the code style of free_xenballooned_pages' prototype. Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-09-28	rcu: Simplify unboosting checks	Paul E. McKenney
	Commit 7765be (Fix RCU_BOOST race handling current->rcu_read_unlock_special) introduced a new ->rcu_boosted field in the task structure. This is redundant because the existing ->rcu_boost_mutex will be non-NULL at any time that ->rcu_boosted is nonzero. Therefore, this commit removes ->rcu_boosted and tests ->rcu_boost_mutex instead. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28	rcu: Move __rcu_read_unlock()'s barrier() within if-statement	Paul E. McKenney
	We only need to constrain the compiler if we are actually exiting the top-level RCU read-side critical section. This commit therefore moves the first barrier() cal in __rcu_read_unlock() to inside the "if" statement, thus avoiding needless register flushes for inner rcu_read_unlock() calls. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28	rcu: Improve rcu_assign_pointer() and RCU_INIT_POINTER() documentation	Paul E. McKenney
	The differences between rcu_assign_pointer() and RCU_INIT_POINTER() are subtle, and it is easy to use the the cheaper RCU_INIT_POINTER() when the more-expensive rcu_assign_pointer() should have been used instead. The consequences of this mistake are quite severe. This commit therefore carefully lays out the situations in which it it permissible to use RCU_INIT_POINTER(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28	rcu: Make rcu_assign_pointer() unconditionally insert a memory barrier	Eric Dumazet
	Recent changes to gcc give warning messages on rcu_assign_pointers()'s checks that allow it to determine when it is OK to omit the memory barrier. Stephen Hemminger tried a number of gcc tricks to silence this warning, but #pragmas and CPP macros do not work together in the way that would be required to make this work. However, we now have RCU_INIT_POINTER(), which already omits this memory barrier, and which therefore may be used when assigning NULL to an RCU-protected pointer that is accessible to readers. This commit therefore makes rcu_assign_pointer() unconditionally emit the memory barrier. Reported-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28	nohz: Remove nohz_cpu_mask	Shi, Alex
	RCU no longer uses this global variable, nor does anyone else. This commit therefore removes this variable. This reduces memory footprint and also removes some atomic instructions and memory barriers from the dyntick-idle path. Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28	rcu: Remove unused and redundant interfaces	Paul E. McKenney
	The rcu_dereference_bh_protected() and rcu_dereference_sched_protected() macros are synonyms for rcu_dereference_protected() and are not used anywhere in mainline. This commit therefore removes them. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28	rcu: Add grace-period, quiescent-state, and call_rcu trace events	Paul E. McKenney
	Add trace events to record grace-period start and end, quiescent states, CPUs noticing grace-period start and end, grace-period initialization, call_rcu() invocation, tasks blocking in RCU read-side critical sections, tasks exiting those same critical sections, force_quiescent_state() detection of dyntick-idle and offline CPUs, CPUs entering and leaving dyntick-idle mode (except from NMIs), CPUs coming online and going offline, and CPUs being kicked for staying in dyntick-idle mode for too long (as in many weeks, even on 32-bit systems). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> rcu: Add the rcu flavor to callback trace events The earlier trace events for registering RCU callbacks and for invoking them did not include the RCU flavor (rcu_bh, rcu_preempt, or rcu_sched). This commit adds the RCU flavor to those trace events. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28	rcu: Make TINY_RCU also use softirq for RCU_BOOST=n	Paul E. McKenney
	This patch #ifdefs TINY_RCU kthreads out of the kernel unless RCU_BOOST=y, thus eliminating context-switch overhead if RCU priority boosting has not been configured. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28	rcu: Add event-trace markers to TREE_RCU kthreads	Paul E. McKenney
	Add event-trace markers to TREE_RCU kthreads to allow including these kthread's CPU time in the utilization calculations. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28	rcu: Add RCU type to callback-invocation tracing	Paul E. McKenney
	Add a string to the rcu_batch_start() and rcu_batch_end() trace messages that indicates the RCU type ("rcu_sched", "rcu_bh", or "rcu_preempt"). The trace messages for the actual invocations themselves are not marked, as it should be clear from the rcu_batch_start() and rcu_batch_end() events before and after. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28	rcu: Event-trace markers for computing RCU CPU utilization	Paul E. McKenney
	This commit adds the trace_rcu_utilization() marker that is to be used to allow postprocessing scripts compute RCU's CPU utilization, give or take event-trace overhead. Note that we do not include RCU's dyntick-idle interface because event tracing requires RCU protection, which is not available in dyntick-idle mode. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28	rcu: Add event-tracing for RCU callback invocation	Paul E. McKenney
	There was recently some controversy about the overhead of invoking RCU callbacks. Add TRACE_EVENT()s to obtain fine-grained timings for the start and stop of a batch of callbacks and also for each callback invoked. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28	rcu: Abstract common code for RCU grace-period-wait primitives	Paul E. McKenney
	Pull the code that waits for an RCU grace period into a single function, which is then called by synchronize_rcu() and friends in the case of TREE_RCU and TREE_PREEMPT_RCU, and from rcu_barrier() and friends in the case of TINY_RCU and TINY_PREEMPT_RCU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-28	rcu: Move rcu_head definition to types.h	Paul E. McKenney
	Take a first step towards untangling Linux kernel header files by placing the struct rcu_head definition into include/linux/types.h and including include/linux/types.h in include/linux/rcupdate.h where struct rcu_head used to be defined. The actual inclusion point for include/linux/types.h is with the rest of the #include directives rather than at the point where struct rcu_head used to be defined, as suggested by Mathieu Desnoyers. Once this is in place, then header files that need only rcu_head can include types.h rather than rcupdate.h. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
2011-09-28	rcu: Restore checks for blocking in RCU read-side critical sections	Paul E. McKenney
	Long ago, using TREE_RCU with PREEMPT would result in "scheduling while atomic" diagnostics if you blocked in an RCU read-side critical section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats this diagnostic. This commit therefore adds a replacement diagnostic based on PROVE_RCU. Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being used for things that have nothing to do with rcu_dereference(), rename lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third argument that is a string indicating what is suspicious. This third argument is passed in from a new third argument to rcu_lockdep_assert(). Update all calls to rcu_lockdep_assert() to add an informative third argument. Also, add a pair of rcu_lockdep_assert() calls from within rcu_note_context_switch(), one complaining if a context switch occurs in an RCU-bh read-side critical section and another complaining if a context switch occurs in an RCU-sched read-side critical section. These are present only if the PROVE_RCU kernel parameter is enabled. Finally, fix some checkpatch whitespace complaints in lockdep.c. Again, you must enable PROVE_RCU to see these new diagnostics. But you are enabling PROVE_RCU to check out new RCU uses in any case, aren't you? Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29	ptp: fix L2 event message recognition	Richard Cochran
	The IEEE 1588 standard defines two kinds of messages, event and general messages. Event messages require time stamping, and general do not. When using UDP transport, two separate ports are used for the two message types. The BPF designed to recognize event messages incorrectly classifies L2 general messages as event messages. This commit fixes the issue by extending the filter to check the message type field for L2 PTP packets. Event messages are be distinguished from general messages by testing the "general" bit. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Cc: <stable@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>