summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2012-04-30net: allow skb->head to be a page fragmentEric Dumazet
skb->head is currently allocated from kmalloc(). This is convenient but has the drawback the data cannot be converted to a page fragment if needed. We have three spots were it hurts : 1) GRO aggregation When a linear skb must be appended to another skb, GRO uses the frag_list fallback, very inefficient since we keep all struct sk_buff around. So drivers enabling GRO but delivering linear skbs to network stack aren't enabling full GRO power. 2) splice(socket -> pipe). We must copy the linear part to a page fragment. This kind of defeats splice() purpose (zero copy claim) 3) TCP coalescing. Recently introduced, this permits to group several contiguous segments into a single skb. This shortens queue lengths and save kernel memory, and greatly reduce probabilities of TCP collapses. This coalescing doesnt work on linear skbs (or we would need to copy data, this would be too slow) Given all these issues, the following patch introduces the possibility of having skb->head be a fragment in itself. We use a new skb flag, skb->head_frag to carry this information. build_skb() is changed to accept a frag_size argument. Drivers willing to provide a page fragment instead of kmalloc() data will set a non zero value, set to the fragment size. Then, on situations we need to convert the skb head to a frag in itself, we can check if skb->head_frag is set and avoid the copies or various fallbacks we have. This means drivers currently using frags could be updated to avoid the current skb->head allocation and reduce their memory footprint (aka skb truesize). (thats 512 or 1024 bytes saved per skb). This also makes bpf/netfilter faster since the 'first frag' will be part of skb linear part, no need to copy data. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Maciej Żenczykowski <maze@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Matt Carlson <mcarlson@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30forcedeth: add transmit timestamping supportWillem de Bruijn
Insert an skb_tx_timestamp call in both ndo_start_xmit routines Tested to work for the nv_start_xmit_optimized case Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30bnx2x: add transmit timestamping supportWillem de Bruijn
Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eilon Greenstein <eilong@broadcom.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30e1000e: add transmit timestamping supportWillem de Bruijn
Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30e1000: add transmit timestamping supportWillem de Bruijn
Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30Merge branch 'topic/yinghai-hostbridge-cleanup' into nextBjorn Helgaas
2012-04-30PCI: move mutex locking out of pci_dev_reset functionKonrad Rzeszutek Wilk
The intent of git commit 6fbf9e7a90862988c278462d85ce9684605a52b2 "PCI: Introduce __pci_reset_function_locked to be used when holding device_lock." was to have a non-locking function that would call pci_dev_reset function. But it fell short of that by just probing and not actually reseting the device. To make that work we need a way to move the lock around device_lock to not be in pci_dev_reset (as the caller of __pci_reset_function_locked already holds said lock). We do this by renaming pci_dev_reset to __pci_dev_reset and bubbling said mutex out of __pci_dev_reset to pci_dev_reset (a wrapper around __pci_dev_reset). The __pci_reset_function_locked can now call __pci_dev_reset without having to worry about the dead-lock. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-04-30ASoC: s3c2412-i2s: Fix dai registrationHeiko Stübner
As s3c2412-i2s is using the s3c_i2sv2 it should call the more specialised s3c_i2sv2_register_dai instead of simply calling snd_soc_register_dai. Without this call the snd_soc_dai_ops structure isn't initialised correctly. Signed-off-by: Heiko Stuebner <heiko@sntech.de> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2012-04-30ASoC: wm8350: Don't use locally allocated codec structMark Brown
The core allocates the live copies, we shouldn't try to duplicate it and were buggy trying to do so as we were using uninitialised data for the control data. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2012-04-30Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-armLinus Torvalds
Pull ARM fixes from Russell King. * 'fixes' of git://git.linaro.org/people/rmk/linux-arm: ARM: 7406/1: hotplug: copy the affinity mask when forcefully migrating IRQs ARM: 7405/1: kexec: call platform_cpu_kill on the killer rather than the victim ARM: 7403/1: tls: remove covert channel via TPIDRURW ARM: 7401/1: mm: Fix section mismatches ARM: OMAP: fix DMA vs memory ordering ARM: 7390/1: dts: versatile-pb/ab fix MMC IRQs ARM: 7400/1: vfp: clear fpscr length and stride bits on entry to sig handler ARM: 7399/1: vfp: move user vfp state save/restore code out of signal.c ARM: 7398/1: l2x0: only write to debug registers on PL310 ARM: 7397/1: l2x0: only apply workaround for erratum #753970 on PL310 ARM: 7396/1: errata: only handle ARM erratum #326103 on affected cores
2012-04-30Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "This is a set of SAS and SATA fixes; there are one or two longstanding bug fixes, but most of this is regression fixes." * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: [SCSI] libfc: update mfs boundry checking [SCSI] Revert "[SCSI] libsas: fix sas port naming" [SCSI] libsas: fix false positive 'device attached' conditions [SCSI] libsas, libata: fix start of life for a sas ata_port [SCSI] libsas: fix ata_eh clobbering ex_phys via smp_ata_check_ready [SCSI] libsas: unify domain_device sas_rphy lifetimes [SCSI] libsas: fix sas_get_port_device regression [SCSI] libsas: fix sas_find_bcast_phy() in the presence of 'vacant' phys [SCSI] libsas: introduce sas_work to fix sas_drain_work vs sas_queue_work [SCSI] libata: Pass correct DMA device to scsi host [SCSI] scsi_lib: use correct DMA device in __scsi_alloc_queue
2012-04-30efi: Validate UEFI boot variablesMatthew Garrett
A common flaw in UEFI systems is a refusal to POST triggered by a malformed boot variable. Once in this state, machines may only be restored by reflashing their firmware with an external hardware device. While this is obviously a firmware bug, the serious nature of the outcome suggests that operating systems should filter their variable writes in order to prevent a malicious user from rendering the machine unusable. Signed-off-by: Matthew Garrett <mjg@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-30efi: Add new variable attributesMatthew Garrett
More recent versions of the UEFI spec have added new attributes for variables. Add them. Signed-off-by: Matthew Garrett <mjg@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-30regmap: Devices using format_write don't support bulk operationsMark Brown
Set the use_single_rw flag for devices that use format_write() since format_write() doesn't support any form of block operation. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2012-04-30regmap: Converts group operation into single read write operationsAshish Jangam
Some devices does not support bulk read and write operations, for them we have series of single write and read operations. Signed-off-by: Anthony Olech <Anthony.Olech@diasemi.com> Signed-off-by: Ashish Jangam <ashish.jangam@kpitcummins.com> [Fixed coding style, don't check use_single_rw before assign --broonie ] Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2012-04-30regmap: Cache single values read from the chipMark Brown
If we don't have a cached value for a register and we can cache it then when we do a read a value we should add it to the cache to save rereading it later on. Do this for single register reads, for block reads the code would be a little more complex and this covers most practical usage. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2012-04-30PCI: work around Stratus ftServer broken PCIe hierarchyBjorn Helgaas
A PCIe downstream port is a P2P bridge. Its secondary interface is a link that should lead only to device 0 (unless ARI is enabled)[1], so we don't probe for non-zero device numbers. Some Stratus ftServer systems have a PCIe downstream port (02:00.0) that leads to both an upstream port (03:00.0) and a downstream port (03:01.0), and 03:01.0 has important devices below it: [0000:02]-+-00.0-[03-3c]--+-00.0-[04-09]--... \-01.0-[0a-0d]--+-[USB] +-[NIC] +-... Previously, we didn't enumerate device 03:01.0, so USB and the network didn't work. This patch adds a DMI quirk to scan all device numbers, not just 0, below a downstream port. Based on a patch by Prarit Bhargava. [1] PCIe spec r3.0, sec 7.3.1 CC: Myron Stowe <mstowe@redhat.com> CC: Don Dutile <ddutile@redhat.com> CC: James Paradis <james.paradis@stratus.com> CC: Matthew Wilcox <matthew.r.wilcox@intel.com> CC: Jesse Barnes <jbarnes@virtuousgeek.org> CC: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-04-30x86/PCI: merge pcibios_scan_root() and pci_scan_bus_on_node()Yinghai Lu
pcibios_scan_root() and pci_scan_bus_on_node() were almost identical, so this patch merges them. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-04-30x86/PCI: dynamically allocate pci_root_info for native host bridge driversYinghai Lu
This dynamically allocates struct pci_root_info instead of using a static array. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-04-30x86/PCI: embed pci_sysdata into pci_root_info on ACPI pathYinghai Lu
Embed the x86 struct pci_sysdata in the struct pci_root_info so it will be automatically freed in the remove path. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-04-30x86/PCI: embed name into pci_root_info structYinghai Lu
We now keep the pci_root_info struct for the entire lifetime of the host bridge, so just embed the name in the struct rather than allocating it separately. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-04-30x86/PCI: add host bridge resource release for _CRS pathYinghai Lu
1. Allocate pci_root_info instead of using stack. We need to pass around info for release function. 2. Add release_pci_root_info 3. Set x86 host bridge release function to make sure root bridge related resources get freed during root bus removal. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-04-30x86/PCI: refactor get_current_resources()Yinghai Lu
Rename get_current_resources() to probe_pci_root_info. 1. Remove resource list head from pci_root_info 2. Make get_current_resources() not pass resources 3. Rename get_current_resources() to probe_pci_root_info() Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-04-30PCI: add host bridge release supportYinghai Lu
We need a hook to release host bridge resources allocated when creating root bus. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-04-30x86, relocs: Remove an unused variableKusanagi Kouichi
sh_symtab is set but not used. [ hpa: putting this in urgent because of the sheer harmlessness of the patch: it quiets a build warning but does not change any generated code. ] Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Link: http://lkml.kernel.org/r/20120401082932.D5E066FC03D@msa105.auone-net.jp Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@vger.kernel.org>
2012-04-30asm-generic: Use __BITS_PER_LONG in statfs.hH. Peter Anvin
<asm-generic/statfs.h> is exported to userspace, so using BITS_PER_LONG is invalid. We need to use __BITS_PER_LONG instead. This is kernel bugzilla 43165. Reported-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/1335465916-16965-1-git-send-email-hpa@linux.intel.com Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: <stable@vger.kernel.org>
2012-04-30tipc: compress out gratuitous extra carriage returnsPaul Gortmaker
Some of the comment blocks are floating in limbo between two functions, or between blocks of code. Delete the extra line feeds between any comment and its associated following block of code, to be consistent with the majority of the rest of the kernel. Also delete trailing newlines at EOF and fix a couple trivial typos in existing comments. This is a 100% cosmetic change with no runtime impact. We get rid of over 500 lines of non-code, and being blank line deletes, they won't even show up as noise in git blame. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-04-30PCI: add generic device into pci_host_bridge structYinghai Lu
Use that device for pci_root_bus bridge pointer. Use pci_release_bus_bridge_dev() to release allocated pci_host_bridge in remove path. Use root bus bridge pointer to get host bridge pointer instead of searching host bridge list. That leaves the host bridge list unused, so remove it. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-04-30PCI: rename pci_host_bridge() to find_pci_root_bridge()Yinghai Lu
pci_host_bridge() looks like a C++ constructor. Also separate find_pci_root_bus() out. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-04-30x86/PCI: fix memleak with get_current_resources()Yinghai Lu
In pci_scan_acpi_root(), when pci_use_crs is set, get_current_resources() is used to get pci_root_info, and it will allocate name and resource array. Later if pci_create_root_bus() can not create bus (could be already there...) it will only free bus res list, but the name and res array is not freed. Let get_current_resource() take info pointer instead of using local info. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-04-30PCI: move host bridge-related code to host-bridge.cYinghai Lu
Move host bridge-related code from probe.c to a new host-bridge.c. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-04-30nfsd: fix nfs4recover.c printk format warningRandy Dunlap
Fix printk format warnings -- both items are size_t, so use %zu to print them. fs/nfsd/nfs4recover.c:580:3: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'size_t' fs/nfsd/nfs4recover.c:580:3: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'unsigned int' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: linux-nfs@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-30mac80211: fix AP mode EAP tx for VLAN stationsFelix Fietkau
EAP frames for stations in an AP VLAN are sent on the main AP interface to avoid race conditions wrt. moving stations. For that to work properly, sta_info_get_bss must be used instead of sta_info_get when sending EAP packets. Previously this was only done for cooked monitor injected packets, so this patch adds a check for tx->skb->protocol to the same place. Signed-off-by: Felix Fietkau <nbd@openwrt.org> Cc: stable@vger.kernel.org Signed-off-by: John W. Linville <linville@tuxdriver.com>
2012-04-30Merge branch 'merge' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc fixes from Benjamin Herrenschmidt: "Here are a handful more fixes for powerpc. The irq stuff are all regression fixes, and Gavin's patch is a simple compile fix." * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: tty/serial/pmac_zilog: Fix "nobody cared" IRQ message powerpc/pseries: Rivet CONFIG_EEH for pSeries platform powerpc/irqdomain: Fix broken NR_IRQ references powerpc/8xx: Fix NR_IRQ bugs and refactor 8xx interrupt controller
2012-04-30rcu: Add rcutorture test for call_srcu()Lai Jiangshan
Add srcu_torture_deferred_free() for srcu_ops so as to test the new call_srcu(). Rename the original srcu_ops to srcu_sync_ops. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Implement per-domain single-threaded call_srcu() state machineLai Jiangshan
This commit implements an SRCU state machine in support of call_srcu(). The state machine is preemptible, light-weight, and single-threaded, minimizing synchronization overhead. In particular, there is no longer any need for synchronize_srcu() to be guarded by a mutex. Expedited processing is handled, at least in the absence of concurrent grace-period operations on that same srcu_struct structure, by having the synchronize_srcu_expedited() thread take on the role of the workqueue thread for one iteration. There is a reasonable probability that a given SRCU callback will be invoked on the same CPU that registered it, however, there is no guarantee. Concurrent SRCU grace-period primitives can cause callbacks to be executed elsewhere, even in absence of CPU-hotplug operations. Callbacks execute in process context, but under the influence of local_bh_disable(), so it is illegal to sleep in an SRCU callback function. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Use single value to handle expedited SRCU grace periodsLai Jiangshan
The earlier algorithm used an "expedited" flag combined with a "trycount" counter to differentiate between normal and expedited SRCU grace periods. However, the difference can be encoded into a single counter with a cutoff value and different initial values for expedited and normal SRCU grace periods. This commit makes that change. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Conflicts: kernel/srcu.c
2012-04-30rcu: Improve srcu_readers_active_idx()'s cache localityLai Jiangshan
Expand the calls to srcu_readers_active_idx() from srcu_readers_active() inline. This change improves cache locality by interating over the CPUs once rather than twice. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Remove unused srcu_barrier()Lai Jiangshan
The old srcu_barrier() macro is now unused. This commit removes it so that it may be used for the SRCU flavor of rcu_barrier(), which will in turn be needed to allow the upcoming call_srcu() to be used from within modules. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Implement a variant of Peter's SRCU algorithmLai Jiangshan
This commit implements a variant of Peter's algorithm, which may be found at https://lkml.org/lkml/2012/2/1/119. o Make the checking lock-free to enable parallel checking. Parallel checking is required when (1) the original checking task is preempted for a long time, (2) sychronize_srcu_expedited() starts during an ongoing SRCU grace period, or (3) we wish to avoid acquiring a lock. o Since the checking is lock-free, we avoid a mutex in state machine for call_srcu(). o Remove the SRCU_REF_MASK and remove the coupling with the flipping. This might allow us to remove the preempt_disable() in future versions, though such removal will need great care because it rescinds the one-old-reader-per-CPU guarantee. o Remove a smp_mb(), simplify the comments and make the smp_mb() pairs more intuitive. Inspired-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Improve SRCU's wait_idx() commentsLai Jiangshan
The safety of SRCU is provided byy wait_idx() rather than flipping. The flipping actually prevents starvation. This commit therefore updates the comments to more accurately and precisely describe what is going on. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Flip ->completed only once per SRCU grace periodLai Jiangshan
This is an optimization of the SRCU grace period. To guard against preempted readers with old values of the counter, it suffices to scan the old counters once more, then flip ->completed only one time. The reason this works is that the old readers must have incremented the old set of counters (if they have not yet incremented, then their critical section starts after this grace period, so they may be safely ignored). This commit therefore optimizes the second flip out in favor of a simple rescan. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Increment upper bit only for srcu_read_lock()Lai Jiangshan
The purpose of the upper bit of SRCU's per-CPU counters is to guarantee that no reasonable series of srcu_read_lock() and srcu_read_unlock() operations can return the value of the counter to its original value. This guarantee is require only after the index has been switched to the other set of counters, so at most one srcu_read_lock() can affect a given CPU's counter. The number of srcu_read_unlock() operations on a given counter is limited to the number of tasks in the system, which given the Linux kernel's current structure is limited to far less than 2^30 on 32-bit systems and far less than 2^62 on 64-bit systems. (Something about a limited number of bytes in the kernel's address space.) Therefore, if srcu_read_lock() increments the upper bits, then srcu_read_unlock() need not do so. In this case, an srcu_read_lock() and an srcu_read_unlock() will flip the lower bit of the upper field of the counter. An unreasonably large additional number of srcu_read_unlock() operations would be required to return the counter to its initial value, thus preserving the guarantee. This commit takes this approach, which further allows it to shrink the size of the upper field to one bit, making the number of srcu_read_unlock() operations required to return the counter to its initial value even more unreasonable than before. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Remove fast check path from __synchronize_srcu()Lai Jiangshan
The fastpath in __synchronize_srcu() is designed to handle cases where there are a large number of concurrent calls for the same srcu_struct structure. However, the Linux kernel currently does not use SRCU in this manner, so remove the fastpath checks for simplicity. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Direct algorithmic SRCU implementationPaul E. McKenney
The current implementation of synchronize_srcu_expedited() can cause severe OS jitter due to its use of synchronize_sched(), which in turn invokes try_stop_cpus(), which causes each CPU to be sent an IPI. This can result in severe performance degradation for real-time workloads and especially for short-interation-length HPC workloads. Furthermore, because only one instance of try_stop_cpus() can be making forward progress at a given time, only one instance of synchronize_srcu_expedited() can make forward progress at a time, even if they are all operating on distinct srcu_struct structures. This commit, inspired by an earlier implementation by Peter Zijlstra (https://lkml.org/lkml/2012/1/31/211) and by further offline discussions, takes a strictly algorithmic bits-in-memory approach. This has the disadvantage of requiring one explicit memory-barrier instruction in each of srcu_read_lock() and srcu_read_unlock(), but on the other hand completely dispenses with OS jitter and furthermore allows SRCU to be used freely by CPUs that RCU believes to be idle or offline. The update-side implementation handles the single read-side memory barrier by rechecking the per-CPU counters after summing them and by running through the update-side state machine twice. This implementation has passed moderate rcutorture testing on both x86 and Power. Also updated to use this_cpu_ptr() instead of per_cpu_ptr(), as suggested by Peter Zijlstra. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2012-04-30rcu: Introduce rcutorture testing for rcu_barrier()Paul E. McKenney
Although rcutorture does invoke rcu_barrier() and friends, it cannot really be called a torture test given that it invokes them only once at the end of the test. This commit therefore introduces heavy-duty rcutorture testing for rcu_barrier(), which may be carried out concurrently with normal rcutorture testing. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30tcp: fix infinite cwnd in tcp_complete_cwr()Yuchung Cheng
When the cwnd reduction is done, ssthresh may be infinite if TCP enters CWR via ECN or F-RTO. If cwnd is not undone, i.e., undo_marker is set, tcp_complete_cwr() falsely set cwnd to the infinite ssthresh value. The correct operation is to keep cwnd intact because it has been updated in ECN or F-RTO. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30bpf jit: Let the powerpc jit handle negative offsetsJan Seiffert
Now the helper function from filter.c for negative offsets is exported, it can be used it in the jit to handle negative offsets. First modify the asm load helper functions to handle: - know positive offsets - know negative offsets - any offset then the compiler can be modified to explicitly use these helper when appropriate. This fixes the case of a negative X register and allows to lift the restriction that bpf programs with negative offsets can't be jited. Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jan Seiffert <kaffeemonster@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30net: fix sk_sockets_allocated_read_positiveEric Dumazet
Denys Fedoryshchenko reported frequent crashes on a proxy server and kindly provided a lockdep report that explains it all : [ 762.903868] [ 762.903880] ================================= [ 762.903890] [ INFO: inconsistent lock state ] [ 762.903903] 3.3.4-build-0061 #8 Not tainted [ 762.904133] --------------------------------- [ 762.904344] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. [ 762.904542] squid/1603 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 762.904542] (key#3){+.?...}, at: [<c0232cc4>] __percpu_counter_sum+0xd/0x58 [ 762.904542] {IN-SOFTIRQ-W} state was registered at: [ 762.904542] [<c0158b84>] __lock_acquire+0x284/0xc26 [ 762.904542] [<c01598e8>] lock_acquire+0x71/0x85 [ 762.904542] [<c0349765>] _raw_spin_lock+0x33/0x40 [ 762.904542] [<c0232c93>] __percpu_counter_add+0x58/0x7c [ 762.904542] [<c02cfde1>] sk_clone_lock+0x1e5/0x200 [ 762.904542] [<c0303ee4>] inet_csk_clone_lock+0xe/0x78 [ 762.904542] [<c0315778>] tcp_create_openreq_child+0x1b/0x404 [ 762.904542] [<c031339c>] tcp_v4_syn_recv_sock+0x32/0x1c1 [ 762.904542] [<c031615a>] tcp_check_req+0x1fd/0x2d7 [ 762.904542] [<c0313f77>] tcp_v4_do_rcv+0xab/0x194 [ 762.904542] [<c03153bb>] tcp_v4_rcv+0x3b3/0x5cc [ 762.904542] [<c02fc0c4>] ip_local_deliver_finish+0x13a/0x1e9 [ 762.904542] [<c02fc539>] NF_HOOK.clone.11+0x46/0x4d [ 762.904542] [<c02fc652>] ip_local_deliver+0x41/0x45 [ 762.904542] [<c02fc4d1>] ip_rcv_finish+0x31a/0x33c [ 762.904542] [<c02fc539>] NF_HOOK.clone.11+0x46/0x4d [ 762.904542] [<c02fc857>] ip_rcv+0x201/0x23e [ 762.904542] [<c02daa3a>] __netif_receive_skb+0x319/0x368 [ 762.904542] [<c02dac07>] netif_receive_skb+0x4e/0x7d [ 762.904542] [<c02dacf6>] napi_skb_finish+0x1e/0x34 [ 762.904542] [<c02db122>] napi_gro_receive+0x20/0x24 [ 762.904542] [<f85d1743>] e1000_receive_skb+0x3f/0x45 [e1000e] [ 762.904542] [<f85d3464>] e1000_clean_rx_irq+0x1f9/0x284 [e1000e] [ 762.904542] [<f85d3926>] e1000_clean+0x62/0x1f4 [e1000e] [ 762.904542] [<c02db228>] net_rx_action+0x90/0x160 [ 762.904542] [<c012a445>] __do_softirq+0x7b/0x118 [ 762.904542] irq event stamp: 156915469 [ 762.904542] hardirqs last enabled at (156915469): [<c019b4f4>] __slab_alloc.clone.58.clone.63+0xc4/0x2de [ 762.904542] hardirqs last disabled at (156915468): [<c019b452>] __slab_alloc.clone.58.clone.63+0x22/0x2de [ 762.904542] softirqs last enabled at (156915466): [<c02ce677>] lock_sock_nested+0x64/0x6c [ 762.904542] softirqs last disabled at (156915464): [<c0349914>] _raw_spin_lock_bh+0xe/0x45 [ 762.904542] [ 762.904542] other info that might help us debug this: [ 762.904542] Possible unsafe locking scenario: [ 762.904542] [ 762.904542] CPU0 [ 762.904542] ---- [ 762.904542] lock(key#3); [ 762.904542] <Interrupt> [ 762.904542] lock(key#3); [ 762.904542] [ 762.904542] *** DEADLOCK *** [ 762.904542] [ 762.904542] 1 lock held by squid/1603: [ 762.904542] #0: (sk_lock-AF_INET){+.+.+.}, at: [<c03055c0>] lock_sock+0xa/0xc [ 762.904542] [ 762.904542] stack backtrace: [ 762.904542] Pid: 1603, comm: squid Not tainted 3.3.4-build-0061 #8 [ 762.904542] Call Trace: [ 762.904542] [<c0347b73>] ? printk+0x18/0x1d [ 762.904542] [<c015873a>] valid_state+0x1f6/0x201 [ 762.904542] [<c0158816>] mark_lock+0xd1/0x1bb [ 762.904542] [<c015876b>] ? mark_lock+0x26/0x1bb [ 762.904542] [<c015805d>] ? check_usage_forwards+0x77/0x77 [ 762.904542] [<c0158bf8>] __lock_acquire+0x2f8/0xc26 [ 762.904542] [<c0159b8e>] ? mark_held_locks+0x5d/0x7b [ 762.904542] [<c0159cf6>] ? trace_hardirqs_on+0xb/0xd [ 762.904542] [<c0158dd4>] ? __lock_acquire+0x4d4/0xc26 [ 762.904542] [<c01598e8>] lock_acquire+0x71/0x85 [ 762.904542] [<c0232cc4>] ? __percpu_counter_sum+0xd/0x58 [ 762.904542] [<c0349765>] _raw_spin_lock+0x33/0x40 [ 762.904542] [<c0232cc4>] ? __percpu_counter_sum+0xd/0x58 [ 762.904542] [<c0232cc4>] __percpu_counter_sum+0xd/0x58 [ 762.904542] [<c02cebc4>] __sk_mem_schedule+0xdd/0x1c7 [ 762.904542] [<c02d178d>] ? __alloc_skb+0x76/0x100 [ 762.904542] [<c0305e8e>] sk_wmem_schedule+0x21/0x2d [ 762.904542] [<c0306370>] sk_stream_alloc_skb+0x42/0xaa [ 762.904542] [<c0306567>] tcp_sendmsg+0x18f/0x68b [ 762.904542] [<c031f3dc>] ? ip_fast_csum+0x30/0x30 [ 762.904542] [<c0320193>] inet_sendmsg+0x53/0x5a [ 762.904542] [<c02cb633>] sock_aio_write+0xd2/0xda [ 762.904542] [<c015876b>] ? mark_lock+0x26/0x1bb [ 762.904542] [<c01a1017>] do_sync_write+0x9f/0xd9 [ 762.904542] [<c01a2111>] ? file_free_rcu+0x2f/0x2f [ 762.904542] [<c01a17a1>] vfs_write+0x8f/0xab [ 762.904542] [<c01a284d>] ? fget_light+0x75/0x7c [ 762.904542] [<c01a1900>] sys_write+0x3d/0x5e [ 762.904542] [<c0349ec9>] syscall_call+0x7/0xb [ 762.904542] [<c0340000>] ? rp_sidt+0x41/0x83 Bug is that sk_sockets_allocated_read_positive() calls percpu_counter_sum_positive() without BH being disabled. This bug was added in commit 180d8cd942ce33 (foundations of per-cgroup memory pressure controlling.), since previous code was using percpu_counter_read_positive() which is IRQ safe. In __sk_mem_schedule() we dont need the precise count of allocated sockets and can revert to previous behavior. Reported-by: Denys Fedoryshchenko <denys@visp.net.lb> Sined-off-by: Eric Dumazet <edumazet@google.com> Cc: Glauber Costa <glommer@parallels.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30bridge: Fix fatal typo in setup of multicast_querier_expiredHerbert Xu
Unfortunately it seems that I didn't properly test the case of an expired external querier in the recent multicast bridge series. The setup of the timer in that case is completely broken and leads to a NULL-pointer dereference. This patch fixes it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>