summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2013-04-30powerpc: Update tlbie/tlbiel as per ISA docAneesh Kumar K.V
Encode the actual page correctly in tlbie/tlbiel. This make sure we handle multiple page size segment correctly. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Print page size info during bootAneesh Kumar K.V
This gives hint about different base and actual page size combination supported by the platform. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: print both base and actual page size on hash failureAneesh Kumar K.V
Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Fix hpte_decode to use the correct decoding for page sizesAneesh Kumar K.V
As per ISA doc, we encode base and actual page size in the LP bits of PTE. The number of bit used to encode the page sizes depend on actual page size. ISA doc lists this as PTE LP actual page size rrrr rrrz >=8KB rrrr rrzz >=16KB rrrr rzzz >=32KB rrrr zzzz >=64KB rrrz zzzz >=128KB rrzz zzzz >=256KB rzzz zzzz >=512KB zzzz zzzz >=1MB ISA doc also says "The values of the “z” bits used to specify each size, along with all possible values of “r” bits in the LP field, must result in LP values distinct from other LP values for other sizes." based on the above update hpte_decode to use the correct decoding for LP bits. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Decode the pte-lp-encoding bits correctly.Aneesh Kumar K.V
We look at both the segment base page size and actual page size and store the pte-lp-encodings in an array per base page size. We also update all relevant functions to take actual page size argument so that we can use the correct PTE LP encoding in HPTE. This should also get the basic Multiple Page Size per Segment (MPSS) support. This is needed to enable THP on ppc64. [Fixed PR KVM build --BenH] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Use encode avpn where we need only avpn valuesAneesh Kumar K.V
In all these cases we are doing something similar to HPTE_V_COMPARE(hpte_v, want_v) which ignores the HPTE_V_LARGE bit With MPSS support we would need actual page size to set HPTE_V_LARGE bit and that won't be available in most of these cases. Since we are ignoring HPTE_V_LARGE bit, use the avpn value instead. There should not be any change in behaviour after this patch. Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Reduce PTE table memory wastageAneesh Kumar K.V
We allocate one page for the last level of linux page table. With THP and large page size of 16MB, that would mean we are wasting large part of that page. To map 16MB area, we only need a PTE space of 2K with 64K page size. This patch reduce the space wastage by sharing the page allocated for the last level of linux page table with multiple pmd entries. We call these smaller chunks PTE page fragments and allocated page, PTE page. In order to support systems which doesn't have 64K HPTE support, we also add another 2K to PTE page fragment. The second half of the PTE fragments is used for storing slot and secondary bit information of an HPTE. With this we now have a 4K PTE fragment. We use a simple approach to share the PTE page. On allocation, we bump the PTE page refcount to 16 and share the PTE page with the next 16 pte alloc request. This should help in the node locality of the PTE page fragment, assuming that the immediate pte alloc request will mostly come from the same NUMA node. We don't try to reuse the freed PTE page fragment. Hence we could be waisting some space. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Move the pte free routines from common headerAneesh Kumar K.V
Acked-by: Paul Mackerras <paulus@samba.org> This patch moves the common code to 32/64 bit headers and also duplicate 4K_PAGES and 64K_PAGES section. We will later change the 64 bit 64K_PAGES version to support smaller PTE fragments. The patch doesn't introduce any functional changes. Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Reduce the PTE_INDEX_SIZEAneesh Kumar K.V
This make one PMD cover 16MB range. That helps in easier implementation of THP on power. THP core code make use of one pmd entry to track the hugepage and the range mapped by a single pmd entry should be equal to the hugepage size supported by the hardware. This also switch PGD to cover 16GB. That is needed so that we can simplify the hugetlb page walking code so that we have same pte format for explicit hugepage and THP hugepage. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Switch 16GB and 16MB explicit hugepages to a different page table ↵Aneesh Kumar K.V
format We will be switching PMD_SHIFT to 24 bits to facilitate THP impmenetation. With PMD_SHIFT set to 24, we now have 16MB huge pages allocated at PGD level. That means with 32 bit process we cannot allocate normal pages at all, because we cover the entire address space with one pgd entry. Fix this by switching to a new page table format for hugepages. With the new page table format for 16GB and 16MB hugepages we won't allocate hugepage directory. Instead we encode the PTE information directly at the directory level. This forces 16MB hugepage at PMD level. This will also make the page take walk much simpler later when we add the THP support. With the new table format we have 4 cases for pgds and pmds: (1) invalid (all zeroes) (2) pointer to next table, as normal; bottom 6 bits == 0 (3) leaf pte for huge page, bottom two bits != 00 (4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: New hugepage directory formatAneesh Kumar K.V
Change the hugepage directory format so that we can have leaf ptes directly at page directory avoiding the allocation of hugepage directory. With the new table format we have 3 cases for pgds and pmds: (1) invalid (all zeroes) (2) pointer to next table, as normal; bottom 6 bits == 0 (4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table Instead of storing shift value in hugepd pointer we use mmu_psize_def index so that we can fit all the supported hugepage size in 4 bits Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Don't truncate pgd_index wronglyAneesh Kumar K.V
With PGD_INDEX_SIZE set to 12 the existing macro doesn't work. Fix it to use PTRS_PER_PGD The idea originally was to have one more bit in the result of pgd_index() than PGD_INDEX_SIZE, so that if one had an address corresponding to the last PGD entry, and then incremented that address by PGD_SIZE, and took pgd_index() of that, you wouldn't end up with zero. The commit that introduced that dates back to 2002, and the code that was sensitive to that edge case has long since been refactored (several times), so there is no need for it these days. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Don't hard code the size of pte pageAneesh Kumar K.V
USE PTRS_PER_PTE to indicate the size of pte page. To support THP, later patches will be changing PTRS_PER_PTE value. Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Save DAR and DSISR in pt_regs on MCEAneesh Kumar K.V
We were not saving DAR and DSISR on MCE. Save then and also print the values along with exception details in xmon. Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Use signed formatting when printing errorAneesh Kumar K.V
PAPR defines these errors as negative values. So print them accordingly for easy debugging. Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc/pseries: Correct builds break when CONFIG_SMP not definedNathan Fontenot
Correct build failure for powerpc/pseries builds with CONFIG_SMP not defined. The function cpu_sibling_mask has no meaning (or definition) when CONFIG_SMP is not defined. Additionally, the updating of NUMA affinity for a CPU in a UP system doesn't really make sense. This patch ifdef's out the code making the affinity updates for PRRN events to fix the following build break. arch/powerpc/mm/numa.c: In function ‘stage_topology_update’: arch/powerpc/mm/numa.c:1535: error: implicit declaration of function ‘cpu_sibling_mask’ arch/powerpc/mm/numa.c:1535: warning: passing argument 3 of ‘cpumask_or’ makes pointer from integer without a cast make[1]: *** [arch/powerpc/mm/numa.o] Error 1 Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc/booke: Remove obsolete macro FINISH_EXCEPTIONKevin Hao
This is stale and not used by anyone now. Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc/rtas_flash: Fix bad memory accessVasant Hegde
We use kmem_cache_alloc() to allocate memory to hold the new firmware which will be flashed. kmem_cache_alloc() calls rtas_block_ctor() to set memory to NULL. But these constructor is called only for newly allocated slabs. If we run below command multiple time without rebooting, allocator may allocate memory from the area which was free'd by kmem_cache_free and it will not call constructor. In this situation we may hit kernel oops. dd if=<fw image> of=/proc/ppc64/rtas/firmware_flash bs=4096 oops message: ------------- [ 1602.399755] Oops: Kernel access of bad area, sig: 11 [#1] [ 1602.399772] SMP NR_CPUS=1024 NUMA pSeries [ 1602.399779] Modules linked in: rtas_flash nfsd lockd auth_rpcgss nfs_acl sunrpc fuse loop dm_mod sg ipv6 ses enclosure ehea ehci_pci ohci_hcd ehci_hcd usbcore sd_mod usb_common crc_t10dif scsi_dh_alua scsi_dh_emc scsi_dh_hp_sw scsi_dh_rdac scsi_dh ipr libata scsi_mod [ 1602.399817] NIP: d00000000a170b9c LR: d00000000a170b64 CTR: c00000000079cd58 [ 1602.399823] REGS: c0000003b9937930 TRAP: 0300 Not tainted (3.9.0-rc4-0.27-ppc64) [ 1602.399828] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 22000428 XER: 20000000 [ 1602.399841] SOFTE: 1 [ 1602.399844] CFAR: c000000000005f24 [ 1602.399848] DAR: 8c2625a820631fef, DSISR: 40000000 [ 1602.399852] TASK = c0000003b4520760[3655] 'dd' THREAD: c0000003b9934000 CPU: 3 GPR00: 8c2625a820631fe7 c0000003b9937bb0 d00000000a179f28 d00000000a171f08 GPR04: 0000000010040000 0000000000001000 c0000003b9937df0 c0000003b5fb2080 GPR08: c0000003b58f7200 d00000000a179f28 c0000003b40058d4 c00000000079cd58 GPR12: d00000000a171450 c000000007f40900 0000000000000005 0000000010178d20 GPR16: 00000000100cb9d8 000000000000001d 0000000000000000 000000001003ffff GPR20: 0000000000000001 0000000000000000 00003fffa0b50d30 000000001001f010 GPR24: 0000000010020888 0000000010040000 d00000000a171f08 d00000000a172808 GPR28: 0000000000001000 0000000010040000 c0000003b4005880 8c2625a820631fe7 [ 1602.399924] NIP [d00000000a170b9c] .rtas_flash_write+0x7c/0x1e8 [rtas_flash] [ 1602.399930] LR [d00000000a170b64] .rtas_flash_write+0x44/0x1e8 [rtas_flash] [ 1602.399934] Call Trace: [ 1602.399939] [c0000003b9937bb0] [d00000000a170b64] .rtas_flash_write+0x44/0x1e8 [rtas_flash] (unreliable) [ 1602.399948] [c0000003b9937c60] [c000000000282830] .proc_reg_write+0x90/0xe0 [ 1602.399955] [c0000003b9937ce0] [c0000000001ff374] .vfs_write+0x114/0x238 [ 1602.399961] [c0000003b9937d80] [c0000000001ff5d8] .SyS_write+0x70/0xe8 [ 1602.399968] [c0000003b9937e30] [c000000000009cdc] syscall_exit+0x0/0xa0 [ 1602.399973] Instruction dump: [ 1602.399977] eb698010 801b0028 2f80dcd6 419e00a4 2fbc0000 419e009c ebfb0030 2fbf0000 [ 1602.399989] 409e0010 480000d8 60000000 7c1f0378 <e81f0008> 2fa00000 409efff4 e81f0000 [ 1602.400012] ---[ end trace b4136d115dc31dac ]--- [ 1602.402178] [ 1602.402185] Sending IPI to other CPUs [ 1602.403329] IPI complete This patch uses kmem_cache_zalloc() instead of kmem_cache_alloc() to allocate memory, which makes sure memory is set to 0 before using. Also removes rtas_block_ctor(), which is no longer required. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Fix build failure after merge of the cgroup treeStephen Rothwell
After merging the cgroup tree, today's linux-next build (powerpc ppc64_defconfig) failed like this: arch/powerpc/mm/numa.c: In function 'arch_update_cpu_topology': arch/powerpc/mm/numa.c:1465:2: error: implicit declaration of function 'kzalloc' [-Werror=implicit-function-declaration] arch/powerpc/mm/numa.c:1465:10: error: assignment makes pointer from integer without a cast [-Werror] arch/powerpc/mm/numa.c:1497:2: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration] Caused by commit 30c05350c39d ("powerpc/pseries: Use stop machine to update cpu maps") from the powerpc tree interacting with (probably) commit ff794dea52ea ("cpuset: remove include of cgroup.h from cpuset.h") from the cgroup tree. Removing includes from header files is fraught with danger ... The former should have added an include of linux/slab.h to arch/powerpc/mm/numa.c. I have added the following merge fix patch for today (but it should be applied to the powerpc tree ASAP). From: Stephen Rothwell <sfr@canb.auug.org.au> Date: Mon, 29 Apr 2013 14:01:44 +1000 Subject: [PATCH] powerpc: numa.c: using kzalloc/kfree requires including slab.h fixes these build errors: arch/powerpc/mm/numa.c: In function 'arch_update_cpu_topology': arch/powerpc/mm/numa.c:1465:2: error: implicit declaration of function 'kzalloc' [-Werror=implicit-function-declaration] arch/powerpc/mm/numa.c:1465:10: error: assignment makes pointer from integer without a cast [-Werror] arch/powerpc/mm/numa.c:1497:2: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration] Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Fix usage of setup_pci_atmu()Michael Neuling
Linux next is currently failing to compile mpc85xx_defconfig with: arch/powerpc/sysdev/fsl_pci.c:944:2: error: too many arguments to function 'setup_pci_atmu' This is caused by (from Kumar's next branch): commit 34642bbb3d12121333efcf4ea7dfe66685e403a1 Author: Kumar Gala <galak@kernel.crashing.org> powerpc/fsl-pci: Keep PCI SoC controller registers in pci_controller Which changed definition of setup_pci_atmu() but didn't update one of the callers. Below fixes this. Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Kim Phillips <kim.phillips@freescale.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30MD: ignore discard request for hard disks of hybid raid1/raid10 arrayShaohua Li
In SSD/hard disk hybid storage, discard request should be ignored for hard disk. We used to be doing this way, but the unplug path forgets it. This is suitable for stable tree since v3.6. Cc: stable@vger.kernel.org Reported-and-tested-by: Markus <M4rkusXXL@web.de> Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-30md: bad block list should default to disabled.NeilBrown
Maintenance of a bad-block-list currently defaults to 'enabled' and is then disabled when it cannot be supported. This is backwards and causes problem for dm-raid which didn't know to disable it. So fix the defaults, and only enabled for v1.x metadata which explicitly has bad blocks enabled. The problem with dm-raid has been present since badblock support was added in v3.1, so this patch is suitable for any -stable from 3.1 onwards. Cc: stable@vger.kernel.org (3.1+) Reported-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-30md: raid1/raid10 md devices leak memory when stoppingHirokazu Takahashi
Hi. Raid1 and raid10 devices leak memory every time they stop. This is a patch for linux-3.9.0-rc7 to fix this problem. Thanks, Hirokazu Takahashi. Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-30unix/stream: fix peeking with an offset larger than data in queueBenjamin Poirier
Currently, peeking on a unix stream socket with an offset larger than len of the data in the sk receive queue returns immediately with bogus data. This patch fixes this so that the behavior is the same as peeking with no offset on an empty queue: the caller blocks. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-30unix/dgram: fix peeking with an offset larger than data in queueBenjamin Poirier
Currently, peeking on a unix datagram socket with an offset larger than len of the data in the sk receive queue returns immediately with bogus data. That's because *off is not reset between each skb_queue_walk(). This patch fixes this so that the behavior is the same as peeking with no offset on an empty queue: the caller blocks. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-30unix/dgram: peek beyond 0-sized skbsBenjamin Poirier
"77c1090 net: fix infinite loop in __skb_recv_datagram()" (v3.8) introduced a regression: After that commit, recv can no longer peek beyond a 0-sized skb in the queue. __skb_recv_datagram() instead stops at the first skb with len == 0 and results in the system call failing with -EFAULT via skb_copy_datagram_iovec(). When peeking at an offset with 0-sized skb(s), each one of those is received only once, in sequence. The offset starts moving forward again after receiving datagrams with len > 0. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-30openvswitch: Remove unneeded ovs_netdev_get_ifindex()Thomas Graf
The only user is get_dpifindex(), no need to redirect via the port operations. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-30net: Use consume_skb() to free gso segmented skbSridhar Samudrala
Use consume_skb() to free the original skb that is successfully transmitted as gso segmented skbs so that it is not treated as a drop due to an error. Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-30Merge branch 'for-davem' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next John W. Linville says: ==================== A few more stragglers intended for 3.10... For the Bluetooth bits, Gustavo says: "A few more patches intended for 3.10, the most important one is the support in btusb for fw loading for the Intel Bluetooth device. Other than that we have only fixes and clean ups." For the iwlwifi bits, Johannes says: "Here are a few more changes for the 3.10 stream, some bugfixes, adjustments to some powersave parameters and a new device ID." For the NFC bits, Samuel says: "This pull request includes Marcel's Kconfig dependency fix on top of the LLCP code move to net/nfc." On top of that...Yogesh Ashok Powar provides a few PCI-related mwifiex updates, Hauke Mehrtens provides a small ssb feature for spurious tone avoidance on a specific chip, and Larry Finger provides a small rtlwifi fix related to avoiding false detection of AP loss. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-29Merge branch 'akpm' (incoming from Andrew)Linus Torvalds
Merge second batch of fixes from Andrew Morton: - various misc bits - some printk updates - a new "SRAM" driver. - MAINTAINERS updates - the backlight driver queue - checkpatch updates - a few init/ changes - a huge number of drivers/rtc changes - fatfs updates - some lib/idr.c work - some renaming of the random driver interfaces * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (285 commits) net: rename random32 to prandom net/core: remove duplicate statements by do-while loop net/core: rename random32() to prandom_u32() net/netfilter: rename random32() to prandom_u32() net/sched: rename random32() to prandom_u32() net/sunrpc: rename random32() to prandom_u32() scsi: rename random32() to prandom_u32() lguest: rename random32() to prandom_u32() uwb: rename random32() to prandom_u32() video/uvesafb: rename random32() to prandom_u32() mmc: rename random32() to prandom_u32() drbd: rename random32() to prandom_u32() kernel/: rename random32() to prandom_u32() mm/: rename random32() to prandom_u32() lib/: rename random32() to prandom_u32() x86: rename random32() to prandom_u32() x86: pageattr-test: remove srandom32 call uuid: use prandom_bytes() raid6test: use prandom_bytes() sctp: convert sctp_assoc_set_id() to use idr_alloc_cyclic() ...
2013-04-29Merge branch 'for-3.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Fixes and a lot of cleanups. Locking cleanup is finally complete. cgroup_mutex is no longer exposed to individual controlelrs which used to cause nasty deadlock issues. Li fixed and cleaned up quite a bit including long standing ones like racy cgroup_path(). - device cgroup now supports proper hierarchy thanks to Aristeu. - perf_event cgroup now supports proper hierarchy. - A new mount option "__DEVEL__sane_behavior" is added. As indicated by the name, this option is to be used for development only at this point and generates a warning message when used. Unfortunately, cgroup interface currently has too many brekages and inconsistencies to implement a consistent and unified hierarchy on top. The new flag is used to collect the behavior changes which are necessary to implement consistent unified hierarchy. It's likely that this flag won't be used verbatim when it becomes ready but will be enabled implicitly along with unified hierarchy. The option currently disables some of broken behaviors in cgroup core and also .use_hierarchy switch in memcg (will be routed through -mm), which can be used to make very unusual hierarchy where nesting is partially honored. It will also be used to implement hierarchy support for blk-throttle which would be impossible otherwise without introducing a full separate set of control knobs. This is essentially versioning of interface which isn't very nice but at this point I can't see any other options which would allow keeping the interface the same while moving towards hierarchy behavior which is at least somewhat sane. The planned unified hierarchy is likely to require some level of adaptation from userland anyway, so I think it'd be best to take the chance and update the interface such that it's supportable in the long term. Maintaining the existing interface does complicate cgroup core but shouldn't put too much strain on individual controllers and I think it'd be manageable for the foreseeable future. Maybe we'll be able to drop it in a decade. Fix up conflicts (including a semantic one adding a new #include to ppc that was uncovered by header the file changes) as per Tejun. * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits) cpuset: fix compile warning when CONFIG_SMP=n cpuset: fix cpu hotplug vs rebuild_sched_domains() race cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn() cgroup: restore the call to eventfd->poll() cgroup: fix use-after-free when umounting cgroupfs cgroup: fix broken file xattrs devcg: remove parent_cgroup. memcg: force use_hierarchy if sane_behavior cgroup: remove cgrp->top_cgroup cgroup: introduce sane_behavior mount option move cgroupfs_root to include/linux/cgroup.h cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix cgroup: make cgroup_path() not print double slashes Revert "cgroup: remove bind() method from cgroup_subsys." perf: make perf_event cgroup hierarchical cgroup: implement cgroup_is_descendant() cgroup: make sure parent won't be destroyed before its children cgroup: remove bind() method from cgroup_subsys. devcg: remove broken_hierarchy tag cgroup: remove cgroup_lock_is_held() ...
2013-04-29Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: "A lot of activities on workqueue side this time. The changes achieve the followings. - WQ_UNBOUND workqueues - the workqueues which are per-cpu - are updated to be able to interface with multiple backend worker pools. This involved a lot of churning but the end result seems actually neater as unbound workqueues are now a lot closer to per-cpu ones. - The ability to interface with multiple backend worker pools are used to implement unbound workqueues with custom attributes. Currently the supported attributes are the nice level and CPU affinity. It may be expanded to include cgroup association in future. The attributes can be specified either by calling apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if the workqueue in question is exported through sysfs. The backend worker pools are keyed by the actual attributes and shared by any workqueues which share the same attributes. When attributes of a workqueue are changed, the workqueue binds to the worker pool with the specified attributes while leaving the work items which are already executing in its previous worker pools alone. This allows converting custom worker pool implementations which want worker attribute tuning to use workqueues. The writeback pool is already converted in block tree and there are a couple others are likely to follow including btrfs io workers. - WQ_UNBOUND's ability to bind to multiple worker pools is also used to make it NUMA-aware. Because there's no association between work item issuer and the specific worker assigned to execute it, before this change, using unbound workqueue led to unnecessary cross-node bouncing and it couldn't be helped by autonuma as it requires tasks to have implicit node affinity and workers are assigned randomly. After these changes, an unbound workqueue now binds to multiple NUMA-affine worker pools so that queued work items are executed in the same node. This is turned on by default but can be disabled system-wide or for individual workqueues. Crypto was requesting NUMA affinity as encrypting data across different nodes can contribute noticeable overhead and doing it per-cpu was too limiting for certain cases and IO throughput could be bottlenecked by one CPU being fully occupied while others have idle cycles. While the new features required a lot of changes including restructuring locking, it didn't complicate the execution paths much. The unbound workqueue handling is now closer to per-cpu ones and the new features are implemented by simply associating a workqueue with different sets of backend worker pools without changing queue, execution or flush paths. As such, even though the amount of change is very high, I feel relatively safe in that it isn't likely to cause subtle issues with basic correctness of work item execution and handling. If something is wrong, it's likely to show up as being associated with worker pools with the wrong attributes or OOPS while workqueue attributes are being changed or during CPU hotplug. While this creates more backend worker pools, it doesn't add too many more workers unless, of course, there are many workqueues with unique combinations of attributes. Assuming everything else is the same, NUMA awareness costs an extra worker pool per NUMA node with online CPUs. There are also a couple things which are being routed outside the workqueue tree. - block tree pulled in workqueue for-3.10 so that writeback worker pool can be converted to unbound workqueue with sysfs control exposed. This simplifies the code, makes writeback workers NUMA-aware and allows tuning nice level and CPU affinity via sysfs. - The conversion to workqueue means that there's no 1:1 association between a specific worker, which makes writeback folks unhappy as they want to be able to tell which filesystem caused a problem from backtrace on systems with many filesystems mounted. This is resolved by allowing work items to set debug info string which is printed when the task is dumped. As this change involves unifying implementations of dump_stack() and friends in arch codes, it's being routed through Andrew's -mm tree." * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits) workqueue: use kmem_cache_free() instead of kfree() workqueue: avoid false negative WARN_ON() in destroy_workqueue() workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity workqueue: implement NUMA affinity for unbound workqueues workqueue: introduce put_pwq_unlocked() workqueue: introduce numa_pwq_tbl_install() workqueue: use NUMA-aware allocation for pool_workqueues workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq() workqueue: map an unbound workqueues to multiple per-node pool_workqueues workqueue: move hot fields of workqueue_struct to the end workqueue: make workqueue->name[] fixed len workqueue: add workqueue->unbound_attrs workqueue: determine NUMA node of workers accourding to the allowed cpumask workqueue: drop 'H' from kworker names of unbound worker pools workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[] workqueue: move pwq_pool_locking outside of get/put_unbound_pool() workqueue: fix memory leak in apply_workqueue_attrs() workqueue: fix unbound workqueue attrs hashing / comparison workqueue: fix race condition in unbound workqueue free path workqueue: remove pwq_lock which is no longer used ...
2013-04-29Merge branch 'for-3.10-async' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull async update from Tejun Heo: "This contains three cleanup patches for async from Lai. All three patches are essentially cosmetic." * 'for-3.10-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: async: rename and redefine async_func_ptr async: remove unused @node from struct async_domain async: simplify lowest_in_progress()
2013-04-29Merge branch 'for-3.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu patch from Tejun Heo: "A puny pull request for percpu. We were expecting more cleanup patches but didn't happen this time, so just a single patch adding documentation from Christoph." * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: add documentation on this_cpu operations
2013-04-29net: rename random32 to prandomAkinobu Mita
Commit 496f2f93b1cc ("random32: rename random32 to prandom") renamed random32() and srandom32() to prandom_u32() and prandom_seed() respectively. net_random() and net_srandom() need to be redefined with prandom_* in order to finish the naming transition. While I'm at it, enclose macro argument of net_srandom() with parenthesis. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29net/core: remove duplicate statements by do-while loopAkinobu Mita
Remove duplicate statements by using do-while loop instead of while loop. - A; - while (e) { + do { A; - } + } while (e); Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29net/core: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29net/netfilter: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Patrick McHardy <kaber@trash.net> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29net/sched: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29net/sunrpc: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29scsi: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Robert Love <robert.w.love@intel.com> Cc: James Smart <james.smart@emulex.com> Cc: Andrew Vasquez <andrew.vasquez@qlogic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29lguest: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29uwb: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29video/uvesafb: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Michal Januszewski <spock@gentoo.org> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29mmc: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Chris Ball <cjb@laptop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29drbd: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29kernel/: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29mm/: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29lib/: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29x86: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>