summaryrefslogtreecommitdiff
path: root/arch/powerpc/include/asm/topology.h
AgeCommit message (Collapse)Author
2017-08-16powerpc/topology: Remove the unused parent_node() macroDou Liyang
Commit a7be6e5a7f8d ("mm: drop useless local parameters of __register_one_node()") removes the last user of parent_node(). The parent_node() macro in POWERPC platform is unnecessary. Remove it for cleanup. Reported-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-03Merge branch 'smp-hotplug-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull SMP hotplug updates from Thomas Gleixner: "This update is primarily a cleanup of the CPU hotplug locking code. The hotplug locking mechanism is an open coded RWSEM, which allows recursive locking. The main problem with that is the recursive nature as it evades the full lockdep coverage and hides potential deadlocks. The rework replaces the open coded RWSEM with a percpu RWSEM and establishes full lockdep coverage that way. The bulk of the changes fix up recursive locking issues and address the now fully reported potential deadlocks all over the place. Some of these deadlocks have been observed in the RT tree, but on mainline the probability was low enough to hide them away." * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits) cpu/hotplug: Constify attribute_group structures powerpc: Only obtain cpu_hotplug_lock if called by rtasd ARM/hw_breakpoint: Fix possible recursive locking for arch_hw_breakpoint_init cpu/hotplug: Remove unused check_for_tasks() function perf/core: Don't release cred_guard_mutex if not taken cpuhotplug: Link lock stacks for hotplug callbacks acpi/processor: Prevent cpu hotplug deadlock sched: Provide is_percpu_thread() helper cpu/hotplug: Convert hotplug locking to percpu rwsem s390: Prevent hotplug rwsem recursion arm: Prevent hotplug rwsem recursion arm64: Prevent cpu hotplug rwsem recursion kprobes: Cure hotplug lock ordering issues jump_label: Reorder hotplug lock and jump_label_lock perf/tracing/cpuhotplug: Fix locking order ACPI/processor: Use cpu_hotplug_disable() instead of get_online_cpus() PCI: Replace the racy recursion prevention PCI: Use cpu_hotplug_disable() instead of get_online_cpus() perf/x86/intel: Drop get_online_cpus() in intel_snb_check_microcode() x86/perf: Drop EXPORT of perf_check_microcode ...
2017-06-23powerpc: Only obtain cpu_hotplug_lock if called by rtasdThiago Jung Bauermann
Calling arch_update_cpu_topology from a CPU hotplug state machine callback hits a deadlock because the function tries to get a read lock on cpu_hotplug_lock while the state machine still holds a write lock on it. Since all callers of arch_update_cpu_topology except rtasd already hold cpu_hotplug_lock, this patch changes the function to use stop_machine_cpuslocked and creates a separate function for rtasd which still tries to obtain the lock. Michael Bringmann investigated the bug and provided a detailed analysis of the deadlock on this previous RFC for an alternate solution: Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: John Allen <jallen@linux.vnet.ibm.com> Cc: Michael Bringmann <mwb@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1497996510-4032-1-git-send-email-bauerman@linux.vnet.ibm.com Link: https://patchwork.ozlabs.org/patch/771293/
2017-06-06powerpc/numa: Fix percpu allocations to be NUMA awareMichael Ellerman
In commit 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), we switched to the generic implementation of cpu_to_node(), which uses a percpu variable to hold the NUMA node for each CPU. Unfortunately we neglected to notice that we use cpu_to_node() in the allocation of our percpu areas, leading to a chicken and egg problem. In practice what happens is when we are setting up the percpu areas, cpu_to_node() reports that all CPUs are on node 0, so we allocate all percpu areas on node 0. This is visible in the dmesg output, as all pcpu allocs being in group 0: pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07 pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15 pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23 pcpu-alloc: [0] 24 25 26 27 [0] 28 29 30 31 pcpu-alloc: [0] 32 33 34 35 [0] 36 37 38 39 pcpu-alloc: [0] 40 41 42 43 [0] 44 45 46 47 To fix it we need an early_cpu_to_node() which can run prior to percpu being setup. We already have the numa_cpu_lookup_table we can use, so just plumb it in. With the patch dmesg output shows two groups, 0 and 1: pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07 pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15 pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23 pcpu-alloc: [1] 24 25 26 27 [1] 28 29 30 31 pcpu-alloc: [1] 32 33 34 35 [1] 36 37 38 39 pcpu-alloc: [1] 40 41 42 43 [1] 44 45 46 47 We can also check the data_offset in the paca of various CPUs, with the fix we see: CPU 0: data_offset = 0x0ffe8b0000 CPU 24: data_offset = 0x1ffe5b0000 And we can see from dmesg that CPU 24 has an allocation on node 1: node 0: [mem 0x0000000000000000-0x0000000fffffffff] node 1: [mem 0x0000001000000000-0x0000001fffffffff] Cc: stable@vger.kernel.org # v3.16+ Fixes: 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-05-27sched/topology: Rename topology_thread_cpumask() to topology_sibling_cpumask()Bartosz Golaszewski
Rename topology_thread_cpumask() to topology_sibling_cpumask() for more consistency with scheduler code. Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Benoit Cousson <bcousson@baylibre.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Jean Delvare <jdelvare@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Russell King <linux@arm.linux.org.uk> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/1432645896-12588-2-git-send-email-bgolaszewski@baylibre.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-10Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc updates from Ben Herrenschmidt: "Here is the bulk of the powerpc changes for this merge window. It got a bit delayed in part because I wasn't paying attention, and in part because I discovered I had a core PCI change without a PCI maintainer ack in it. Bjorn eventually agreed it was ok to merge it though we'll probably improve it later and I didn't want to rebase to add his ack. There is going to be a bit more next week, essentially fixes that I still want to sort through and test. The biggest item this time is the support to build the ppc64 LE kernel with our new v2 ABI. We previously supported v2 userspace but the kernel itself was a tougher nut to crack. This is now sorted mostly thanks to Anton and Rusty. We also have a fairly big series from Cedric that add support for 64-bit LE zImage boot wrapper. This was made harder by the fact that traditionally our zImage wrapper was always 32-bit, but our new LE toolchains don't really support 32-bit anymore (it's somewhat there but not really "supported") so we didn't want to rely on it. This meant more churn that just endian fixes. This brings some more LE bits as well, such as the ability to run in LE mode without a hypervisor (ie. under OPAL firmware) by doing the right OPAL call to reinitialize the CPU to take HV interrupts in the right mode and the usual pile of endian fixes. There's another series from Gavin adding EEH improvements (one day we *will* have a release with less than 20 EEH patches, I promise!). Another highlight is the support for the "Split core" functionality on P8 by Michael. This allows a P8 core to be split into "sub cores" of 4 threads which allows the subcores to run different guests under KVM (the HW still doesn't support a partition per thread). And then the usual misc bits and fixes ..." [ Further delayed by gmail deciding that BenH is a dirty spammer. Google knows. ] * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (155 commits) powerpc/powernv: Add missing include to LPC code selftests/powerpc: Test the THP bug we fixed in the previous commit powerpc/mm: Check paca psize is up to date for huge mappings powerpc/powernv: Pass buffer size to OPAL validate flash call powerpc/pseries: hcall functions are exported to modules, need _GLOBAL_TOC() powerpc: Exported functions __clear_user and copy_page use r2 so need _GLOBAL_TOC() powerpc/powernv: Set memory_block_size_bytes to 256MB powerpc: Allow ppc_md platform hook to override memory_block_size_bytes powerpc/powernv: Fix endian issues in memory error handling code powerpc/eeh: Skip eeh sysfs when eeh is disabled powerpc: 64bit sendfile is capped at 2GB powerpc/powernv: Provide debugfs access to the LPC bus via OPAL powerpc/serial: Use saner flags when creating legacy ports powerpc: Add cpu family documentation powerpc/xmon: Fix up xmon format strings powerpc/powernv: Add calls to support little endian host powerpc: Document sysfs DSCR interface powerpc: Fix regression of per-CPU DSCR setting powerpc: Split __SYSFS_SPRSETUP macro arch: powerpc/fadump: Cleaning up inconsistent NULL checks ...
2014-06-04mm: disable zone_reclaim_mode by defaultMel Gorman
When it was introduced, zone_reclaim_mode made sense as NUMA distances punished and workloads were generally partitioned to fit into a NUMA node. NUMA machines are now common but few of the workloads are NUMA-aware and it's routine to see major performance degradation due to zone_reclaim_mode being enabled but relatively few can identify the problem. Those that require zone_reclaim_mode are likely to be able to detect when it needs to be enabled and tune appropriately so lets have a sensible default for the bulk of users. This patch (of 2): zone_reclaim_mode causes processes to prefer reclaiming memory from local node instead of spilling over to other nodes. This made sense initially when NUMA machines were almost exclusively HPC and the workload was partitioned into nodes. The NUMA penalties were sufficiently high to justify reclaiming the memory. On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default. Users that are sophisticated enough to know they need zone_reclaim_mode will detect it. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-28powerpc: Fix comment around arch specific definition of RECLAIM_DISTANCEPreeti U Murthy
Commit 32e45ff43eaf5c17f changed the default value of RECLAIM_DISTANCE to 30. However the comment around arch specifc definition of RECLAIM_DISTANCE is not updated to reflect the same. Correct the value mentioned in the comment. Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Anton Blanchard <anton@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <Kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-05-28powerpc/numa: Enable USE_PERCPU_NUMA_NODE_IDNishanth Aravamudan
Based off 3bccd996 for ia64, convert powerpc to use the generic per-CPU topology tracking, specifically: initialize per cpu numa_node entry in start_secondary remove the powerpc cpu_to_node() define CONFIG_USE_PERCPU_NUMA_NODE_ID if NUMA Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-11sched: Remove unused mc_capable() and smt_capable()Bjorn Helgaas
Remove mc_capable() and smt_capable(). Neither is used. Both were added by 5c45bf279d37 ("sched: mc/smt power savings sched policy"). Uses of both were removed by 8e7fbcbc22c1 ("sched: Remove stale power aware scheduling remnants and dysfunctional knobs"). Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Link: http://lkml.kernel.org/r/20140304210737.16893.54289.stgit@bhelgaas-glaptop.roam.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-15powerpc: Fix the setup of CPU-to-Node mappings during CPU onlineSrivatsa S. Bhat
On POWER platforms, the hypervisor can notify the guest kernel about dynamic changes in the cpu-numa associativity (VPHN topology update). Hence the cpu-to-node mappings that we got from the firmware during boot, may no longer be valid after such updates. This is handled using the arch_update_cpu_topology() hook in the scheduler, and the sched-domains are rebuilt according to the new mappings. But unfortunately, at the moment, CPU hotplug ignores these updated mappings and instead queries the firmware for the cpu-to-numa relationships and uses them during CPU online. So the kernel can end up assigning wrong NUMA nodes to CPUs during subsequent CPU hotplug online operations (after booting). Further, a particularly problematic scenario can result from this bug: On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8) threads per core. The switch to Single-Threaded (ST) mode is performed by offlining all except the first CPU thread in each core. Switching back to SMT mode involves onlining those other threads back, in each core. Now consider this scenario: 1. During boot, the kernel gets the cpu-to-node mappings from the firmware and assigns the CPUs to NUMA nodes appropriately, during CPU online. 2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and communicates this update to the kernel. The kernel in turn updates its cpu-to-node associations and rebuilds its sched domains. Everything is fine so far. 3. Now, the user switches the machine from SMT to ST mode (say, by running ppc64_cpu --smt=1). This involves offlining all except 1 thread in each core. 4. The user then tries to switch back from ST to SMT mode (say, by running ppc64_cpu --smt=4), and this involves onlining those threads back. Since CPU hotplug ignores the new mappings, it queries the firmware and tries to associate the newly onlined sibling threads to the old NUMA nodes. This results in sibling threads within the same core getting associated with different NUMA nodes, which is incorrect. The scheduler's build-sched-domains code gets thoroughly confused with this and enters an infinite loop and causes soft-lockups, as explained in detail in commit 3be7db6ab (powerpc: VPHN topology change updates all siblings). So to fix this, use the numa_cpu_lookup_table to remember the updated cpu-to-node mappings, and use them during CPU hotplug online operations. Further, we also need to ensure that all threads in a core are assigned to a common NUMA node, irrespective of whether all those threads were online during the topology update. To achieve this, we take care not to use cpu_sibling_mask() since it is not hotplug invariant. Instead, we use cpu_first_sibling_thread() and set up the mappings manually using the 'threads_per_core' value for that particular platform. This helps us ensure that we don't hit this bug with any combination of CPU hotplug and SMT mode switching. Cc: stable@vger.kernel.org Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-14powerpc: Make chip-id information available to userspaceVasant Hegde
So far "/sys/devices/system/cpu/cpuX/topology/physical_package_id" was always default (-1) on ppc64 architecture. Now, some systems have an ibm,chip-id property in the cpu nodes in the device tree. On these systems, we now use this information to display physical_package_id. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26powerpc/pseries: Add /proc interface to control topology updatesNathan Fontenot
There are instances in which we do not want topology updates to occur. In order to allow this a /proc interface (/proc/powerpc/topology_updates) is introduced so that topology updates can be enabled and disabled. This patch also adds a prrn_is_enabled() call so that PRRN events are handled in the kernel only if topology updating is enabled. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-05-09sched/numa: Rewrite the CONFIG_NUMA sched domain supportPeter Zijlstra
The current code groups up to 16 nodes in a level and then puts an ALLNODES domain spanning the entire tree on top of that. This doesn't reflect the numa topology and esp for the smaller not-fully-connected machines out there today this might make a difference. Therefore, build a proper numa topology based on node_distance(). Since there's no fixed numa layers anymore, the static SD_NODE_INIT and SD_ALLNODES_INIT aren't usable anymore, the new code tries to construct something similar and scales some values either on the number of cpus in the domain and/or the node_distance() ratio. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Anton Blanchard <anton@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: David Howells <dhowells@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: linux-alpha@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-sh@vger.kernel.org Cc: Matt Turner <mattst88@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: sparclinux@vger.kernel.org Cc: Tony Luck <tony.luck@intel.com> Cc: x86@kernel.org Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Greg Pearson <greg.pearson@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: bob.picco@oracle.com Cc: chris.mason@oracle.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-r74n3n8hhuc2ynbrnp3vt954@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2011-12-21cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystemKay Sievers
This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem and converts the devices to regular devices. The sysdev drivers are implemented as subsystem interfaces now. After all sysdev classes are ported to regular driver core entities, the sysdev implementation will be entirely removed from the kernel. Userspace relies on events and generic sysfs subsystem infrastructure from sysdev devices, which are made available with this conversion. Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Borislav Petkov <bp@amd64.org> Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Cc: Len Brown <lenb@kernel.org> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-09-20powerpc/numa: Remove duplicate RECLAIM_DISTANCE definitionAnton Blanchard
We have two identical definitions of RECLAIM_DISTANCE, looks like the patch got applied twice. Remove one. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-20powerpc/numa: Disable NEWIDLE balancing at node levelAnton Blanchard
On big POWER7 boxes we see large amounts of CPU time in system processes like workqueue and watchdog kernel threads. We currently rebalance the entire machine each time a task goes idle and this is very expensive on large machines. Disable newidle balancing at the node level and rely on the scheduler tick to rebalance across nodes. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-20powerpc/numa: Increase SD_NODES_PER_DOMAIN to 32.Anton Blanchard
The largest POWER7 boxes have 32 nodes. SD_NODES_PER_DOMAIN groups nodes into chunks of 16 and adds a global balancing domain (SD_ALLNODES) above it. If we bump SD_NODES_PER_DOMAIN to 32, then we avoid this extra level of balancing on our largest boxes. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-20powerpc/numa: Enable SD_WAKE_AFFINE in node definitionAnton Blanchard
When chasing a performance issue on ppc64, I noticed tasks communicating via a pipe would often end up on different nodes. It turns out SD_WAKE_AFFINE is not set in our node defition. Commit 9fcd18c9e63e (sched: re-tune balancing) enabled SD_WAKE_AFFINE in the node definition for x86 and we need a similar change for ppc64. I used lmbench lat_ctx and perf bench pipe to verify this fix. Each benchmark was run 10 times and the average taken. lmbench lat_ctx: before: 66565 ops/sec after: 204700 ops/sec 3.1x faster perf bench pipe: before: 5.6570 usecs after: 1.3470 usecs 4.2x faster Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-01-12powerpc/pseries: Fix build of topology stuff without CONFIG_NUMABenjamin Herrenschmidt
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-01-11powerpc/pseries: Fix VPHN build errors on non-SMP systemsJesse Larrew
The header asm/hvcall.h was previously included indirectly via smp.h. On non-SMP systems, however, these declarations are excluded and the build breaks. This is easily fixed by including asm/hvcall.h directly. The VPHN feature is only meaningful on NUMA systems that implement the SPLPAR option, so exclude the VPHN code on systems without SPLPAR enabled. Also, expose unmap_cpu_from_node() on systems with SPLPAR enabled, even if CONFIG_HOTPLUG_CPU is disabled. Lastly, map_cpu_to_node() is now needed by VPHN to manipulate the node masks after boot time, so remove the __cpuinit annotation to fix a section mismatch. Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
2010-12-09powerpc: Disable VPHN polling during a suspend operationJesse Larrew
Tie the polling mechanism into the ibm,suspend-me rtas call to stop/restart polling before/after a suspend, hibernate, migrate, or checkpoint restart operation. This ensures that the system has a chance to disable the polling if the partition is migrated to a system that does not support VPHN (and vice versa). Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-08-05Merge branch 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6Linus Torvalds
* 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6: (63 commits) of/platform: Register of_platform_drivers with an "of:" prefix of/address: Clean up function declarations of/spi: call of_register_spi_devices() from spi core code of: Provide default of_node_to_nid() implementation. of/device: Make of_device_make_bus_id() usable by other code. of/irq: Fix endian issues in parsing interrupt specifiers of: Fix phandle endian issues of/flattree: fix of_flat_dt_is_compatible() to match the full compatible string of: remove of_default_bus_ids of: make of_find_device_by_node generic microblaze: remove references to of_device and to_of_device sparc: remove references to of_device and to_of_device powerpc: remove references to of_device and to_of_device of/device: Replace of_device with platform_device in includes and core code of/device: Protect against binding of_platform_drivers to non-OF devices of: remove asm/of_device.h of: remove asm/of_platform.h of/platform: remove all of_bus_type and of_platform_bus_type references of: Merge of_platform_bus_type with platform_bus_type drivercore/of: Add OF style matching to platform bus ... Fix up trivial conflicts in arch/microblaze/kernel/Makefile due to just some obj-y removals by the devicetree branch, while the microblaze updates added a new file.
2010-07-30of: Provide default of_node_to_nid() implementation.Grant Likely
of_node_to_nid() is only relevant in a few architectures. Don't force everyone to implement it anyway. Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-07-09powerpc/numa: Use form 1 affinity to setup node distanceAnton Blanchard
Form 1 affinity allows multiple entries in ibm,associativity-reference-points which represent affinity domains in decreasing order of importance. The Linux concept of a node is always the first entry, but using the other values as an input to node_distance() allows the memory allocator to make better decisions on which node to go first when local memory has been exhausted. We keep things simple and create an array indexed by NUMA node, capped at 4 entries. Each time we lookup an associativity property we initialise the array which is overkill, but since we should only hit this path during boot it didn't seem worth adding a per node valid bit. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-05-21powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaimAnton Blanchard
I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets enabled via this: /* * If another node is sufficiently far away then it is better * to reclaim pages in a zone before going off node. */ if (distance > RECLAIM_DISTANCE) zone_reclaim_mode = 1; Since we use the default value of 20 for REMOTE_DISTANCE and 20 for RECLAIM_DISTANCE it never kicks in. The local to remote bandwidth ratios can be quite large on System p machines so it makes sense for us to reclaim clean pagecache locally before going off node. The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables zone reclaim. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-05-06powerpc/cpumask: Convert NUMA code to new cpumask APIAnton Blanchard
Convert NUMA code to new cpumask API. We shift the node to cpumask setup code until after we complete bootmem allocation so we can dynamically allocate the cpumasks. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-05-06powerpc/cpumask: Dynamically allocate cpu_sibling_map and cpu_core_map cpumasksAnton Blanchard
Dynamically allocate cpu_sibling_map and cpu_core_map cpumasks. We don't need to set_cpu_online() the boot cpu in smp_prepare_boot_cpu, init/main.c does it for us. We also postpone setting of the boot cpu in cpu_sibling_map and cpu_core_map until when the memory allocator is available (smp_prepare_cpus), similar to x86. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-04-07powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaimAnton Blanchard
I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets enabled via this: /* * If another node is sufficiently far away then it is better * to reclaim pages in a zone before going off node. */ if (distance > RECLAIM_DISTANCE) zone_reclaim_mode = 1; Since we use the default value of 20 for REMOTE_DISTANCE and 20 for RECLAIM_DISTANCE it never kicks in. The local to remote bandwidth ratios can be quite large on System p machines so it makes sense for us to reclaim clean pagecache locally before going off node. The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables zone reclaim. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-09powerpc: Reformat SD_NODE_INIT to match x86Anton Blanchard
Clean up SD_NODE_INITS so we can easily compare it to x86. Similar to the work in 47734f89be0614b5acbd6a532390f9c72f019648 (sched: Clean up topology.h) Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-01-15powerpc: cpumask_of_node() should handle -1 as a nodeAnton Blanchard
pcibus_to_node can return -1 if we cannot determine which node a pci bus is on. If passed -1, cpumask_of_node will negatively index the lookup array and pull in random data: # cat /sys/devices/pci0000:00/0000:00:01.0/local_cpus 00000000,00000003,00000000,00000000 # cat /sys/devices/pci0000:00/0000:00:01.0/local_cpulist 64-65 Change cpumask_of_node to check for -1 and return cpu_all_mask in this case: # cat /sys/devices/pci0000:00/0000:00:01.0/local_cpus ffffffff,ffffffff,ffffffff,ffffffff # cat /sys/devices/pci0000:00/0000:00:01.0/local_cpulist 0-127 Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-09-24cpumask: remove obsolete topology_core_siblings and ↵Rusty Russell
topology_thread_siblings: powerpc There were replaced by topology_core_cpumask and topology_thread_cpumask. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-09-24cpumask: remove obsolete node_to_cpumask now everyone uses cpumask_of_nodeRusty Russell
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-09-24cpumask: remove the now-obsoleted pcibus_to_cpumask(): powerpcRusty Russell
cpumask_of_pcibus() is the new version. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-09-16sched: Disable wakeup balancingPeter Zijlstra
Sysbench thinks SD_BALANCE_WAKE is too agressive and kbuild doesn't really mind too much, SD_BALANCE_NEWIDLE picks up most of the slack. On a dual socket, quad core, dual thread nehalem system: sysbench (--num_threads=16): SD_BALANCE_WAKE-: 13982 tx/s SD_BALANCE_WAKE+: 15688 tx/s kbuild (-j16): SD_BALANCE_WAKE-: 47.648295846 seconds time elapsed ( +- 0.312% ) SD_BALANCE_WAKE+: 47.608607360 seconds time elapsed ( +- 0.026% ) (same within noise) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-15sched: Improve latencies and throughputMike Galbraith
Make the idle balancer more agressive, to improve a x264 encoding workload provided by Jason Garrett-Glaser: NEXT_BUDDY NO_LB_BIAS encoded 600 frames, 252.82 fps, 22096.60 kb/s encoded 600 frames, 250.69 fps, 22096.60 kb/s encoded 600 frames, 245.76 fps, 22096.60 kb/s NO_NEXT_BUDDY LB_BIAS encoded 600 frames, 344.44 fps, 22096.60 kb/s encoded 600 frames, 346.66 fps, 22096.60 kb/s encoded 600 frames, 352.59 fps, 22096.60 kb/s NO_NEXT_BUDDY NO_LB_BIAS encoded 600 frames, 425.75 fps, 22096.60 kb/s encoded 600 frames, 425.45 fps, 22096.60 kb/s encoded 600 frames, 422.49 fps, 22096.60 kb/s Peter pointed out that this is better done via newidle_idx, not via LB_BIAS, newidle balancing should look for where there is load _now_, not where there was load 2 ticks ago. Worst-case latencies are improved as well as no buddies means less vruntime spread. (as per prior lkml discussions) This change improves kbuild-peak parallelism as well. Reported-by: Jason Garrett-Glaser <darkshikari@gmail.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1253011667.9128.16.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-15sched: Tweak wake_idxPeter Zijlstra
When merging select_task_rq_fair() and sched_balance_self() we lost the use of wake_idx, restore that and set them to 0 to make wake balancing more aggressive. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-15sched: Merge select_task_rq_fair() and sched_balance_self()Peter Zijlstra
The problem with wake_idle() is that is doesn't respect things like cpu_power, which means it doesn't deal well with SMT nor the recent RT interaction. To cure this, it needs to do what sched_balance_self() does, which leads to the possibility of merging select_task_rq_fair() and sched_balance_self(). Modify sched_balance_self() to: - update_shares() when walking up the domain tree, (it only called it for the top domain, but it should have done this anyway), which allows us to remove this ugly bit from try_to_wake_up(). - do wake_affine() on the smallest domain that contains both this (the waking) and the prev (the wakee) cpu for WAKE invocations. Then use the top-down balance steps it had to replace wake_idle(). This leads to the dissapearance of SD_WAKE_BALANCE and SD_WAKE_IDLE_FAR, with SD_WAKE_IDLE replaced with SD_BALANCE_WAKE. SD_WAKE_AFFINE needs SD_BALANCE_WAKE to be effective. Touch all topology bits to replace the old with new SD flags -- platforms might need re-tuning, enabling SD_BALANCE_WAKE conditionally on a NUMA distance seems like a good additional feature, magny-core and small nehalem systems would want this enabled, systems with slow interconnects would not. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-30cpumask: remove node_to_first_cpuRusty Russell
Everyone defines it, and only one person uses it (arch/mips/sgi-ip27/ip27-nmi.c). So just open code it there. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: linux-mips@linux-mips.org
2009-01-03Merge branch 'master' of ↵Mike Travis
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask into merge-rr-cpumask Conflicts: arch/x86/kernel/io_apic.c kernel/rcuclassic.c kernel/sched.c kernel/time/tick-sched.c Signed-off-by: Mike Travis <travis@sgi.com> [ mingo@elte.hu: backmerged typo fix for io_apic.c ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-01cpumask: Introduce topology_core_cpumask()/topology_thread_cpumask(): powerpcRusty Russell
Impact: New API The old topology_core_siblings() and topology_thread_siblings() return a cpumask_t; these new ones return a (const) struct cpumask *. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com>
2008-12-26cpumask: powerpc: Introduce cpumask_of_{node,pcibus} to replace ↵Rusty Russell
{node,pcibus}_to_cpumask Impact: New APIs The old node_to_cpumask/node_to_pcibus returned a cpumask_t: these return a pointer to a struct cpumask. Part of removing cpumasks from the stack. (Also replaces powerpc internal uses of node_to_cpumask). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2008-11-26sched: convert struct root_domain to cpumask_var_t, fixIngo Molnar
Mathieu Desnoyers reported this build failure on powerpc: kernel/sched.c: In function 'sd_init_NODE': kernel/sched.c:7319: error: non-static initialization of a flexible array member kernel/sched.c:7319: error: (near initialization for '(anonymous)') this happens because .span changed to cpumask_var_t, hence the static CPU_MASK_NONE initializers in the SD_*_INIT templates are not type-correct anymore. Remove them, as they default to empty anyway. Also remove them from IA64, MIPS and SH. Reported-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-04powerpc: Move include files to arch/powerpc/include/asmStephen Rothwell
from include/asm-powerpc. This is the result of a mkdir arch/powerpc/include/asm git mv include/asm-powerpc/* arch/powerpc/include/asm Followed by a few documentation/comment fixups and a couple of places where <asm-powepc/...> was being used explicitly. Of the latter only one was outside the arch code and it is a driver only built for powerpc. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Paul Mackerras <paulus@samba.org>