summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2020-10-06powerpc/smp: Optimize update_mask_by_l2Srikar Dronamraju
All threads of a SMT4 core can either be part of this CPU's l2-cache mask or not related to this CPU l2-cache mask. Use this relation to reduce the number of iterations needed to find all the CPUs that share the same l2-cache. Use a temporary mask to iterate through the CPUs that may share l2_cache mask. Also instead of setting one CPU at a time into cpu_l2_cache_mask, copy the SMT4/sub mask at one shot. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200921095653.9701-10-srikar@linux.vnet.ibm.com
2020-10-06powerpc/smp: Check for duplicate topologies and consolidateSrikar Dronamraju
CACHE and COREGROUP domains are now part of default topology. However on systems that don't support CACHE or COREGROUP, these domains will eventually be degenerated. The degeneration happens per CPU. Do note the current fixup_topology() logic ensures that mask of a domain that is not supported on the current platform is set to the previous domain. Instead of waiting for the scheduler to degenerated try to consolidate based on their masks and sd_flags. This is done just before setting the scheduler topology. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200921095653.9701-9-srikar@linux.vnet.ibm.com
2020-10-06powerpc/smp: Depend on cpu_l1_cache_map when adding CPUsSrikar Dronamraju
Currently on hotplug/hotunplug, CPU iterates through all the CPUs in its core to find threads in its thread group. However this info is already captured in cpu_l1_cache_map. Hence reduce iterations and cleanup add_cpu_to_smallcore_masks function. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200921095653.9701-8-srikar@linux.vnet.ibm.com
2020-10-06powerpc/smp: Stop passing mask to update_mask_by_l2Srikar Dronamraju
update_mask_by_l2 is called only once. But it passes cpu_l2_cache_mask as parameter. Instead of passing cpu_l2_cache_mask, use it directly in update_mask_by_l2. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200921095653.9701-7-srikar@linux.vnet.ibm.com
2020-10-06powerpc/smp: Limit CPUs traversed to within a node.Srikar Dronamraju
All the arch specific topology cpumasks are within a node/DIE. However when setting these per CPU cpumasks, system traverses through all the online CPUs. This is redundant. Reduce the traversal to only CPUs that are online in the node to which the CPU belongs to. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200921095653.9701-6-srikar@linux.vnet.ibm.com
2020-10-06powerpc/smp: Optimize remove_cpu_from_masksSrikar Dronamraju
While offlining a CPU, system currently iterate through all the CPUs in the DIE to clear sibling, l2_cache and smallcore maps. However if there are more cores in a DIE, system can end up spending more time iterating through CPUs which are completely unrelated. Optimize this by only iterating through smaller but relevant cpumap. If shared_cache is set, cpu_l2_cache_map should be relevant else cpu_sibling_map would be relevant. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200921095653.9701-5-srikar@linux.vnet.ibm.com
2020-10-06powerpc/smp: Remove get_physical_package_idSrikar Dronamraju
Now that cpu_core_mask has been removed and topology_core_cpumask has been updated to use cpu_cpu_mask, we no more need get_physical_package_id. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200921095653.9701-4-srikar@linux.vnet.ibm.com
2020-10-06powerpc/smp: Stop updating cpu_core_maskSrikar Dronamraju
Anton Blanchard reported that his 4096 vcpu KVM guest took around 30 minutes to boot. He also analyzed it to the time taken to iterate while setting the cpu_core_mask. Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be equal on Power. However updating cpu_core_mask took forever to update as its a per cpu cpumask variable. Instead cpu_cpu_mask was a per NODE /per DIE cpumask that was shared by all the respective CPUs. Also cpu_cpu_mask is needed from a scheduler perspective. However cpu_core_map is an exported symbol. Hence stop updating cpu_core_map and make it point to cpu_cpu_mask. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200921095653.9701-3-srikar@linux.vnet.ibm.com
2020-10-06powerpc/topology: Update topology_core_cpumaskSrikar Dronamraju
On Power, cpu_core_mask and cpu_cpu_mask refer to the same set of CPUs. cpu_cpu_mask is needed by scheduler, hence look at deprecating cpu_core_mask. Before deleting the cpu_core_mask, ensure its only user is moved to cpu_cpu_mask. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200921095653.9701-2-srikar@linux.vnet.ibm.com
2020-10-06powerpc/tm: Save and restore AMR on treclaim and trechkptGustavo Romero
Althought AMR is stashed in the checkpoint area, currently we don't save it to the per thread checkpoint struct after a treclaim and so we don't restore it either from that struct when we trechkpt. As a consequence when the transaction is later rolled back the kernel space AMR value when the trechkpt was done appears in userspace. That commit saves and restores AMR accordingly on treclaim and trechkpt. Since AMR value is also used in kernel space in other functions, it also takes care of stashing kernel live AMR into the stack before treclaim and before trechkpt, restoring it later, just before returning from tm_reclaim and __tm_recheckpoint. Is also fixes two nonrelated comments about CR and MSR. Signed-off-by: Gustavo Romero <gromero@linux.ibm.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200919150025.9609-1-gromero@linux.ibm.com
2020-10-06powerpc/eeh: Clean up PE addressingOliver O'Halloran
When support for EEH on PowerNV was added a lot of pseries specific code was made "generic" and some of the quirks of pseries EEH came along for the ride. One of the stranger quirks is eeh_pe containing two types of PE address: pe->addr and pe->config_addr. There reason for this appears to be historical baggage rather than any real requirements. On pseries EEH PEs are manipulated using RTAS calls. Each EEH RTAS call takes a "PE configuration address" as an input which is used to identify which EEH PE is being manipulated by the call. When initialising the EEH state for a device the first thing we need to do is determine the configuration address for the PE which contains the device so we can enable EEH on that PE. This process is outlined in PAPR which is the modern (i.e post-2003) FW specification for pseries. However, EEH support was first described in the pSeries RISC Platform Architecture (RPA) and although they are mostly compatible EEH is one of the areas where they are not. The major difference is that RPA doesn't actually have the concept of a PE. On RPA systems the EEH RTAS calls are done on a per-device basis using the same config_addr that would be passed to the RTAS functions to access PCI config space (e.g. ibm,read-pci-config). The config_addr is not identical since the function and config register offsets of the config_addr must be set to zero. EEH operations being done on a per-device basis doesn't make a whole lot of sense when you consider how EEH was implemented on legacy PCI systems. For legacy PCI(-X) systems EEH was implemented using special PCI-PCI bridges which contained logic to detect errors and freeze the secondary bus when one occurred. This means that the EEH enabled state is shared among all devices behind that EEH bridge. As a result there's no way to implement the per-device control required for the semantics specified by RPA. It can be made to work if we assume that a separate EEH bridge exists for each EEH capable PCI slot and there are no bridges behind those slots. However, RPA also specifies the ibm,configure-bridge RTAS call for re-initalising bridges behind EEH capable slots after they are reset due to an EEH event so that is probably not a valid assumption. This incoherence was fixed in later PAPR, which succeeded RPA. Unfortunately, since Linux EEH support seems to have been implemented based on the RPA spec some of the legacy assumptions were carried over (probably for POWER4 compatibility). The fix made in PAPR was the introduction of the "PE" concept and redefining the EEH RTAS calls (set-eeh-option, reset-slot, etc) to operate on a per-PE basis so all devices behind an EEH bride would share the same EEH state. The "config_addr" argument to the EEH RTAS calls became the "PE_config_addr" and the OS was required to use the ibm,get-config-addr-info RTAS call to find the correct PE address for the device. When support for the new interfaces was added to Linux it was implemented using something like: At probe time: pdn->eeh_config_addr = rtas_config_addr(pdn); pdn->eeh_pe_config_addr = rtas_get_config_addr_info(pdn); When performing an RTAS call: config_addr = pdn->eeh_config_addr; if (pdn->eeh_pe_config_addr) config_addr = pdn->eeh_pe_config_addr; rtas_call(..., config_addr, ...); In other words, if the ibm,get-config-addr-info RTAS call is implemented and returned a valid result we'd use that as the argument to the EEH RTAS calls. If not, Linux would fall back to using the device's config_addr. Over time these addresses have moved around going from pci_dn to eeh_dev and finally into eeh_pe. Today the users look like this: config_addr = pe->config_addr; if (pe->addr) config_addr = pe->addr; rtas_call(..., config_addr, ...); However, considering the EEH core always operates on a per-PE basis and even on pseries the only per-device operation is the initial call to ibm,set-eeh-option I'm not sure if any of this actually works on an RPA system today. It doesn't make much sense to have the fallback address in a generic structure either since the bulk of the code which reference it is in pseries anyway. The EEH core makes a token effort to support looking up a PE using the config_addr by having two arguments to eeh_pe_get(). However, a survey of all the callers to eeh_pe_get() shows that all bar one have the config_addr argument hard-coded to zero.The only caller that doesn't is in eeh_pe_tree_insert() which has: if (!eeh_has_flag(EEH_VALID_PE_ZERO) && !edev->pe_config_addr) return -EINVAL; pe = eeh_pe_get(hose, edev->pe_config_addr, edev->bdfn); The third argument (config_addr) is only used if the second (pe->addr) argument is invalid. The preceding check ensures that the call to eeh_pe_get() will never happen if edev->pe_config_addr is invalid so there is no situation where eeh_pe_get() will search for a PE based on the 3rd argument. The check also means that we'll never insert a PE into the tree where pe_config_addr is zero since EEH_VALID_PE_ZERO is never set on pseries. All the users of the fallback address on pseries never actually use the fallback and all the only caller that supplies something for the config_addr argument to eeh_pe_get() never use it either. It's all dead code. This patch removes the fallback address from eeh_pe since nothing uses it. Specificly, we do this by: 1) Removing pe->config_addr 2) Removing the EEH_VALID_PE_ZERO flag 3) Removing the fallback address argument to eeh_pe_get(). 4) Removing all the checks for pe->addr being zero in the pseries EEH code. This leaves us with PE's only being identified by what's in their pe->addr field and the EEH core relying on the platform to ensure that eeh_dev's are only inserted into the EEH tree if they're actually inside a PE. No functional changes, I hope. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-9-oohall@gmail.com
2020-10-06powerpc/pseries/eeh: Allow zero to be a valid PE configuration addressOliver O'Halloran
There's no real reason why zero can't be a valid PE configuration address. Under qemu each sPAPR PHB (i.e. EEH supporting) has the passed-though devices on bus zero, so the PE address of bus <dddd>:00 should be zero. However, all previous versions of Linux will reject that, so Qemu at least goes out of it's way to avoid it. The Qemu implementation of ibm,get-config-addr-info2 RTAS has the following comment: > /* > * We always have PE address of form "00BB0001". "BB" > * represents the bus number of PE's primary bus. > */ So qemu puts a one into the register portion of the PE's config_addr to avoid it being zero. The whole is pretty silly considering that RTAS will return a negative error code if it can't map the device's config_addr to a PE. This patch fixes Linux to treat zero as a valid PE address. This shouldn't have any real effects due to the Qemu hack mentioned above. And the fact that Linux EEH has worked historically on PowerVM means they never pass through devices on bus zero so we would never see the problem there either. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-8-oohall@gmail.com
2020-10-06powerpc/pseries/eeh: Rework device EEH PE determinationOliver O'Halloran
The process Linux uses for determining if a device supports EEH or not appears to be at odds with what PAPR says the OS should be doing. The current flow is something like: 1. Assume pe_config_addr is equal the the device's config_addr. 2. Attempt to enable EEH on that PE 3. Verify EEH was enabled (POWER4 bug workaround) 4. Try find the pe_config_addr using the ibm,get-config-addr-info2 RTAS call. 5. If that fails walk the pci_dn tree upwards trying to find a parent device with EEH support. If we find one then add the device to that PE. The first major problem with this process is that we need the PE config address in step 2) since its needs to be passed to the ibm,set-eeh-option RTAS call when enabling EEH for th PE. We hack around this requirement in by making the assumption in 1) and delay finding the actual PE address until 4). This is fine if: a) The PCI device is the 0th function, and b) The device is on the PE's root bus. Granted, the current sequence does appear to work on most systems even when these conditions are false. At a guess PowerVM's RTAS has workarounds to accommodate Linux's quirks or the RTAS call to enable EEH is treated as no-op on most platforms since EEH is usually enabled by default. However, what is currently implemented is a bit sketch and is downright confusing since it doesn't match up with what what PAPR suggests we should be doing. This patch re-works how we handle EEH init so that we find the PE config address using the ibm,get-config-addr-info2 RTAS call first, then use the found address to finish the EEH init process. It also drops the Power4 workaround since as of commit 471d7ff8b51b ("powerpc/64s: Remove POWER4 support") the kernel does not support running on a Power4 CPU so there's no need to support the Power4 platform's quirks either. With the patch applied the sequence is now: 1. Find the pe_config_addr from the device using the RTAS call. 2. Enable the PE. 3. Insert the edev into the tree and create an eeh_pe if needed. The other change made here is ignoring unsupported devices entirely. Currently the device's BARs are saved to the eeh_dev even if the device is not part of an EEH PE. Not being part of a PE means that an EEH recovery pass will never see that device so the saving the BARs is pointless. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-7-oohall@gmail.com
2020-10-06powerpc/pseries/eeh: Clean up pe_config_addr lookupsOliver O'Halloran
De-duplicate, and fix up the comments, and make the prototype just take a pci_dn since the job of the function is to return the pe_config_addr of the PE which contains a given device. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-6-oohall@gmail.com
2020-10-06powerpc/eeh: Move EEH initialisation to an arch initcallOliver O'Halloran
The initialisation of EEH mostly happens in a core_initcall_sync initcall, followed by registering a bus notifier later on in an arch_initcall. Anything involving initcall dependecies is mostly incomprehensible unless you've spent a while staring at code so here's the full sequence: ppc_md.setup_arch <-- pci_controllers are created here ...time passes... core_initcall <-- pci_dns are created from DT nodes core_initcall_sync <-- platforms call eeh_init() postcore_initcall <-- PCI bus type is registered postcore_initcall_sync arch_initcall <-- EEH pci_bus notifier registered subsys_initcall <-- PHBs are scanned here There's no real requirement to do the EEH setup at the core_initcall_sync level. It just needs to be done after pci_dn's are created and before we start scanning PHBs. Simplify the flow a bit by moving the platform EEH inititalisation to an arch_initcall so we can fold the bus notifier registration into eeh_init(). Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-5-oohall@gmail.com
2020-10-06powerpc/eeh: Delete eeh_ops->initOliver O'Halloran
No longer used since the platforms perform their EEH initialisation before calling eeh_init(). Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-4-oohall@gmail.com
2020-10-06powerpc/pseries: Stop using eeh_ops->init()Oliver O'Halloran
Fold pseries_eeh_init() into eeh_pseries_init() rather than having eeh_init() call it via eeh_ops->init(). It's simpler and it'll let us delete eeh_ops.init. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-3-oohall@gmail.com
2020-10-06powerpc/powernv: Stop using eeh_ops->init()Oliver O'Halloran
Fold pnv_eeh_init() into eeh_powernv_init() rather than having eeh_init() call it via eeh_ops->init(). It's simpler and it'll let us delete eeh_ops.init. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-2-oohall@gmail.com
2020-10-06powerpc/eeh: Rework EEH initialisationOliver O'Halloran
Drop the EEH register / unregister ops thing and have the platform pass the ops structure into eeh_init() directly. This takes one initcall out of the EEH setup path and it means we're only doing EEH setup on the platforms which actually support it. It's also less code and generally easier to follow. No functional changes. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918093050.37344-1-oohall@gmail.com
2020-10-06powerpc: switch 85xx defconfigs from legacy ide to libataChristoph Hellwig
Switch the 85xx defconfigs from the soon to be removed legacy ide driver to libata. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200924041310.520970-1-hch@lst.de
2020-10-06powerpc: PPC_SECURE_BOOT should not require PowerNVDaniel Axtens
In commit 61f879d97ce4 ("powerpc/pseries: Detect secure and trusted boot state of the system.") we taught the kernel how to understand the secure-boot parameters used by a pseries guest. However, CONFIG_PPC_SECURE_BOOT still requires PowerNV. I didn't catch this because pseries_le_defconfig includes support for PowerNV and so everything still worked. Indeed, most configs will. Nonetheless, technically PPC_SECURE_BOOT doesn't require PowerNV any more. The secure variables support (PPC_SECVAR_SYSFS) doesn't do anything on pSeries yet, but I don't think it's worth adding a new condition - at some stage we'll want to add a backend for pSeries anyway. Fixes: 61f879d97ce4 ("powerpc/pseries: Detect secure and trusted boot state of the system.") Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200924014922.172914-1-dja@axtens.net
2020-10-06powerpc/papr_scm: Fix warnings about undeclared variableWang Wensheng
Build the kernel with 'make C=2': arch/powerpc/platforms/pseries/papr_scm.c:825:1: warning: symbol 'dev_attr_perf_stats' was not declared. Should it be static? Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200918085951.44983-1-wangwensheng4@huawei.com
2020-10-06powerpc/64: make restore_interrupts 64e onlyNicholas Piggin
This is not used by 64s. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200915114650.3980244-5-npiggin@gmail.com
2020-10-06powerpc/64e: remove 64s specific interrupt soft-mask codeNicholas Piggin
Since the assembly soft-masking code was moved to 64e specific, there are some 64s specific interrupt types still there. Remove them. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200915114650.3980244-4-npiggin@gmail.com
2020-10-06powerpc/64e: remove PACA_IRQ_EE_EDGENicholas Piggin
This is not used anywhere. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200915114650.3980244-3-npiggin@gmail.com
2020-10-06powerpc/64: fix irq replay pt_regs->softe valueNicholas Piggin
Replayed interrupts get an "artificial" struct pt_regs constructed to pass to interrupt handler functions. This did not get the softe field set correctly, it's as though the interrupt has hit while irqs are disabled. It should be IRQS_ENABLED. This is possibly harmless, asynchronous handlers should not be testing if irqs were disabled, but it might be possible for example some code is shared with synchronous or NMI handlers, and it makes more sense if debug output looks at this. Fixes: 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in C") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200915114650.3980244-2-npiggin@gmail.com
2020-10-06powerpc/64: fix irq replay missing preemptNicholas Piggin
Prior to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in C"), replayed interrupts returned by the regular interrupt exit code, which performs preemption in case an interrupt had set need_resched. This logic was missed by the conversion. Adding preempt_disable/enable around the interrupt replay and final irq enable will reschedule if needed. Fixes: 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in C") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200915114650.3980244-1-npiggin@gmail.com
2020-10-06powerpc/pseries: add new branch prediction security bits for link stackNicholas Piggin
The hypervisor interface has defined branch prediction security bits for handling the link stack. Wire them up. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200825075612.224656-1-npiggin@gmail.com
2020-10-06powerpc/64s: Add cp_abort after tlbiel to invalidate copy-buffer addressNicholas Piggin
The copy buffer is implemented as a real address in the nest which is translated from EA by copy, and used for memory access by paste. This requires that it be invalidated by TLB invalidation. TLBIE does invalidate the copy buffer, but TLBIEL does not. Add cp_abort to the tlbiel sequence. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Fixup whitespace and comment formatting] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200916030234.4110379-2-npiggin@gmail.com
2020-10-06powerpc: untangle cputable mce includeNicholas Piggin
Having cputable.h include mce.h means it pulls in a bunch of low level headers (e.g., synch.h) which then can't use CPU_FTR_ definitions. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200916030234.4110379-1-npiggin@gmail.com
2020-10-06powerpc/powernv/elog: Fix race while processing OPAL error log event.Mahesh Salgaonkar
Every error log reported by OPAL is exported to userspace through a sysfs interface and notified using kobject_uevent(). The userspace daemon (opal_errd) then reads the error log and acknowledges the error log is saved safely to disk. Once acknowledged the kernel removes the respective sysfs file entry causing respective resources to be released including kobject. However it's possible the userspace daemon may already be scanning elog entries when a new sysfs elog entry is created by the kernel. User daemon may read this new entry and ack it even before kernel can notify userspace about it through kobject_uevent() call. If that happens then we have a potential race between elog_ack_store->kobject_put() and kobject_uevent which can lead to use-after-free of a kernfs object resulting in a kernel crash. eg: BUG: Unable to handle kernel data access on read at 0x6b6b6b6b6b6b6bfb Faulting instruction address: 0xc0000000008ff2a0 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV CPU: 27 PID: 805 Comm: irq/29-opal-elo Not tainted 5.9.0-rc2-gcc-8.2.0-00214-g6f56a67bcbb5-dirty #363 ... NIP kobject_uevent_env+0xa0/0x910 LR elog_event+0x1f4/0x2d0 Call Trace: 0x5deadbeef0000122 (unreliable) elog_event+0x1f4/0x2d0 irq_thread_fn+0x4c/0xc0 irq_thread+0x1c0/0x2b0 kthread+0x1c4/0x1d0 ret_from_kernel_thread+0x5c/0x6c This patch fixes this race by protecting the sysfs file creation/notification by holding a reference count on kobject until we safely send kobject_uevent(). The function create_elog_obj() returns the elog object which if used by caller function will end up in use-after-free problem again. However, the return value of create_elog_obj() function isn't being used today and there is no need as well. Hence change it to return void to make this fix complete. Fixes: 774fea1a38c6 ("powerpc/powernv: Read OPAL error log and export it through sysfs") Cc: stable@vger.kernel.org # v3.15+ Reported-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> [mpe: Rework the logic to use a single return, reword comments, add oops] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201006122051.190176-1-mpe@ellerman.id.au
2020-10-06MIPS: pgtable: Remove used PAGE_USERIO defineThomas Bogendoerfer
There are no users of PAGE_USERIO. Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2020-10-06MIPS: alchemy: Share prom_init implementationThomas Bogendoerfer
All boards have the same prom_init() function. Move it to common code and delete the duplicates. Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2020-10-06MIPS: alchemy: Fix build breakage, if TOUCHSCREEN_WM97XX is disabledThomas Bogendoerfer
Only include wm97xx touchscreen probing code, if driver is enabled. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2020-10-06x86/copy_mc: Introduce copy_mc_enhanced_fast_string()Dan Williams
The motivations to go rework memcpy_mcsafe() are that the benefit of doing slow and careful copies is obviated on newer CPUs, and that the current opt-in list of CPUs to instrument recovery is broken relative to those CPUs. There is no need to keep an opt-in list up to date on an ongoing basis if pmem/dax operations are instrumented for recovery by default. With recovery enabled by default the old "mcsafe_key" opt-in to careful copying can be made a "fragile" opt-out. Where the "fragile" list takes steps to not consume poison across cachelines. The discussion with Linus made clear that the current "_mcsafe" suffix was imprecise to a fault. The operations that are needed by pmem/dax are to copy from a source address that might throw #MC to a destination that may write-fault, if it is a user page. So copy_to_user_mcsafe() becomes copy_mc_to_user() to indicate the separate precautions taken on source and destination. copy_mc_to_kernel() is introduced as a non-SMAP version that does not expect write-faults on the destination, but is still prepared to abort with an error code upon taking #MC. The original copy_mc_fragile() implementation had negative performance implications since it did not use the fast-string instruction sequence to perform copies. For this reason copy_mc_to_kernel() fell back to plain memcpy() to preserve performance on platforms that did not indicate the capability to recover from machine check exceptions. However, that capability detection was not architectural and now that some platforms can recover from fast-string consumption of memory errors the memcpy() fallback now causes these more capable platforms to fail. Introduce copy_mc_enhanced_fast_string() as the fast default implementation of copy_mc_to_kernel() and finalize the transition of copy_mc_fragile() to be a platform quirk to indicate 'copy-carefully'. With this in place, copy_mc_to_kernel() is fast and recovery-ready by default regardless of hardware capability. Thanks to Vivek for identifying that copy_user_generic() is not suitable as the copy_mc_to_user() backend since the #MC handler explicitly checks ex_has_fault_handler(). Thanks to the 0day robot for catching a performance bug in the x86/copy_mc_to_user implementation. [ bp: Add the "why" for this change from the 0/2th message, massage. ] Fixes: 92b0729c34ca ("x86/mm, x86/mce: Add memcpy_mcsafe()") Reported-by: Erwin Tsaur <erwin.tsaur@intel.com> Reported-by: 0day robot <lkp@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Tested-by: Erwin Tsaur <erwin.tsaur@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/160195562556.2163339.18063423034951948973.stgit@dwillia2-desk3.amr.corp.intel.com
2020-10-06x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()Dan Williams
In reaction to a proposal to introduce a memcpy_mcsafe_fast() implementation Linus points out that memcpy_mcsafe() is poorly named relative to communicating the scope of the interface. Specifically what addresses are valid to pass as source, destination, and what faults / exceptions are handled. Of particular concern is that even though x86 might be able to handle the semantics of copy_mc_to_user() with its common copy_user_generic() implementation other archs likely need / want an explicit path for this case: On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote: > > > > However now I see that copy_user_generic() works for the wrong reason. > > It works because the exception on the source address due to poison > > looks no different than a write fault on the user address to the > > caller, it's still just a short copy. So it makes copy_to_user() work > > for the wrong reason relative to the name. > > Right. > > And it won't work that way on other architectures. On x86, we have a > generic function that can take faults on either side, and we use it > for both cases (and for the "in_user" case too), but that's an > artifact of the architecture oddity. > > In fact, it's probably wrong even on x86 - because it can hide bugs - > but writing those things is painful enough that everybody prefers > having just one function. Replace a single top-level memcpy_mcsafe() with either copy_mc_to_user(), or copy_mc_to_kernel(). Introduce an x86 copy_mc_fragile() name as the rename for the low-level x86 implementation formerly named memcpy_mcsafe(). It is used as the slow / careful backend that is supplanted by a fast copy_mc_generic() in a follow-on patch. One side-effect of this reorganization is that separating copy_mc_64.S to its own file means that perf no longer needs to track dependencies for its memcpy_64.S benchmarks. [ bp: Massage a bit. ] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: <stable@vger.kernel.org> Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com
2020-10-06dma-mapping: merge <linux/dma-noncoherent.h> into <linux/dma-map-ops.h>Christoph Hellwig
Move more nitty gritty DMA implementation details into the common internal header. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06dma-mapping: move dma-debug.h to kernel/dma/Christoph Hellwig
Most of dma-debug.h is not required by anything outside of kernel/dma. Move the four declarations needed by dma-mappin.h or dma-ops providers into dma-mapping.h and dma-map-ops.h, and move the remainder of the file to kernel/dma/debug.h. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06dma-mapping: remove <asm/dma-contiguous.h>Christoph Hellwig
Just provide a weak default definition of dma_contiguous_early_fixup and let arm override it. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06dma-mapping: merge <linux/dma-contiguous.h> into <linux/dma-map-ops.h>Christoph Hellwig
Merge dma-contiguous.h into dma-map-ops.h, after removing the comment describing the contiguous allocator into kernel/dma/contigous.c. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06dma-contiguous: remove dev_set_cma_areaChristoph Hellwig
dev_set_cma_area contains a trivial assignment. It has just three callers that all have a non-NULL device and depend on CONFIG_DMA_CMA, so remove the wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06dma-contiguous: remove dma_declare_contiguousChristoph Hellwig
dma_declare_contiguous is a trivial wrapper around dma_contiguous_reserve_area and just has a single caller. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-06dma-mapping: split <linux/dma-mapping.h>Christoph Hellwig
Split out all the bits that are purely for dma_map_ops implementations and related code into a new <linux/dma-map-ops.h> header so that they don't get pulled into all the drivers. That also means the architecture specific <asm/dma-mapping.h> is not pulled in by <linux/dma-mapping.h> any more, which leads to a missing includes that were pulled in by the x86 or arm versions in a few not overly portable drivers. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-05arc: include/asm: fix typos of "themselves"Randy Dunlap
Fix copy/paste spello of "themselves" in 3 places. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: linux-snps-arc@lists.infradead.org Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2020-10-05ARC: SMP: fix typo and use "come up" instead of "comeup"Mike Rapoport
When a secondary CPU fails to come up, there is a missing space in the log: Timeout: CPU1 FAILED to comeup !!! Fix it. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2020-10-05ARC: [dts] fix the errors detected by dtbs_checkZhen Lei
xxx/arc/boot/dts/axs101.dt.yaml: dw-apb-ictl@e0012000: $nodename:0: \ 'dw-apb-ictl@e0012000' does not match '^interrupt-controller(@[0-9a-f,]+)*$' From schema: xxx/interrupt-controller/snps,dw-apb-ictl.yaml The node name of the interrupt controller must start with "interrupt-controller" instead of "dw-apb-ictl". Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2020-10-05arc: plat-hsdk: fix kconfig dependency warning when !RESET_CONTROLLERNecip Fazil Yildiran
When ARC_SOC_HSDK is enabled and RESET_CONTROLLER is disabled, it results in the following Kbuild warning: WARNING: unmet direct dependencies detected for RESET_HSDK Depends on [n]: RESET_CONTROLLER [=n] && HAS_IOMEM [=y] && (ARC_SOC_HSDK [=y] || COMPILE_TEST [=n]) Selected by [y]: - ARC_SOC_HSDK [=y] && ISA_ARCV2 [=y] The reason is that ARC_SOC_HSDK selects RESET_HSDK without depending on or selecting RESET_CONTROLLER while RESET_HSDK is subordinate to RESET_CONTROLLER. Honor the kconfig menu hierarchy to remove kconfig dependency warnings. Fixes: a528629dfd3b ("ARC: [plat-hsdk] select CONFIG_RESET_HSDK from Kconfig") Signed-off-by: Necip Fazil Yildiran <fazilyildiran@gmail.com> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2020-10-05ARC: [plat-eznps]: Drop support for EZChip NPS platformVineet Gupta
NPS customers are no longer doing active development, as evident from rand config build failures reported in recent times, so drop support for NPS platform. Tested-by: kernel test robot <lkp@intel.com> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2020-10-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Rejecting non-native endian BTF overlapped with the addition of support for it. The rest were more simple overlapping changes, except the renesas ravb binding update, which had to follow a file move as well as a YAML conversion. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Make sure SKB control block is in the proper state during IPSEC ESP-in-TCP encapsulation. From Sabrina Dubroca. 2) Various kinds of attributes were not being cloned properly when we build new xfrm_state objects from existing ones. Fix from Antony Antony. 3) Make sure to keep BTF sections, from Tony Ambardar. 4) TX DMA channels need proper locking in lantiq driver, from Hauke Mehrtens. 5) Honour route MTU during forwarding, always. From Maciej Żenczykowski. 6) Fix races in kTLS which can result in crashes, from Rohit Maheshwari. 7) Skip TCP DSACKs with rediculous sequence ranges, from Priyaranjan Jha. 8) Use correct address family in xfrm state lookups, from Herbert Xu. 9) A bridge FDB flush should not clear out user managed fdb entries with the ext_learn flag set, from Nikolay Aleksandrov. 10) Fix nested locking of netdev address lists, from Taehee Yoo. 11) Fix handling of 32-bit DATA_FIN values in mptcp, from Mat Martineau. 12) Fix r8169 data corruptions on RTL8402 chips, from Heiner Kallweit. 13) Don't free command entries in mlx5 while comp handler could still be running, from Eran Ben Elisha. 14) Error flow of request_irq() in mlx5 is busted, due to an off by one we try to free and IRQ never allocated. From Maor Gottlieb. 15) Fix leak when dumping netlink policies, from Johannes Berg. 16) Sendpage cannot be performed when a page is a slab page, or the page count is < 1. Some subsystems such as nvme were doing so. Create a "sendpage_ok()" helper and use it as needed, from Coly Li. 17) Don't leak request socket when using syncookes with mptcp, from Paolo Abeni. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (111 commits) net/core: check length before updating Ethertype in skb_mpls_{push,pop} net: mvneta: fix double free of txq->buf net_sched: check error pointer in tcf_dump_walker() net: team: fix memory leak in __team_options_register net: typhoon: Fix a typo Typoon --> Typhoon net: hinic: fix DEVLINK build errors net: stmmac: Modify configuration method of EEE timers tcp: fix syn cookied MPTCP request socket leak libceph: use sendpage_ok() in ceph_tcp_sendpage() scsi: libiscsi: use sendpage_ok() in iscsi_tcp_segment_map() drbd: code cleanup by using sendpage_ok() to check page for kernel_sendpage() tcp: use sendpage_ok() to detect misused .sendpage nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage() net: add WARN_ONCE in kernel_sendpage() for improper zero-copy send net: introduce helper sendpage_ok() in include/linux/net.h net: usb: pegasus: Proper error handing when setting pegasus' MAC address net: core: document two new elements of struct net_device netlink: fix policy dump leak net/mlx5e: Fix race condition on nhe->n pointer in neigh update net/mlx5e: Fix VLAN create flow ...