summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2016-09-22cxgb4: add support for offloading u32 filtersRahul Lakkireddy
Add support for offloading u32 filter onto hardware. Links are stored in a jump table to perform necessary jumps to match TCP/UDP header. When inserting rules in the linked bucket, the TCP/UDP match fields in the corresponding entry of the jump table are appended to the filter rule before insertion. If a link is deleted, then all corresponding filters associated with the link are also deleted. Also enable hardware tc offload as a supported feature. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22cxgb4: add parser to translate u32 filters to internal specRahul Lakkireddy
Parse information sent by u32 into internal filter specification. Add support for parsing several fields in IPv4, IPv6, TCP, and UDP. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22cxgb4: add common api support for configuring filtersRahul Lakkireddy
Enable filters for non-offload configuration and add common api support for setting and deleting filters in LE-TCAM region of the hardware. IPv4 filters occupy one slot. IPv6 filters occupy 4 slots and must be on a 4-slot boundary. IPv4 filters can not occupy a slot belonging to IPv6 and the vice-versa is also true. Filters are set and deleted asynchronously. Use completion to wait for reply from firmware in order to allow for synchronization if needed. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22cxgb4: move common filter code to separate fileRahul Lakkireddy
Move common filter code to separate files. Also fix the following checkpatch checks. CHECK: Comparison to NULL could be written "!f->l2t" + if (f->l2t == NULL) { CHECK: spaces preferred around that '/' (ctx:VxV) + fwr->len16_pkd = htonl(FW_WR_LEN16_V(sizeof(*fwr)/16)); Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22net: skbuff: Coding: Use eth_type_vlan() instead of open coding itShmulik Ladkani
Fix 'skb_vlan_pop' to use eth_type_vlan instead of directly comparing skb->protocol to ETH_P_8021Q or ETH_P_8021AD. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Reviewed-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22net: skbuff: Remove errornous length validation in skb_vlan_pop()Shmulik Ladkani
In 93515d53b1 "net: move vlan pop/push functions into common code" skb_vlan_pop was moved from its private location in openvswitch to skbuff common code. In case skb has non hw-accel vlan tag, the original 'pop_vlan()' assured that skb->len is sufficient (if skb->len < VLAN_ETH_HLEN then pop was considered a no-op). This validation was moved as is into the new common 'skb_vlan_pop'. Alas, in its original location (openvswitch), there was a guarantee that 'data' points to the mac_header, therefore the 'skb->len < VLAN_ETH_HLEN' condition made sense. However there's no such guarantee in the generic 'skb_vlan_pop'. For short packets received in rx path going through 'skb_vlan_pop', this causes 'skb_vlan_pop' to fail pop-ing a valid vlan hdr (in the non hw-accel case) or to fail moving next tag into hw-accel tag. Remove the 'skb->len < VLAN_ETH_HLEN' condition entirely: It is superfluous since inner '__skb_vlan_pop' already verifies there are VLAN_ETH_HLEN writable bytes at the mac_header. Note this presents a slight change to skb_vlan_pop() users: In case total length is smaller than VLAN_ETH_HLEN, skb_vlan_pop() now returns an error, as opposed to previous "no-op" behavior. Existing callers (e.g. tc act vlan, ovs) usually drop the packet if 'skb_vlan_pop' fails. Fixes: 93515d53b1 ("net: move vlan pop/push functions into common code") Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Cc: Pravin Shelar <pshelar@ovn.org> Reviewed-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22Merge branch 'vlan_act_modify'David S. Miller
Shmulik Ladkani says: ==================== act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action TCA_VLAN_ACT_MODIFY allows one to change an existing tag. It accepts same attributes as TCA_VLAN_ACT_PUSH (protocol, id, priority). If packet is vlan tagged, then the tag gets overwritten according to user specified attributes. For example, this allows user to replace a tag's vid while preserving its priority bits (as opposed to "action vlan pop pipe action vlan push"). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan actionShmulik Ladkani
TCA_VLAN_ACT_MODIFY allows one to change an existing tag. It accepts same attributes as TCA_VLAN_ACT_PUSH (protocol, id, priority). If packet is vlan tagged, then the tag gets overwritten according to user specified attributes. For example, this allows user to replace a tag's vid while preserving its priority bits (as opposed to "action vlan pop pipe action vlan push"). Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22net: skbuff: Export __skb_vlan_popShmulik Ladkani
This exports the functionality of extracting the tag from the payload, without moving next vlan tag into hw accel tag. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21Merge branch 'mlx4-next'David S. Miller
Tariq Toukan says: ==================== mlx4 misc cleanups and improvements This patchset contains some cleanups and improvements from the team to the mlx4 Eth and core drivers. Series generated against net-next commit: 5a7a5555a362 'net sched: stylistic cleanups' ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net/mlx4_core: Fix deadlock when switching between polling and event fw commandsJack Morgenstein
When switching from polling-based fw commands to event-based fw commands, there is a race condition which could cause a fw command in another task to hang: that task will keep waiting for the polling sempahore, but may never be able to acquire it. This is due to mlx4_cmd_use_events, which "down"s the sempahore back to 0. During driver initialization, this is not a problem, since no other tasks which invoke FW commands are active. However, there is a problem if the driver switches to polling mode and then back to event mode during normal operation. The "test_interrupts" feature does exactly that. Running "ethtool -t <eth device> offline" causes the PF driver to temporarily switch to polling mode, and then back to event mode. (Note that for VF drivers, such switching is not performed). Fix this by adding a read-write semaphore for protection when switching between modes. Fixes: 225c7b1feef1 ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters") Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net/mlx4_core: Use RCU to perform radix tree lookup for SRQLeon Romanovsky
Radix tree lookup can be performed without locking. Fixes: 225c7b1feef1 ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Suggested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net/mlx4_en: Fix wrong indentationKamal Heib
Use tabs instead of spaces before if statement, no functional change. Fixes: e7c1c2c46201 ("mlx4_en: Added self diagnostics test implementation") Signed-off-by: Kamal Heib <kamalh@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net/mlx4_en: Add branch prediction hints in RX data-pathTariq Toukan
Add likely/unlikely hints to improve branch predictions in the RX data-path. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21Merge tag 'wireless-drivers-for-davem-2016-09-20' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers Kalle Valo says: ==================== wireless-drivers fixes for 4.8 iwlwifi * fix to prevent firmware crash when sending off-channel frames ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22rxrpc: Add per-peer RTT trackerDavid Howells
Add a function to track the average RTT for a peer. Sources of RTT data will be added in subsequent patches. The RTT data will be useful in the future for determining resend timeouts and for handling the slow-start part of the Rx protocol. Also add a pair of tracepoints, one to log transmissions to elicit a response for RTT purposes and one to log responses that contribute RTT data. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22rxrpc: Add re-sent Tx annotationDavid Howells
Add a Tx-phase annotation for packet buffers to indicate that a buffer has already been retransmitted. This will be used by future congestion management. Re-retransmissions of a packet don't affect the congestion window managment in the same way as initial retransmissions. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22rxrpc: Don't store the rxrpc header in the Tx queue sk_buffsDavid Howells
Don't store the rxrpc protocol header in sk_buffs on the transmit queue, but rather generate it on the fly and pass it to kernel_sendmsg() as a separate iov. This reduces the amount of storage required. Note that the security header is still stored in the sk_buff as it may get encrypted along with the data (and doesn't change with each transmission). Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-21btrfs: ensure that file descriptor used with subvol ioctls is a dirJeff Mahoney
If the subvol/snapshot create/destroy ioctls are passed a regular file with execute permissions set, we'll eventually Oops while trying to do inode->i_op->lookup via lookup_one_len. This patch ensures that the file descriptor refers to a directory. Fixes: cb8e70901d (Btrfs: Fix subvolume creation locking rules) Fixes: 76dda93c6a (Btrfs: add snapshot/subvolume destroy ioctl) Cc: <stable@vger.kernel.org> #v2.6.29+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-09-21Btrfs: handle quota reserve failure properlyJosef Bacik
btrfs/022 was spitting a warning for the case that we exceed the quota. If we fail to make our quota reservation we need to clean up our data space reservation. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Tested-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-09-21Merge branch 'bpf-hw-offload'David S. Miller
Jakub Kicinski says: ==================== BPF hardware offload (cls_bpf for now) Rebased and improved. v7: - fix patch 4. v6 (patch 8 only): - explicitly check for registers >= MAX_BPF_REG; - fix leaky error path. v5: - fix names of guard defines in bpf_verfier.h. v4: - rename parser -> analyzer; - reorganize the analyzer patches a bit; - use bitfield.h directly. --- merge blurb: In the last year a lot of progress have been made on offloading simpler TC classifiers. There is also growing interest in using BPF for generic high-speed packet processing in the kernel. It seems beneficial to tie those two trends together and think about hardware offloads of BPF programs. This patch set presents such offload to Netronome smart NICs. cls_bpf is extended with hardware offload capabilities and NFP driver gets a JIT translator which in presence of capable firmware can be used to offload the BPF program onto the card. BPF JIT implementation is not 100% complete (e.g. missing instructions) but it is functional. Encouragingly it should be possible to offload most (if not all) advanced BPF features onto the NIC - including packet modification, maps, tunnel encap/decap etc. Example of basic tests I used: __section_cls_entry int cls_entry(struct __sk_buff *skb) { if (load_byte(skb, 0) != 0x0) return 0; if (load_byte(skb, 4) != 0x1) return 0; skb->mark = 0xcafe; if (load_byte(skb, 50) != 0xff) return 0; return ~0U; } Above code can be compiled with Clang and loaded like this: ethtool -K p1p1 hw-tc-offload on tc qdisc add dev p1p1 ingress tc filter add dev p1p1 parent ffff: bpf obj prog.o action drop This set implements the basic transparent offload, the skip_{sw,hw} flags and reporting statistics for cls_bpf. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21nfp: bpf: add offload of TC direct action modeJakub Kicinski
Add offload of TC in direct action mode. We just need to provide appropriate checks in the verifier and a new outro block to translate the exit codes to what data path expects Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21nfp: bpf: add support for legacy redirect actionJakub Kicinski
Data path has redirect support so expressing redirect to the port frame came from is a trivial matter of setting the right result code. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net: act_mirred: allow statistic updates from offloaded actionsJakub Kicinski
Implement .stats_update() callback. The implementation is generic and can be reused by other simple actions if needed. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21nfp: bpf: add packet marking supportJakub Kicinski
Add missing ABI defines and eBPF instructions to allow mark to be passed on and extend prepend parsing on the RX path to pick it up from packet metadata. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21nfp: bpf: allow offloaded filters to update statsJakub Kicinski
Periodically poll stats and call into offloaded actions to update them. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net: cls_bpf: allow offloaded filters to update statsJakub Kicinski
Call into offloaded filters to update stats. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21nfp: bpf: add hardware bpf offloadJakub Kicinski
Add hardware bpf offload on our smart NICs. Detect if capable firmware is loaded and use it to load the code JITed with just added translator onto programmable engines. This commit only supports offloading cls_bpf in legacy mode (non-direct action). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21nfp: add BPF to NFP code translatorJakub Kicinski
Add translator for JITing eBPF to operations which can be executed on NFP's programmable engines. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21bpf: recognize 64bit immediate loads as constsJakub Kicinski
When running as parser interpret BPF_LD | BPF_IMM | BPF_DW instructions as loading CONST_IMM with the value stored in imm. The verifier will continue not recognizing those due to concerns about search space/program complexity increase. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21bpf: enable non-core use of the verfierJakub Kicinski
Advanced JIT compilers and translators may want to use eBPF verifier as a base for parsers or to perform custom checks and validations. Add ability for external users to invoke the verifier and provide callbacks to be invoked for every intruction checked. For now only add most basic callback for per-instruction pre-interpretation checks is added. More advanced users may also like to have per-instruction post callback and state comparison callback. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21bpf: expose internal verfier structuresJakub Kicinski
Move verifier's internal structures to a header file and prefix their names with bpf_ to avoid potential namespace conflicts. Those structures will soon be used by external analyzers. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21bpf: don't (ab)use instructions to store stateJakub Kicinski
Storing state in reserved fields of instructions makes it impossible to run verifier on programs already marked as read-only. Allocate and use an array of per-instruction state instead. While touching the error path rename and move existing jump target. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net: cls_bpf: add support for marking filters as hardware-onlyJakub Kicinski
Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_SW flag. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net: cls_bpf: limit hardware offload by software-only flagJakub Kicinski
Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_HW flag. Unlike U32 and flower cls_bpf already has some netlink flags defined. Create a new attribute to be able to use the same flag values as the above. Unlike U32 and flower reject unknown flags. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net: cls_bpf: add hardware offloadJakub Kicinski
This patch adds hardware offload capability to cls_bpf classifier, similar to what have been done with U32 and flower. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21docs: Remove space-before-label guidance from CodingStyleJonathan Corbet
Recent discussion has made it clear that there is no community consensus on this particular rule. Remove it now, lest it inspire yet another set of unwanted "cleanup" patches. This partially reverts 865a1caa4b6b (CodingStyle: Clarify and complete chapter 7). Cc: Jean Delvare <jdelvare@suse.de> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2016-09-21docs-rst: add inter-document cross referencesMauro Carvalho Chehab
Add cross references for the development process documents that were converted to ReST: Documentation/SubmitChecklist Documentation/SubmittingDrivers Documentation/SubmittingPatches Documentation/development-process/development-process.rst Documentation/stable_kernel_rules.txt Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2016-09-21Documentation/email-clients.txt: convert it to ReST markupMauro Carvalho Chehab
As this file is mentioned at the development-process/ book, let's convert it to ReST markup. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2016-09-21i2c: mux: pca954x: retry updating the mux selection on failurePeter Rosin
The cached value of the last selected channel prevents retries on the next call, even on failure to update the selected channel. Fix that. Signed-off-by: Peter Rosin <peda@axentia.se> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org
2016-09-21clk: nxp: clk-lpc32xx: Unmap region obtained by of_iomapArvind Yadav
Free memory mapping, if lpc32xx_clk_init is not successful. Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Acked-by: Vladimir Zapolskiy <vz@mleia.com> Acked-by: Sylvain Lemieux <slemieux.tyco@gmail.com> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
2016-09-21acpi: Validate processor id when mapping the processorDou Liyang
When we want to identify whether the proc_id is unreasonable or not, we can call the "acpi_processor_validate_proc_id" function. It will search in the duplicate IDs. If we find the proc_id in the IDs, we return true to the call function. Conversely, the false represents available. When we establish all possible cpuid <-> nodeid mapping to handle the cpu hotplugs, we will use the proc_id from ACPI table. We do validation when we get the proc_id. If the result is true, we will stop the mapping. [ tglx: Mark the new function __init ] Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: mika.j.penttila@gmail.com Cc: len.brown@intel.com Cc: rafael@kernel.org Cc: rjw@rjwysocki.net Cc: yasu.isimatu@gmail.com Cc: linux-mm@kvack.org Cc: linux-acpi@vger.kernel.org Cc: isimatu.yasuaki@jp.fujitsu.com Cc: gongzhaogang@inspur.com Cc: tj@kernel.org Cc: izumi.taku@jp.fujitsu.com Cc: cl@linux.com Cc: chen.tang@easystack.cn Cc: akpm@linux-foundation.org Cc: kamezawa.hiroyu@jp.fujitsu.com Cc: lenb@kernel.org Link: http://lkml.kernel.org/r/1472114120-3281-8-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-21acpi: Provide mechanism to validate processors in the ACPI tablesDou Liyang
[Problem] When we set cpuid <-> nodeid mapping to be persistent, it will use the DSDT As we know, the ACPI tables are just like user's input in that respect, and we don't crash if user's input is unreasonable. Such as, the mapping of the proc_id and pxm in some machine's ACPI table is like this: proc_id | pxm -------------------- 0 <-> 0 1 <-> 0 2 <-> 1 3 <-> 1 89 <-> 0 89 <-> 0 89 <-> 0 89 <-> 1 89 <-> 1 89 <-> 2 89 <-> 3 ..... We can't be sure which one is correct to the proc_id 89. We may map a wrong node to a cpu. When pages are allocated, this may cause a kernal panic. So, we should provide mechanisms to validate the ACPI tables, just like we do validation to check user's input in web project. The mechanism is that the processor objects which have the duplicate IDs are not valid. [Solution] We add a validation function, like this: foreach Processor in DSDT proc_id = get_ACPI_Processor_number(Processor) if (proc_id exists ) mark both of them as being unreasonable; The function will record the unique or duplicate processor IDs. The duplicate processor IDs such as 89 are regarded as the unreasonable IDs which mean that the processor objects in question are not valid. [ tglx: Add __init[data] annotations ] Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: mika.j.penttila@gmail.com Cc: len.brown@intel.com Cc: rafael@kernel.org Cc: rjw@rjwysocki.net Cc: yasu.isimatu@gmail.com Cc: linux-mm@kvack.org Cc: linux-acpi@vger.kernel.org Cc: isimatu.yasuaki@jp.fujitsu.com Cc: gongzhaogang@inspur.com Cc: tj@kernel.org Cc: izumi.taku@jp.fujitsu.com Cc: cl@linux.com Cc: chen.tang@easystack.cn Cc: akpm@linux-foundation.org Cc: kamezawa.hiroyu@jp.fujitsu.com Cc: lenb@kernel.org Link: http://lkml.kernel.org/r/1472114120-3281-7-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-21x86/acpi: Set persistent cpuid <-> nodeid mapping when bootingGu Zheng
The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that, when node online/offline happens, cache based on cpuid <-> nodeid mapping such as wq_numa_possible_cpumask will not cause any problem. It contains 4 steps: 1. Enable apic registeration flow to handle both enabled and disabled cpus. 2. Introduce a new array storing all possible cpuid <-> apicid mapping. 3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' apicid. 4. Establish all possible cpuid <-> nodeid mapping. This patch finishes step 4. This patch set the persistent cpuid <-> nodeid mapping for all enabled/disabled processors at boot time via an additional acpi namespace walk for processors. [ tglx: Remove the unneeded exports ] Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: mika.j.penttila@gmail.com Cc: len.brown@intel.com Cc: rafael@kernel.org Cc: rjw@rjwysocki.net Cc: yasu.isimatu@gmail.com Cc: linux-mm@kvack.org Cc: linux-acpi@vger.kernel.org Cc: isimatu.yasuaki@jp.fujitsu.com Cc: gongzhaogang@inspur.com Cc: tj@kernel.org Cc: izumi.taku@jp.fujitsu.com Cc: cl@linux.com Cc: chen.tang@easystack.cn Cc: akpm@linux-foundation.org Cc: kamezawa.hiroyu@jp.fujitsu.com Cc: lenb@kernel.org Link: http://lkml.kernel.org/r/1472114120-3281-6-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-21x86/acpi: Enable MADT APIs to return disabled apicidsGu Zheng
The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that, when node online/offline happens, cache based on cpuid <-> nodeid mapping such as wq_numa_possible_cpumask will not cause any problem. It contains 4 steps: 1. Enable apic registeration flow to handle both enabled and disabled cpus. 2. Introduce a new array storing all possible cpuid <-> apicid mapping. 3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' apicid. 4. Establish all possible cpuid <-> nodeid mapping. This patch finishes step 3. There are four mappings in the kernel: 1. nodeid (logical node id) <-> pxm (persistent) 2. apicid (physical cpu id) <-> nodeid (persistent) 3. cpuid (logical cpu id) <-> apicid (not persistent, now persistent by step 2) 4. cpuid (logical cpu id) <-> nodeid (not persistent) So, in order to setup persistent cpuid <-> nodeid mapping for all possible CPUs, we should: 1. Setup cpuid <-> apicid mapping for all possible CPUs, which has been done in step 1, 2. 2. Setup cpuid <-> nodeid mapping for all possible CPUs. But before that, we should obtain all apicids from MADT. All processors' apicids can be obtained by _MAT method or from MADT in ACPI. The current code ignores disabled processors and returns -ENODEV. After this patch, a new parameter will be added to MADT APIs so that caller is able to control if disabled processors are ignored. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: mika.j.penttila@gmail.com Cc: len.brown@intel.com Cc: rafael@kernel.org Cc: rjw@rjwysocki.net Cc: yasu.isimatu@gmail.com Cc: linux-mm@kvack.org Cc: linux-acpi@vger.kernel.org Cc: isimatu.yasuaki@jp.fujitsu.com Cc: gongzhaogang@inspur.com Cc: tj@kernel.org Cc: izumi.taku@jp.fujitsu.com Cc: cl@linux.com Cc: chen.tang@easystack.cn Cc: akpm@linux-foundation.org Cc: kamezawa.hiroyu@jp.fujitsu.com Cc: lenb@kernel.org Link: http://lkml.kernel.org/r/1472114120-3281-5-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-21x86/acpi: Introduce persistent storage for cpuid <-> apicid mappingGu Zheng
The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that, when node online/offline happens, cache based on cpuid <-> nodeid mapping such as wq_numa_possible_cpumask will not cause any problem. It contains 4 steps: 1. Enable apic registeration flow to handle both enabled and disabled cpus. 2. Introduce a new array storing all possible cpuid <-> apicid mapping. 3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' apicid. 4. Establish all possible cpuid <-> nodeid mapping. This patch finishes step 2. In this patch, we introduce a new static array named cpuid_to_apicid[], which is large enough to store info for all possible cpus. And then, we modify the cpuid calculation. In generic_processor_info(), it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid mapping changes with node hotplug. After this patch, we find the next unused cpuid, map it to an apicid, and store the mapping in cpuid_to_apicid[], so that cpuid <-> apicid mapping will be persistent. And finally we will use this array to make cpuid <-> nodeid persistent. cpuid <-> apicid mapping is established at local apic registeration time. But non-present or disabled cpus are ignored. In this patch, we establish all possible cpuid <-> apicid mapping when registering local apic. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: mika.j.penttila@gmail.com Cc: len.brown@intel.com Cc: rafael@kernel.org Cc: rjw@rjwysocki.net Cc: yasu.isimatu@gmail.com Cc: linux-mm@kvack.org Cc: linux-acpi@vger.kernel.org Cc: isimatu.yasuaki@jp.fujitsu.com Cc: gongzhaogang@inspur.com Cc: tj@kernel.org Cc: izumi.taku@jp.fujitsu.com Cc: cl@linux.com Cc: chen.tang@easystack.cn Cc: akpm@linux-foundation.org Cc: kamezawa.hiroyu@jp.fujitsu.com Cc: lenb@kernel.org Link: http://lkml.kernel.org/r/1472114120-3281-4-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-21x86/acpi: Enable acpi to register all possible cpus at boot timeGu Zheng
cpuid <-> nodeid mapping is firstly established at boot time. And workqueue caches the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time. When doing node online/offline, cpuid <-> nodeid mapping is established/destroyed, which means, cpuid <-> nodeid mapping will change if node hotplug happens. But workqueue does not update wq_numa_possible_cpumask. So here is the problem: Assume we have the following cpuid <-> nodeid in the beginning: Node | CPU ------------------------ node 0 | 0-14, 60-74 node 1 | 15-29, 75-89 node 2 | 30-44, 90-104 node 3 | 45-59, 105-119 and we hot-remove node2 and node3, it becomes: Node | CPU ------------------------ node 0 | 0-14, 60-74 node 1 | 15-29, 75-89 and we hot-add node4 and node5, it becomes: Node | CPU ------------------------ node 0 | 0-14, 60-74 node 1 | 15-29, 75-89 node 4 | 30-59 node 5 | 90-119 But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like. When a pool workqueue is initialized, if its cpumask belongs to a node, its pool->node will be mapped to that node. And memory used by this workqueue will also be allocated on that node. static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs){ ... /* if cpumask is contained inside a NUMA node, we belong to that node */ if (wq_numa_enabled) { for_each_node(node) { if (cpumask_subset(pool->attrs->cpumask, wq_numa_possible_cpumask[node])) { pool->node = node; break; } } } Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline node, which will lead to memory allocation failure: SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0 node 0: slabs: 6172, objs: 259224, free: 245741 node 1: slabs: 3261, objs: 136962, free: 127656 It happens here: create_worker(struct worker_pool *pool) |--> worker = alloc_worker(pool->node); static struct worker *alloc_worker(int node) { struct worker *worker; worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, useing the wrong node. ...... return worker; } [Solution] There are four mappings in the kernel: 1. nodeid (logical node id) <-> pxm 2. apicid (physical cpu id) <-> nodeid 3. cpuid (logical cpu id) <-> apicid 4. cpuid (logical cpu id) <-> nodeid 1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> pxm mapping is setup at boot time. This mapping is persistent, won't change. 2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at boot time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is also persistent. 3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is allocated, lower ids first, and released at CPU hotremove time, reused for other hotadded CPUs. So this mapping is not persistent. 4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and cleared at CPU hotremove time. As a result of 3, this mapping is not persistent. To fix this problem, we establish cpuid <-> nodeid mapping for all the possible cpus at boot time, and make it persistent. And according to init_cpu_to_node(), cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> apicid mapping. So the key point is obtaining all cpus' apicid. apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in MADT (Multiple APIC Description Table). So we finish the job in the following steps: 1. Enable apic registeration flow to handle both enabled and disabled cpus. This is done by introducing an extra parameter to generic_processor_info to let the caller control if disabled cpus are ignored. 2. Introduce a new array storing all possible cpuid <-> apicid mapping. And also modify the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping when registering local apic. Store the mapping in this array. 3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' apicid. This is also done by introducing an extra parameter to these apis to let the caller control if disabled cpus are ignored. 4. Establish all possible cpuid <-> nodeid mapping. This is done via an additional acpi namespace walk for processors. This patch finished step 1. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: mika.j.penttila@gmail.com Cc: len.brown@intel.com Cc: rafael@kernel.org Cc: rjw@rjwysocki.net Cc: yasu.isimatu@gmail.com Cc: linux-mm@kvack.org Cc: linux-acpi@vger.kernel.org Cc: isimatu.yasuaki@jp.fujitsu.com Cc: gongzhaogang@inspur.com Cc: tj@kernel.org Cc: izumi.taku@jp.fujitsu.com Cc: cl@linux.com Cc: chen.tang@easystack.cn Cc: akpm@linux-foundation.org Cc: kamezawa.hiroyu@jp.fujitsu.com Cc: lenb@kernel.org Link: http://lkml.kernel.org/r/1472114120-3281-3-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-21x86/numa: Online memory-less nodes at boot timeTang Chen
For now, x86 does not support memory-less node. A node without memory will not be onlined, and the cpus on it will be mapped to the other online nodes with memory in init_cpu_to_node(). The reason of doing this is to ensure each cpu has mapped to a node with memory, so that it will be able to allocate local memory for that cpu. But we don't have to do it in this way. In this series of patches, we are going to construct cpu <-> node mapping for all possible cpus at boot time, which is a persistent mapping. It means that the cpu will be mapped to the node which it belongs to, and will never be changed. If a node has only cpus but no memory, the cpus on it will be mapped to a memory-less node. And the memory-less node should be onlined. Allocate pgdats for all memory-less nodes and online them at boot time. Then build zonelists for these nodes. As a result, when cpus on these memory-less nodes try to allocate memory from local node, it will automatically fall back to the proper zones in the zonelists. Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: mika.j.penttila@gmail.com Cc: len.brown@intel.com Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: rafael@kernel.org Cc: rjw@rjwysocki.net Cc: yasu.isimatu@gmail.com Cc: linux-mm@kvack.org Cc: linux-acpi@vger.kernel.org Cc: isimatu.yasuaki@jp.fujitsu.com Cc: gongzhaogang@inspur.com Cc: tj@kernel.org Cc: izumi.taku@jp.fujitsu.com Cc: cl@linux.com Cc: chen.tang@easystack.cn Cc: akpm@linux-foundation.org Cc: kamezawa.hiroyu@jp.fujitsu.com Cc: lenb@kernel.org Link: http://lkml.kernel.org/r/1472114120-3281-2-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-21ftrace/scripts: Add helper script to bisect function tracing problem functionsSteven Rostedt (Red Hat)
Every so often, with a special config or a architecture change, running function or function_graph tracing can cause the machien to hard reboot, crash, or simply hard lockup. There's some functions in the function graph tracer that can not be traced otherwise it causes the function tracer to recurse before the recursion protection mechanisms are in place. When this occurs, using the dynamic ftrace featuer that allows limiting what actually gets traced can be used to bisect down to the problem function. This adds a script that helps with this process in the scripts/tracing directory, called ftrace-bisect.sh The set up is to read all the functions that can be traced from available_filter_functions into a file (full_file). Then run this script passing it the full_file and a "test_file" and "non_test_file", where the test_file will be add to set_ftrace_filter. What ftarce_bisect.sh does, is to copy half of the functions in full_file into the test_file and the other half into the non_test_file. This way, one can cat the test_file into the set_ftrace_filter functions and only test the functions that are in that file. If it works, then we run the process again after copying non_test_file to full_file and repeating the process. If the system crashed, then the bad function is in the test_file and after a reboot, the test_file becomes the new full_file in the next iteration. When we get down to a single function in the full_file, then ftrace_bisect.sh will report that as the bad function. Full documentation of how to use this simple script is within the script file itself. Link: http://lkml.kernel.org/r/20160920100716.131d3647@gandalf.local.home Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-09-21ARM: sa1111: provide to_sa1111_device() macroRussell King
Provide a nicer to_sa1111_device macro to convert a struct device to a sa1111_dev. We will need this for drivers when converting them to dev_pm_ops, or removing shutdown methods. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>