summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-10-11net: ena: explicit casting and initialization, and clearer error handlingArthur Kiyanovski
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11net: ena: use CSUM_CHECKED device indication to report skb's checksum statusArthur Kiyanovski
Set skb->ip_summed to the correct value as reported by the device. Add counter for the case where rx csum offload is enabled but device didn't check it. Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11net: ena: add functions for handling Low Latency Queues in ena_netdevArthur Kiyanovski
This patch includes all code changes necessary in ena_netdev to enable packet sending via the LLQ placemnt mode. Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11net: ena: add functions for handling Low Latency Queues in ena_comArthur Kiyanovski
This patch introduces APIs for detection, initialization, configuration and actual usage of low latency queues(LLQ). It extends transmit API with creation of LLQ descriptors in device memory (which include host buffers descriptors as well as packet header) Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11net: ena: introduce Low Latency Queues data structures according to ENA specArthur Kiyanovski
Low Latency Queues(LLQ) allow usage of device's memory for descriptors and headers. Such queues decrease processing time since data is already located on the device when driver rings the doorbell. Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11net: ena: complete host info to match latest ENA specArthur Kiyanovski
Add new fields and definitions to host info and fill them according to the latest ENA spec version. Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11net: ena: minor performance improvementArthur Kiyanovski
Reduce fastpath overhead by making ena_com_tx_comp_req_id_get() inline. Also move it to ena_eth_com.h file with its dependency function ena_com_cq_inc_head(). Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11Merge tag 'alloc-args-v4.19-rc8' of ↵Greg Kroah-Hartman
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Kees writes: "Fix open-coded multiplication arguments to allocators - Fixes several new open-coded multiplications added in the 4.19 merge window." * tag 'alloc-args-v4.19-rc8' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: treewide: Replace more open-coded allocation size multiplications
2018-10-11Merge branch 'mlxsw-Preparations-for-VxLAN-support'David S. Miller
Ido Schimmel says: ==================== mlxsw: Preparations for VxLAN support This patchset prepares mlxsw for VxLAN support. It contains small and mostly non-functional changes. The first eight patches perform small changes in the code to make it more receptive towards the actual VxLAN changes in the next patchset. Patches 9-17 add the registers used to configure the device for VxLAN offload. Last two patches add the required resources and trap IDs. The next patchset is available here [1]. 1. https://github.com/idosch/linux/tree/vxlan ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: spectrum: Add NVE packet trapsIdo Schimmel
The DECAP_ECN0 trap will be used to trap packets where the overlay packet is marked with Non-ECT, but the underlay packet is marked with either ECT(0), ECT(1) or CE. When trapped, such packets will be counted as errors by the VxLAN driver and thus provide better visibility. The NVE_ENCAP_ARP trap will be used to trap ARP packets undergoing NVE encapsulation. This is needed in order to support E-VPN ARP suppression, where the Linux bridge does not flood ARP packets through tunnel ports in case it can answer the ARP request itself. Note that all the packets trapped via these traps are marked with 'offload_fwd_mark', so as to not be re-flooded by the Linux bridge through the ASIC ports. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: resources: Add NVE resourcesIdo Schimmel
Add the following resources to be used by the NVE code: * Number of IPv4 underlay destination IPs in a single TNUMT record * Number of IPv6 underlay destination IPs in a single TNUMT record Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: reg: Add Monitoring Parsing State RegisterIdo Schimmel
This register is used for setting up the parsing for hash, policy-engine and routing. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: reg: Add definition of unicast tunnel record for SFD registerIdo Schimmel
Will be used to program the device with FDB records pointing to a NVE tunnel. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: reg: Add Tunneling NVE QoS Default RegisterIdo Schimmel
The TNQDR register configures the default QoS settings for NVE encapsulation. It will be used to set the default DSCP of each port to 0, so that when DSCP is set to inherit and the overlay packet does not have an IP header the outer DSCP will be set to 0, in accordance with the software data path. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: reg: Add Tunneling NVE QoS Configuration RegisterIdo Schimmel
The register configures how QoS is set in Encapsulation into the underlay network. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: reg: Add Tunneling NVE Decapsulation ECN Mapping RegisterIdo Schimmel
This register configures the actions that are done during NVE decapsulation based on the ECN bits. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: reg: Add Tunneling NVE Encapsulation ECN Mapping RegisterIdo Schimmel
This register performs mapping from overlay ECN to underlay ECN during NVE encapsulation. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: reg: Add Tunneling NVE Underlay Multicast Table RegisterIdo Schimmel
This register builds the linked list of underlay destination IPs used for BUM traffic on the overlay. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: reg: Add Tunnel Port Configuration RegisterIdo Schimmel
This register enables / disables learning on different types of tunnel ports (e.g., NVE, VPLS). Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: reg: Add Tunneling NVE General Configuration RegisterIdo Schimmel
This register configures global NVE configuration such as source IP of the NVE tunnel and UDP source port calculation. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: spectrum: Seed LAG hash functionIdo Schimmel
Currently, the seed of the LAG hash function is always set to 0, which means it is identical across all switches. Instead, use a random number. This is especially important now that VxLAN is supported, as the LAG hash function is used to calculate the UDP source port of the encapsulated packet. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: reg: Extend FDB flush types for NVEIdo Schimmel
The device has the ability to flush all the FDB records that perform NVE encapsulation or only a subset of these with a specific filtering identifier (FID). Expose these types so that they could be used by subsequent patches where we need to flush the FDB records when an NVE device is unlinked from a bridge (FID). Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: spectrum: Add a new type of KVD linear recordIdo Schimmel
When the device needs to flood an overlay packet to remote VTEPs it retrieves a pointer to the head of a linked-list of records that store the IP addresses of these VTEPs. These records are stored in the KVD linear memory and configured via the Tunneling NVE Underlay Multicast Table (TNUMT) register. Add a new KVD linear entry type for these records, so that we will be able to allocate and free them. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: spectrum: Move L3 protocol and address definitions to global header fileIdo Schimmel
The L3 protocol and address definitions are going to be used by the NVE code, so move them to the global header file from the one private to the router. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: spectrum_switchdev: Do not assume notifier information typeIdo Schimmel
VxLAN notifications are going to use a different notifier information type, so cast to the correct type based on the received event. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: spectrum_switchdev: Check notification relevance based on upper deviceIdo Schimmel
VxLAN FDB updates are sent with the VxLAN device which is not our upper and will therefore be ignored by current code. Solve this by checking whether the upper device (bridge) is our upper. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: spectrum_switchdev: Prepare for VxLAN FDB notificationsIdo Schimmel
VxLAN FDB notifications need to be handled differently than bridge FDB notifications, so initialize the work item based on the received notification and rename the invoked function accordingly. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11mlxsw: spectrum: Remove misuses of private header fileIdo Schimmel
The spectrum_router.h header file is private to the router block and should only be included by direct consumers of it, such as dpipe and the multicast routing code. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11octeontx2-af: Remove set but not used variable 'dev'YueHaibing
Fixes gcc '-Wunused-but-set-variable' warning: drivers/net/ethernet/marvell/octeontx2/af/cgx.c: In function 'cgx_fwi_event_handler': drivers/net/ethernet/marvell/octeontx2/af/cgx.c:257:17: warning: variable 'dev' set but not used [-Wunused-but-set-variable] It never be used since introduction in commit 1463f382f58d ("octeontx2-af: Add support for CGX link management") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11net: mscc: allow extracting the FCS into the skbAntoine Tenart
This patch adds support for the NETIF_F_RXFCS feature in the Mscc Ethernet driver. This feature is disabled by default and allow a user to request the driver not to drop the FCS and to extract it into the skb for debugging purposes. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-11perf vendor events intel: Fix wrong filter_band* values for uncore eventsJiri Olsa
Michael reported that he could not stat following event: $ perf stat -e unc_p_freq_ge_1200mhz_cycles -a -- ls event syntax error: '..e_1200mhz_cycles' \___ value too big for format, maximum is 255 Run 'perf list' for a list of valid events The event is unwrapped into: uncore_pcu/event=0xb,filter_band0=1200/ where filter_band0 format says it's one byte only: # cat uncore_pcu/format/filter_band0 config1:0-7 while JSON files specifies bigger number: "Filter": "filter_band0=1200", all the filter_band* formats show 1 byte width: # cat uncore_pcu/format/filter_band1 config1:8-15 # cat uncore_pcu/format/filter_band2 config1:16-23 # cat uncore_pcu/format/filter_band3 config1:24-31 The reason of the issue is that filter_band* values are supposed to be in 100Mhz units.. it's stated in the JSON help for the events, like: filter_band3=XXX, with XXX in 100Mhz units This patch divides the filter_band* values by 100, plus there's couple of changes that actually change the number completely, like: - "Filter": "edge=1,filter_band2=4000", + "Filter": "edge=1,filter_band2=30", Reported-by: Michael Petlan <mpetlan@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20181010080339.GB15790@krava Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-11mac80211: Extend SAE authentication in infra BSS STA modeJouni Malinen
Previous implementation of SAE authentication in infrastructure BSS was somewhat restricting and not exactly clean way of handling the two auth() operations. This ended up removing and re-adding the STA entry for the AP in the middle of authentication and also messing up authentication state tracking through the sequence of four Authentication frames. Furthermore, this did not work if the AP ended up sending out SAE Confirm (auth trans #2) immediately after SAE Commit (auth trans #1) before the station had time to transmit its SAE Confirm. Clean up authentication state handling for the SAE case to allow two rounds of auth() calls without dropping all state between those operations. Track peer Confirmed status and mark authentication completed only once both ends have confirmed. ieee80211_mgd_auth() check for EBUSY cases is now handling only the pending association (ifmgd->assoc_data) while all pending authentication (ifmgd->auth_data) cases are allowed to proceed to allow user space to start a new connection attempt from scratch even if the previously requested authentication is still waiting completion. This is needed to avoid making SAE error cases with retries take excessive amount of time with no means for the user space to stop that (apart from setting the netdev down). As an extra bonus, the end of ieee80211_rx_mgmt_auth() can be cleaned up to avoid the extra copy of the cfg80211_rx_mlme_mgmt() call for ongoing SAE authentication since the new ieee80211_mark_sta_auth() helper function can handle both completion of authentication and updates to the STA entry under the same condition and there is no need to return from the function between those operations. Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11mac80211: Move ieee80211_mgd_auth() EBUSY check to be before allocationJouni Malinen
This makes it easier to conditionally replace full allocation of auth_data to use reallocation for the case of continuing SAE authentication. Furthermore, there was not really any point in having this check done so late in the function after having already completed number of steps that cannot be used anyway in the error case. Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11mac80211: Helper function for marking STA authenticatedJouni Malinen
Authentication exchange can be completed in both TX and RX paths for SAE, so move this common functionality into a helper function to avoid having to implement practically the same operations in two places when extending SAE implementation in the following commits. Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11mac80211: rc80211_minstrel: remove variance / stddev calculationFelix Fietkau
When there are few packets (e.g. for sampling attempts), the exponentially weighted variance is usually vastly overestimated, making the resulting data essentially useless. As far as I know, there has not been any practical use for this, so let's not waste any cycles on it. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11mac80211: minstrel: do not sample rates 3 times slower than max_prob_rateFelix Fietkau
These rates are highly unlikely to be used quickly, even if the link deteriorates rapidly. This improves throughput in cases where CCK rates are not reliable enough to be skipped entirely during sampling. Sampling these rates regularly can cost a lot of airtime. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11mac80211: minstrel: fix sampling/reporting of CCK rates in HT modeFelix Fietkau
Long/short preamble selection cannot be sampled separately, since it depends on the BSS state. Because of that, sampling attempts to currently not used preamble modes are not counted in the statistics, which leads to CCK rates being sampled too often. Fix statistics accounting for long/short preamble by increasing the index where necessary. Fix excessive CCK rate sampling by dropping unsupported sample attempts. This improves throughput on 2.4 GHz channels Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11mac80211: minstrel: fix CCK rate group streams valueFelix Fietkau
Fixes a harmless underflow issue when CCK rates are actively being used Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11mac80211: minstrel: fix using short preamble CCK rates on HT clientsFelix Fietkau
mi->supported[MINSTREL_CCK_GROUP] needs to be updated short preamble rates need to be marked as supported regardless of whether it's currently enabled. Its state can change at any time without a rate_update call. Fixes: 782dda00ab8e ("mac80211: minstrel_ht: move short preamble check out of get_rate") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11mac80211: minstrel: reduce minstrel_mcs_groups sizeFelix Fietkau
By storing a shift value for all duration values of a group, we can reduce precision by a neglegible amount to make it fit into a u16 value. This improves cache footprint and reduces size: Before: text data bss dec hex filename 10024 116 0 10140 279c rc80211_minstrel_ht.o After: text data bss dec hex filename 9368 116 0 9484 250c rc80211_minstrel_ht.o Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11mac80211: minstrel: merge with minstrel_ht, always enable VHT supportFelix Fietkau
Legacy-only devices are not very common and the overhead of the extra code for HT and VHT rates is not big enough to justify all those extra lines of code to make it optional. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11mac80211: minstrel: remove unnecessary debugfs cleanup codeFelix Fietkau
debugfs entries are cleaned up by debugfs_remove_recursive already. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11mac80211: minstrel: Enable STBC and LDPC for VHT RatesChaitanya T K
If peer support reception of STBC and LDPC, enable them for better performance. Signed-off-by: Chaitanya TK <chaitanya.mgit@gmail.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11mac80211: avoid reflecting frames back to the clientJohannes Berg
I'm not really sure exactly _why_ I've been carrying a note for what's probably _years_ to check that we don't do this, but we clearly do reflect frames back to the station itself if it sends such. One way or the other, it's useless since the station doesn't really need the AP to talk to itself, so suppress it. While at it, clarify some of the logic by removing skb->data references in favour of the destination address (pointer) we already have separately. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11nl80211: use netlink policy validation function for elementsJohannes Berg
Instead of open-coding a lot of calls to is_valid_ie_attr(), add this validation directly to the policy, now that we can. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11nl80211: use policy range validation where applicableJohannes Berg
Many range checks can be done in the policy, move them there. A few in mesh are added in the code (taken out of the macros) because they don't fit into the s16 range in the policy validation. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-10-11xfrm: policy: use hlist rcu variants on insertFlorian Westphal
bydst table/list lookups use rcu, so insertions must use rcu versions. Fixes: a7c44247f704e ("xfrm: policy: make xfrm_policy_lookup_bytype lockless") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-10-11net/xfrm: fix out-of-bounds packet accessAlexei Starovoitov
BUG: KASAN: slab-out-of-bounds in _decode_session6+0x1331/0x14e0 net/ipv6/xfrm6_policy.c:161 Read of size 1 at addr ffff8801d882eec7 by task syz-executor1/6667 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430 _decode_session6+0x1331/0x14e0 net/ipv6/xfrm6_policy.c:161 __xfrm_decode_session+0x71/0x140 net/xfrm/xfrm_policy.c:2299 xfrm_decode_session include/net/xfrm.h:1232 [inline] vti6_tnl_xmit+0x3c3/0x1bc1 net/ipv6/ip6_vti.c:542 __netdev_start_xmit include/linux/netdevice.h:4313 [inline] netdev_start_xmit include/linux/netdevice.h:4322 [inline] xmit_one net/core/dev.c:3217 [inline] dev_hard_start_xmit+0x272/0xc10 net/core/dev.c:3233 __dev_queue_xmit+0x2ab2/0x3870 net/core/dev.c:3803 dev_queue_xmit+0x17/0x20 net/core/dev.c:3836 Reported-by: syzbot+acffccec848dc13fe459@syzkaller.appspotmail.com Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-10-11sched/fair: Fix throttle_list starvation with low CFS quotaPhil Auld
With a very low cpu.cfs_quota_us setting, such as the minimum of 1000, distribute_cfs_runtime may not empty the throttled_list before it runs out of runtime to distribute. In that case, due to the change from c06f04c7048 to put throttled entries at the head of the list, later entries on the list will starve. Essentially, the same X processes will get pulled off the list, given CPU time and then, when expired, get put back on the head of the list where distribute_cfs_runtime will give runtime to the same set of processes leaving the rest. Fix the issue by setting a bit in struct cfs_bandwidth when distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can decide to put the throttled entry on the tail or the head of the list. The bit is set/cleared by the callers of distribute_cfs_runtime while they hold cfs_bandwidth->lock. This is easy to reproduce with a handful of CPU consumers. I use 'crash' on the live system. In some cases you can simply look at the throttled list and see the later entries are not changing: crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3 1 ffff90b56cb2d200 -976050 2 ffff90b56cb2cc00 -484925 3 ffff90b56cb2bc00 -658814 4 ffff90b56cb2ba00 -275365 5 ffff90b166a45600 -135138 6 ffff90b56cb2da00 -282505 7 ffff90b56cb2e000 -148065 8 ffff90b56cb2fa00 -872591 9 ffff90b56cb2c000 -84687 10 ffff90b56cb2f000 -87237 11 ffff90b166a40a00 -164582 crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3 1 ffff90b56cb2d200 -994147 2 ffff90b56cb2cc00 -306051 3 ffff90b56cb2bc00 -961321 4 ffff90b56cb2ba00 -24490 5 ffff90b166a45600 -135138 6 ffff90b56cb2da00 -282505 7 ffff90b56cb2e000 -148065 8 ffff90b56cb2fa00 -872591 9 ffff90b56cb2c000 -84687 10 ffff90b56cb2f000 -87237 11 ffff90b166a40a00 -164582 Sometimes it is easier to see by finding a process getting starved and looking at the sched_info: crash> task ffff8eb765994500 sched_info PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest" sched_info = { pcount = 8, run_delay = 697094208, last_arrival = 240260125039, last_queued = 240260327513 }, crash> task ffff8eb765994500 sched_info PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest" sched_info = { pcount = 8, run_delay = 697094208, last_arrival = 240260125039, last_queued = 240260327513 }, Signed-off-by: Phil Auld <pauld@redhat.com> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop") Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-11Merge branch 'x86-urgent-for-linus' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Ingo writes: "x86 fixes An intel_rdt memory access fix and a VLA fix in pgd_alloc()." * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Avoid VLA in pgd_alloc() x86/intel_rdt: Fix out-of-bounds memory access in CBM tests