summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-08-27i40e/i40evf: support for VF VLAN tag stripping controlMariusz Stachura
This patch gives VF capability to control VLAN tag stripping via ethtool. As rx-vlan-offload was fixed before, now the VF is able to change it using "ethtool --offload <IF> rxvlan on/off" settings. Signed-off-by: Mariusz Stachura <mariusz.stachura@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-27i40e: force VMDQ device name truncationJacob Keller
In new versions of GCC since 7.x a new warning exists which warns when a string is truncated before all of the format can be completed. When we setup VMDQ netdev names we are copying a pre-existing interface name which could be up to 15 characters in length. Since we also add 4 bytes, v, the literal %, the d and a \0 null, we would overrun the available size unless snprintf truncated for us. The snprintf call will of course truncate on the end, so lets instead modify the code to force truncation of the copied netdev name by 4 characters, to create enough space for the 4 bytes we're adding. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-27i40evf: fix possible snprintf truncation of q_vector->nameJacob Keller
The q_vector names are based on the interface name with a driver prefix, the type of q_vector setup, and the queue number. We previously set the size of this variable to IFNAMSIZ + 9, which is incorrect, because we actually include a minimum of 14 characters extra beyond the interface name size. New versions of GCC since 7 include a new warning that detects this possible truncation and complains. We can fix this by increasing the size in case our interface name is too large to avoid truncation. We don't need to go beyond 14 because the compiler is smart enough to realize our values can never exceed size of 1. We do go up to 15 here because possible future changes may increase the number of queues beyond one digit. While we are here, also change some variables to be unsigned (since they are never negative) and stop using an extra unnecessary %s format specifier. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-27i40e: Use correct flag to enable egress traffic for unicast promiscAkeem G Abodunrin
Albeit, we usually set true promiscuous mode for both multicast and unicast at the same time - however, it is possible to set it individually, so using allmulti flag which is only for allmulticast might caused unwanted behavior in mirroring egress traffic promiscuous for unicast in VF. Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-27i40e: prevent snprintf format specifier truncationJacob Keller
Increase the size of the prefix buffer so that it can hold enough characters for every possible input. Although 20 is enough for all expected inputs, it is possible for the values to be larger than expected, resulting in a possibly truncated string. Additionally, lets use sizeof(prefix) in order to ensure we use the correct size if we need to change the array length in the future. New versions of GCC starting at 7 now include warnings to prevent truncation unless you handle the return code. At most 27 bytes can be written here, so lets just increase the buffer size even if for all expected hw->bus.* values we only needed 20. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-27i40e: Store the requested FEC informationMariusz Stachura
Store information about FEC modes, that were requested. It will be used in printing link status information function and this way there is no need to call admin queue there. Signed-off-by: Mariusz Stachura <mariusz.stachura@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-27i40e: Update state variable for adminq subtaskSudheer Mogilappagari
During NVM update, state machine gets into unrecoverable state because i40e_clean_adminq_subtask can get scheduled after the admin queue command but before other state variables are updated. This causes incorrect input to i40e_nvmupd_check_wait_event and state transitions don't happen. This fix updates the state variables so that adminq_subtask will have accurate information whenever it gets scheduled. Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-27Minor page waitqueue cleanupsLinus Torvalds
Tim Chen and Kan Liang have been battling a customer load that shows extremely long page wakeup lists. The cause seems to be constant NUMA migration of a hot page that is shared across a lot of threads, but the actual root cause for the exact behavior has not been found. Tim has a patch that batches the wait list traversal at wakeup time, so that we at least don't get long uninterruptible cases where we traverse and wake up thousands of processes and get nasty latency spikes. That is likely 4.14 material, but we're still discussing the page waitqueue specific parts of it. In the meantime, I've tried to look at making the page wait queues less expensive, and failing miserably. If you have thousands of threads waiting for the same page, it will be painful. We'll need to try to figure out the NUMA balancing issue some day, in addition to avoiding the excessive spinlock hold times. That said, having tried to rewrite the page wait queues, I can at least fix up some of the braindamage in the current situation. In particular: (a) we don't want to continue walking the page wait list if the bit we're waiting for already got set again (which seems to be one of the patterns of the bad load). That makes no progress and just causes pointless cache pollution chasing the pointers. (b) we don't want to put the non-locking waiters always on the front of the queue, and the locking waiters always on the back. Not only is that unfair, it means that we wake up thousands of reading threads that will just end up being blocked by the writer later anyway. Also add a comment about the layout of 'struct wait_page_key' - there is an external user of it in the cachefiles code that means that it has to match the layout of 'struct wait_bit_key' in the two first members. It so happens to match, because 'struct page *' and 'unsigned long *' end up having the same values simply because the page flags are the first member in struct page. Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Christopher Lameter <cl@linux.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-27Clarify (and fix) MAX_LFS_FILESIZE macrosLinus Torvalds
We have a MAX_LFS_FILESIZE macro that is meant to be filled in by filesystems (and other IO targets) that know they are 64-bit clean and don't have any 32-bit limits in their IO path. It turns out that our 32-bit value for that limit was bogus. On 32-bit, the VM layer is limited by the page cache to only 32-bit index values, but our logic for that was confusing and actually wrong. We used to define that value to (((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1) which is actually odd in several ways: it limits the index to 31 bits, and then it limits files so that they can't have data in that last byte of a page that has the highest 31-bit index (ie page index 0x7fffffff). Neither of those limitations make sense. The index is actually the full 32 bit unsigned value, and we can use that whole full page. So the maximum size of the file would logically be "PAGE_SIZE << BITS_PER_LONG". However, we do wan tto avoid the maximum index, because we have code that iterates over the page indexes, and we don't want that code to overflow. So the maximum size of a file on a 32-bit host should actually be one page less than the full 32-bit index. So the actual limit is ULONG_MAX << PAGE_SHIFT. That means that we will not actually be using the page of that last index (ULONG_MAX), but we can grow a file up to that limit. The wrong value of MAX_LFS_FILESIZE actually caused problems for Doug Nazar, who was still using a 32-bit host, but with a 9.7TB 2 x RAID5 volume. It turns out that our old MAX_LFS_FILESIZE was 8TiB (well, one byte less), but the actual true VM limit is one page less than 16TiB. This was invisible until commit c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()"), which started applying that MAX_LFS_FILESIZE limit to block devices too. NOTE! On 64-bit, the page index isn't a limiter at all, and the limit is actually just the offset type itself (loff_t), which is signed. But for clarity, on 64-bit, just use the maximum signed value, and don't make people have to count the number of 'f' characters in the hex constant. So just use LLONG_MAX for the 64-bit case. That was what the value had been before too, just written out as a hex constant. Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()") Reported-and-tested-by: Doug Nazar <nazard@nazar.ca> Cc: Andreas Dilger <adilger@dilger.ca> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input fixes from Dmitry Torokhov: - a tweak to the IBM Trackpoint driver that helps recognizing trackpoints on never Lenovo Carbons - a fix to the ALPS driver solving scroll issues on some Dells - yet another ACPI ID has been added to Elan I2C toucpad driver - quieted diagnostic message in soc_button_array driver * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: ALPS - fix two-finger scroll breakage in right side on ALPS touchpad Input: soc_button_array - silence -ENOENT error on Dell XPS13 9365 Input: trackpoint - add new trackpoint firmware ID Input: elan_i2c - add ELAN0602 ACPI ID to support Lenovo Yoga310
2017-08-26Merge tag 'pci-v4.13-fixes-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fix from Bjorn Helgaas: "Remove needlessly alarming MSI affinity warning (this is not actually a bug fix, but the warning prompts unnecessary bug reports)" * tag 'pci-v4.13-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI/MSI: Don't warn when irq_create_affinity_masks() returns NULL
2017-08-26Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Two fixes: one for an ldt_struct handling bug and a cherry-picked objtool fix" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Fix use-after-free of ldt_struct objtool: Fix '-mtune=atom' decoding support in objtool 2.0
2017-08-26Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix a timer granularity handling race+bug, which would manifest itself by spuriously increasing timeouts of some timers (from 1 jiffy to ~500 jiffies in the worst case measured) in certain nohz states" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Fix excessive granularity of new timers after a nohz idle
2017-08-26Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Ingo Molnar: "A single fix to not allow nonsensical event groups that result in kernel warnings" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix group {cpu,task} validation
2017-08-25Merge tag 'wireless-drivers-for-davem-2017-08-25' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers Kalle Valo says: ==================== wireless-drivers fixes for 4.13 Only one iwlwifi patch this time. iwlwifi * fix multiple times reported lockdep warning found by new locking annotation introduced in v4.13-rc1 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25net: mvpp2: fix the packet size configuration for 10GAntoine Ténart
The MVPP22_XLG_CTRL1_FRAMESIZELIMIT define is used as an offset, but is defined as BIT(0). Updated its name to contains "OFFS" as in offset and fix its value using the offset value, 0. Reported-by: Stefan Chulski <stefanc@marvell.com> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Fixes: 76eb1b1de5b6 ("net: mvpp2: set maximum packet size for 10G ports") Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25udp6: set rx_dst_cookie on rx_dst updatesPaolo Abeni
Currently, in the udp6 code, the dst cookie is not initialized/updated concurrently with the RX dst used by early demux. As a result, the dst_check() in the early_demux path always fails, the rx dst cache is always invalidated, and we can't really leverage significant gain from the demux lookup. Fix it adding udp6 specific variant of sk_rx_dst_set() and use it to set the dst cookie when the dst entry is really changed. The issue is there since the introduction of early demux for ipv6. Fixes: 5425077d73e0 ("net: ipv6: Add early demux handler for UDP unicast") Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25net: sxgbe: check memory allocation failureChristophe Jaillet
Check memory allocation failure and return -ENOMEM in such a case, as already done few lines below for another memory allocation. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25Merge branch '40GbE' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 40GbE Intel Wired LAN Driver Updates 2017-08-25 This series contains updates to i40e and i40evf only. Mitch adjusts the max packet size to account for two VLAN tags. Sudheer provides a fix to ensure that the watchdog timer is scheduled immediately after admin queue operations are scheduled in i40evf_down(). Fixes an issue by adding locking around the admin queue command and update of state variables so that adminq_subtask will have the accurate information whenever it gets scheduled. Anjali fixes a bug where the PF flag setup should happen before the VMDq RSS queue count is initialized for VMDq VSI to get the right number of queues for RSS in the case of x722 devices. Fixed a problem with the hardware ATR eviction feature where the NVM setting was incorrect. Jake separates the flags into two types, hw_features and flags. The hw_features flags contain a set of features which are enabled at init time and will not contain feature flags that can be toggled. Everything else will remain in the flags variable, and can be modified anytime during run time. We should not be directly copying a cpumask_t, since it is bitmap and might not be copied correctly, so use cpumask_copy() instead. Stefan Assmann makes vf _offload_flags more "generic" by renaming it to vf_cap_flags, which allows other capabilities besides offloading to be added. Alan makes it such that if adaptive-rx/tx is enabled, the user cannot make any manual adjustments to interrupt moderation. Also makes it so that if ITR is disabled by adaptive-rx/tx is then enabled, ITR will be re-enabled. v2: Dropped patches #1 & #8 from the original patch series submission, while Jesse and Jake re-work their patches based on feedback from David Miller. Also removed the duplicate patch 3 that was accidentally sent out twice in the previous submission. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25Merge branch 'nfp-SR-IOV-ndos-support'David S. Miller
Jakub Kicinski says: ==================== nfp: SR-IOV ndos support This set adds basic SR-IOV including setting/getting VF MAC addresses, VLANs, link state and spoofcheck settings. It is wired up for both vNICs and representors (note: ip link will not report VF settings on VF/PF representors because they are not linked to the PF PCI device). Pablo and team add the basic implementation, Simon and Dirk follow up with the representor plumbing. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25nfp: add basic SR-IOV ndo functions to representorsSimon Horman
Add basic ndo_set/get_vf to support SR-IOV on all types of port representors. Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25nfp: add basic SR-IOV ndo functionsPablo Cascón
Add basic ndo_set/get_vf to support SR-IOV. VF to egress phy static mapping by now. Use vfcfg ABI version 2 to write the info to the FW and collect the return value from the mailbox. Signed-off-by: Pablo Cascón <pablo.cascon@netronome.com> Signed-off-by: Jimmy Kizito <jimmy.kizito@netronome.com> Signed-off-by: Rami Tomer <rami.tomer@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25Merge branch 'r8169-Be-drop-monitor-friendly'David S. Miller
Florian Fainelli says: ==================== r8169: Be drop monitor friendly First patch may be questionable but no other driver appears to be doing that and while it is defendable to account for left packets as dropped during TX clean, this appears misleading. I picked Stanislaw changes which brings us back to 2010, but this was present from pre-git days as well. Second patch fixes the two missing calls to dev_consume_skb_any(). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25r8169: Be drop monitor friendlyFlorian Fainelli
rtl_tx() is the TX reclamation process whereas rtl8169_tx_clear_range() does the TX ring cleaning during shutdown, both of these functions should call dev_consume_skb_any() to be drop monitor friendly. Fixes: cac4b22f3d6a ("r8169: do not account fragments as packets") Fixes: eb781397904e ("r8169: Do not use dev_kfree_skb in xmit path") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25r8169: Do not increment tx_dropped in TX ring cleaningFlorian Fainelli
rtl8169_tx_clear_range() is responsible for cleaning up the TX ring during interface shutdown, incrementing tx_dropped for every SKB that we left at the time in the ring is misleading. Fixes: cac4b22f3d6a ("r8169: do not account fragments as packets") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "6 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/memblock.c: reversed logic in memblock_discard() fork: fix incorrect fput of ->exe_file causing use-after-free mm/madvise.c: fix freeing of locked page with MADV_FREE dax: fix deadlock due to misaligned PMD faults mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled PM/hibernate: touch NMI watchdog when creating snapshot
2017-08-25Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull Paolo Bonzini: "Bugfixes for x86, PPC and s390" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: PPC: Book3S: Fix race and leak in kvm_vm_ioctl_create_spapr_tce() KVM, pkeys: do not use PKRU value in vcpu->arch.guest_fpu.state KVM: x86: simplify handling of PKRU KVM: x86: block guest protection keys unless the host has them enabled KVM: PPC: Book3S HV: Add missing barriers to XIVE code and document them KVM: PPC: Book3S HV: Workaround POWER9 DD1.0 bug causing IPB bit loss KVM: PPC: Book3S HV: Use msgsync with hypervisor doorbells on POWER9 KVM: s390: sthyi: fix specification exception detection KVM: s390: sthyi: fix sthyi inline assembly
2017-08-25Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio fixes from Michael Tsirkin: "Fixes two obvious bugs in virtio pci" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio_pci: fix cpu affinity support virtio_blk: fix incorrect message when disk is resized
2017-08-25Merge tag 'powerpc-4.13-8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fix from Michael Ellerman: "Just one fix, to add a barrier in the switch_mm() code to make sure the mm cpumask update is ordered vs the MMU starting to load translations. As far as we know no one's actually hit the bug, but that's just luck. Thanks to Benjamin Herrenschmidt, Nicholas Piggin" * tag 'powerpc-4.13-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/mm: Ensure cpumask update is ordered
2017-08-25Merge tag 'nfsd-4.13-2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd fixes from Bruce Fields: "Two nfsd bugfixes, neither 4.13 regressions, but both potentially serious" * tag 'nfsd-4.13-2' of git://linux-nfs.org/~bfields/linux: net: sunrpc: svcsock: fix NULL-pointer exception nfsd: Limit end of page list when decoding NFSv4 WRITE
2017-08-25Merge tag 'cifs-fixes-for-4.13-rc6-and-stable' of ↵Linus Torvalds
git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Some bug fixes for stable for cifs" * tag 'cifs-fixes-for-4.13-rc6-and-stable' of git://git.samba.org/sfrench/cifs-2.6: cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup() cifs: Fix df output for users with quota limits
2017-08-25tcp: fix hang in tcp_sendpage_locked()Eric Dumazet
syszkaller got a hang in tcp stack, related to a bug in tcp_sendpage_locked() root@syzkaller:~# cat /proc/3059/stack [<ffffffff83de926c>] __lock_sock+0x1dc/0x2f0 [<ffffffff83de9473>] lock_sock_nested+0xf3/0x110 [<ffffffff8408ce01>] tcp_sendmsg+0x21/0x50 [<ffffffff84163b6f>] inet_sendmsg+0x11f/0x5e0 [<ffffffff83dd8eea>] sock_sendmsg+0xca/0x110 [<ffffffff83dd9547>] kernel_sendmsg+0x47/0x60 [<ffffffff83de35dc>] sock_no_sendpage+0x1cc/0x280 [<ffffffff8408916b>] tcp_sendpage_locked+0x10b/0x160 [<ffffffff84089203>] tcp_sendpage+0x43/0x60 [<ffffffff841641da>] inet_sendpage+0x1aa/0x660 [<ffffffff83dd4fcd>] kernel_sendpage+0x8d/0xe0 [<ffffffff83dd50ac>] sock_sendpage+0x8c/0xc0 [<ffffffff81b63300>] pipe_to_sendpage+0x290/0x3b0 [<ffffffff81b67243>] __splice_from_pipe+0x343/0x750 [<ffffffff81b6a459>] splice_from_pipe+0x1e9/0x330 [<ffffffff81b6a5e0>] generic_splice_sendpage+0x40/0x50 [<ffffffff81b6b1d7>] SyS_splice+0x7b7/0x1610 [<ffffffff84d77a01>] entry_SYSCALL_64_fastpath+0x1f/0xbe Fixes: 306b13eb3cf9 ("proto_ops: Add locked held versions of sendmsg and sendpage") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Tom Herbert <tom@quantonium.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25Merge branch 'net_sched-clean-up-tc-classes-and-u32-filter'David S. Miller
Cong Wang says: ==================== net_sched: clean up tc classes and u32 filter Patch 1 and patch 2 prepare for patch 3. Major changes are in patch 3 and patch 4, details are there too. v2: Add patch 1 and 2, group all into a patchset Fix a coding style issue in patch 4 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25net_sched: kill u32_node pointer in QdiscWANG Cong
It is ugly to hide a u32-filter-specific pointer inside Qdisc, this breaks the TC layers: 1. Qdisc is a generic representation, should not have any specific data of any type 2. Qdisc layer is above filter layer, should only save filters in the list of struct tcf_proto. This pointer is used as the head of the chain of u32 hash tables, that is struct tc_u_hnode, because u32 filter is very special, it allows to create multiple hash tables within one qdisc and across multiple u32 filters. Instead of using this ugly pointer, we can just save it in a global hash table key'ed by (dev ifindex, qdisc handle), therefore we can still treat it as a per qdisc basis data structure conceptually. Of course, because of network namespaces, this key is not unique at all, but it is fine as we already have a pointer to Qdisc in struct tc_u_common, we can just compare the pointers when collision. And this only affects slow paths, has no impact to fast path, thanks to the pointer ->tp_c. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25net_sched: remove tc class reference countingWANG Cong
For TC classes, their ->get() and ->put() are always paired, and the reference counting is completely useless, because: 1) For class modification and dumping paths, we already hold RTNL lock, so all of these ->get(),->change(),->put() are atomic. 2) For filter bindiing/unbinding, we use other reference counter than this one, and they should have RTNL lock too. 3) For ->qlen_notify(), it is special because it is called on ->enqueue() path, but we already hold qdisc tree lock there, and we hold this tree lock when graft or delete the class too, so it should not be gone or changed until we release the tree lock. Therefore, this patch removes ->get() and ->put(), but: 1) Adds a new ->find() to find the pointer to a class by classid, no refcnt. 2) Move the original class destroy upon the last refcnt into ->delete(), right after releasing tree lock. This is fine because the class is already removed from hash when holding the lock. For those who also use ->put() as ->unbind(), just rename them to reflect this change. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25net_sched: introduce tclass_del_notify()WANG Cong
Like for TC actions, ->delete() is a special case, we have to prepare and fill the notification before delete otherwise would get use-after-free after we remove the reference count. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25net_sched: get rid of more forward declarationsWANG Cong
This is not needed if we move them up properly. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25tcp: fix refcnt leak with ebpf congestion controlSabrina Dubroca
There are a few bugs around refcnt handling in the new BPF congestion control setsockopt: - The new ca is assigned to icsk->icsk_ca_ops even in the case where we cannot get a reference on it. This would lead to a use after free, since that ca is going away soon. - Changing the congestion control case doesn't release the refcnt on the previous ca. - In the reinit case, we first leak a reference on the old ca, then we call tcp_reinit_congestion_control on the ca that we have just assigned, leading to deinitializing the wrong ca (->release of the new ca on the old ca's data) and releasing the refcount on the ca that we actually want to use. This is visible by building (for example) BIC as a module and setting net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from samples/bpf. This patch fixes the refcount issues, and moves reinit back into tcp core to avoid passing a ca pointer back to BPF. Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25hinic: skb_pad() frees on errorDan Carpenter
The skb_pad() function frees the skb on error, so this code has a double free. Fixes: 00e57a6d4ad3 ("net-next/hinic: Add Tx operation") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25Merge branch 'ipv6-sr-updates'David S. Miller
David Lebrun says: ==================== net: updates for IPv6 Segment Routing v2: seg6_lwt_headroom() is not relevant for lwtunnel_input_redirect() use cases, and L2ENCAP only uses this redirection. Fix incoherence between arbitrary MAC header size support and fixed headroom computation by setting only LWTUNNEL_STATE_INPUT_REDIRECT for L2ENCAP mode. This patch series provides several updates for the SRv6 implementation. The first patch leverages the existing infrastructure to support encapsulation of IPv4 packets. The second patch implements the T.Encaps.L2 SR function, enabling to encapsulate an L2 Ethernet frame within an IPv6+SRH packet. The last three patches update the seg6local lightweight tunnel, and mainly implement four new actions: End.T, End.DX2, End.DX4 and End.DT6. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: implement additional seg6local actionsDavid Lebrun
This patch implements the following seg6local actions. - SEG6_LOCAL_ACTION_END_T: regular SRH processing and forward to the next-hop looked up in the specified routing table. - SEG6_LOCAL_ACTION_END_DX2: decapsulate an L2 frame and forward it to the specified network interface. - SEG6_LOCAL_ACTION_END_DX4: decapsulate an IPv4 packet and forward it, possibly to the specified next-hop. - SEG6_LOCAL_ACTION_END_DT6: decapsulate an IPv6 packet and forward it to the next-hop looked up in the specified routing table. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: add helper functions for seg6localDavid Lebrun
This patch adds three helper functions to be used with the seg6local packet processing actions. The decap_and_validate() function will be used by the End.D* actions, that decapsulate an SR-enabled packet. The advance_nextseg() function applies the fundamental operations to update an SRH for the next segment. The lookup_nexthop() function helps select the next-hop for the processed SR packets. It supports an optional next-hop address to route the packet specifically through it, and an optional routing table to use. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: enforce IPv6 packets for seg6local lwtDavid Lebrun
This patch ensures that the seg6local lightweight tunnel is used solely with IPv6 routes and processes only IPv6 packets. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: add support for encapsulation of L2 framesDavid Lebrun
This patch implements the L2 frame encapsulation mechanism, referred to as T.Encaps.L2 in the SRv6 specifications [1]. A new type of SRv6 tunnel mode is added (SEG6_IPTUN_MODE_L2ENCAP). It only accepts packets with an existing MAC header (i.e., it will not work for locally generated packets). The resulting packet looks like IPv6 -> SRH -> Ethernet -> original L3 payload. The next header field of the SRH is set to NEXTHDR_NONE. [1] https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-01 Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: add support for ip4ip6 encapsulationDavid Lebrun
This patch enables the SRv6 encapsulation mode to carry an IPv4 payload. All the infrastructure was already present, I just had to add a parameter to seg6_do_srh_encap() to specify the inner packet protocol, and perform some additional checks. Usage example: ip route add 1.2.3.4 encap seg6 mode encap segs fc00::1,fc00::2 dev eth0 Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25Merge tag 'for-linus-20170825' of git://git.infradead.org/linux-mtdLinus Torvalds
Pull MTD fixes from Brian Norris: "Two fixes - one for a 4.13 regression, and the other for an older one: - Atmel NAND: since we started utilizing ONFI timings, we found that we were being too restrict at rejecting them, partly due to discrepancies in ONFI 4.0 and earlier versions. Relax the restriction to keep these platforms booting. This is a 4.13-rc1 regression. - nandsim: repeated probe/removal may not work after a failed init, because we didn't free up our debugfs files properly on the failure path. This has been around since 3.8, but it's nice to get this fixed now in a nice easy patch that can target -stable, since there's already refactoring work (that also fixes the issue) targeted for the next merge window" * tag 'for-linus-20170825' of git://git.infradead.org/linux-mtd: mtd: nand: atmel: Relax tADL_min constraint mtd: nandsim: remove debugfs entries in error path
2017-08-25ipv6: Fix may be used uninitialized warning in rt6_checkSteffen Klassert
rt_cookie might be used uninitialized, fix this by initializing it. Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A small batch of fixes that should be included for the 4.13 release. This contains: - Revert of the 4k loop blocksize support. Even with a recent batch of 4 fixes, we're still not really happy with it. Rather than be stuck with an API issue, let's revert it and get it right for 4.14. - Trivial patch from Bart, adding a few flags to the blk-mq debugfs exports that were added in this release, but not to the debugfs parts. - Regression fix for bsg, fixing a potential kernel panic. From Benjamin. - Tweak for the blk throttling, improving how we account discards. From Shaohua" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq-debugfs: Add names for recently added flags bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer Revert "loop: support 4k physical blocksize" blk-throttle: cap discard request size
2017-08-25Merge branch 'i2c/for-current' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "I2C has some bugfixes for you: mainly Jarkko fixed up a few things in the designware driver regarding the new slave mode. But Ulf also fixed a long-standing and now agreed suspend problem. Plus, some simple stuff which nonetheless needs fixing" * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: designware: Fix runtime PM for I2C slave mode i2c: designware: Remove needless pm_runtime_put_noidle() call i2c: aspeed: fixed potential null pointer dereference i2c: simtec: use release_mem_region instead of release_resource i2c: core: Make comment about I2C table requirement to reflect the code i2c: designware: Fix standard mode speed when configuring the slave mode i2c: designware: Fix oops from i2c_dw_irq_handler_slave i2c: designware: Fix system suspend
2017-08-25PCI/MSI: Don't warn when irq_create_affinity_masks() returns NULLChristoph Hellwig
irq_create_affinity_masks() can return NULL on non-SMP systems, when there are not enough "free" vectors available to spread, or if memory allocation for the CPU masks fails. Only the allocation failure is of interest, and even then the system will work just fine except for non-optimally spread vectors. Thus remove the warnings. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: David S. Miller <davem@davemloft.net>