summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2018-04-25neighbour: support for NTF_EXT_LEARNED flagRoopa Prabhu
This patch extends NTF_EXT_LEARNED support to the neighbour system. Example use-case: An Ethernet VPN implementation (eg in FRR routing suite) can use this flag to add dynamic reachable external neigh entires learned via control plane. The use of neigh NTF_EXT_LEARNED in this patch is consistent with its use with bridge and vxlan fdb entries. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap modeAhmed Abdelsalam
ECMP (equal-cost multipath) hashes are typically computed on the packets' 5-tuple(src IP, dst IP, src port, dst port, L4 proto). For encapsulated packets, the L4 data is not readily available and ECMP hashing will often revert to (src IP, dst IP). This will lead to traffic polarization on a single ECMP path, causing congestion and waste of network capacity. In IPv6, the 20-bit flow label field is also used as part of the ECMP hash. In the lack of L4 data, the hashing will be on (src IP, dst IP, flow label). Having a non-zero flow label is thus important for proper traffic load balancing when L4 data is unavailable (i.e., when packets are encapsulated). Currently, the seg6_do_srh_encap() function extracts the original packet's flow label and set it as the outer IPv6 flow label. There are two issues with this behaviour: a) There is no guarantee that the inner flow label is set by the source. b) If the original packet is not IPv6, the flow label will be set to zero (e.g., IPv4 or L2 encap). This patch adds a function, named seg6_make_flowlabel(), that computes a flow label from a given skb. It supports IPv6, IPv4 and L2 payloads, and leverages the per namespace 'seg6_flowlabel" sysctl value. The currently support behaviours are as follows: -1 set flowlabel to zero. 0 copy flowlabel from Inner paceket in case of Inner IPv6 (Set flowlabel to 0 in case IPv4/L2) 1 Compute the flowlabel using seg6_make_flowlabel() This patch has been tested for IPv6, IPv4, and L2 traffic. Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com> Acked-by: David Lebrun <dlebrun@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2018-04-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix rtnl deadlock in ipvs, from Julian Anastasov. 2) s390 qeth fixes from Julian Wiedmann (control IO completion stalls, bad MAC address update sequence, request side races on command IO timeouts). 3) Handle seq_file overflow properly in l2tp, from Guillaume Nault. 4) Fix VLAN priority mappings in cpsw driver, from Ivan Khoronzhuk. 5) Packet scheduler ife action fixes (malformed TLV lengths, etc.) from Alexander Aring. 6) Fix out of bounds access in tcp md5 option parser, from Jann Horn. 7) Missing netlink attribute policies in rtm_ipv6_policy table, from Eric Dumazet. 8) Missing socket address length checks in l2tp and pppoe connect, from Guillaume Nault. 9) Fix netconsole over team and bonding, from Xin Long. 10) Fix race with AF_PACKET socket state bitfields, from Willem de Bruijn. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (51 commits) ice: Fix insufficient memory issue in ice_aq_manage_mac_read sfc: ARFS filter IDs net: ethtool: Add missing kernel doc for FEC parameters packet: fix bitfield update race ice: Do not check INTEVENT bit for OICR interrupts ice: Fix incorrect comment for action type ice: Fix initialization for num_nodes_added igb: Fix the transmission mode of queue 0 for Qav mode ixgbevf: ensure xdp_ring resources are free'd on error exit team: fix netconsole setup over team amd-xgbe: Only use the SFP supported transceiver signals amd-xgbe: Improve KR auto-negotiation and training amd-xgbe: Add pre/post auto-negotiation phy hooks pppoe: check sockaddr length in pppoe_connect() l2tp: check sockaddr length in pppol2tp_connect() net: phy: marvell: clear wol event before setting it ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave tcp: don't read out-of-bounds opsize ibmvnic: Clean actual number of RX or TX pools ...
2018-04-24net: ethtool: Add missing kernel doc for FEC parametersFlorian Fainelli
While adding support for ethtool::get_fecparam and set_fecparam, kernel doc for these functions was missed, add those. Fixes: 1a5f3da20bd9 ("net: ethtool: add support for forward error correction modes") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24rhashtable: Revise incorrect comment on r{hl, hash}table_walk_enter()NeilBrown
Neither rhashtable_walk_enter() or rhltable_walk_enter() sleep, though they do take a spinlock without irq protection. So revise the comments to accurately state the contexts in which these functions can be called. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24rhashtable: remove outdated comments about grow_decision etcNeilBrown
grow_decision and shink_decision no longer exist, so remove the remaining references to them. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24Revert "net: init sk_cookie for inet socket"Yafang Shao
This reverts commit <c6849a3ac17e> ("net: init sk_cookie for inet socket") Per discussion with Eric, when update sock_net(sk)->cookie_gen, the whole cache cache line will be invalidated, as this cache line is shared with all cpus, that may cause great performace hit. Bellow is the data form Eric. "Performance is reduced from ~5 Mpps to ~3.8 Mpps with 16 RX queues on my host" when running synflood test. Have to revert it to prevent from cache line false sharing. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24net/dim: Support adaptive TX moderationTal Gilboa
Interrupt moderation for TX traffic requires different profiles than RX interrupt moderation. The main goal here is to reduce interrupt rate and allow better payload aggregation by keeping SKBs in the TX queue a bit longer. Ping-pong behavior would get a profile with a short timer, so latency wouldn't increase for these scenarios. There might be a slight degradation in bandwidth for single stream with large message sizes, since net.ipv4.tcp_limit_output_bytes is limiting the allowed TX traffic, but with many streams performance is always improved. Signed-off-by: Tal Gilboa <talgi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24net/dim: Rename *_get_profile() functions to *_get_rx_moderation()Tal Gilboa
Preparation for introducing adaptive TX to net DIM. Signed-off-by: Tal Gilboa <talgi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23qed: Add configuration information to register dump and debug dataDenis Bolotin
Configuration information is added to the debug data collection, in addition to register dump. Added qed_dbg_nvm_image() that receives an image type, allocates a buffer and reads the image. The images are saved in the buffers and the dump size is updated. Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com> Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23net: init sk_cookie for inet socketYafang Shao
With sk_cookie we can identify a socket, that is very helpful for traceing and statistic, i.e. tcp tracepiont and ebpf. So we'd better init it by default for inet socket. When using it, we just need call atomic64_read(&sk->sk_cookie). Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23net: fib_rules: add extack supportRoopa Prabhu
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23net: introduce a new tracepoint for tcp_rcv_space_adjustYafang Shao
tcp_rcv_space_adjust is called every time data is copied to user space, introducing a tcp tracepoint for which could show us when the packet is copied to user. When a tcp packet arrives, tcp_rcv_established() will be called and with the existed tracepoint tcp_probe we could get the time when this packet arrives. Then this packet will be copied to user, and tcp_rcv_space_adjust will be called and with this new introduced tracepoint we could get the time when this packet is copied to user. With these two tracepoints, we could figure out whether the user program processes this packet immediately or there's latency. Hence in the printk message, sk_cookie is printed as a key to relate tcp_rcv_space_adjust with tcp_probe. Maybe we could export sockfd in this new tracepoint as well, then we could relate this new tracepoint with epoll/read/recv* tracepoints, and finally that could show us the whole lifespan of this packet. But we could also implement that with pid as these functions are executed in process context. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2018-04-21 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix a deadlock between mm->mmap_sem and bpf_event_mutex when one task is detaching a BPF prog via perf_event_detach_bpf_prog() and another one dumping through bpf_prog_array_copy_info(). For the latter we move the copy_to_user() out of the bpf_event_mutex lock to fix it, from Yonghong. 2) Fix test_sock and test_sock_addr.sh failures. The former was hitting rlimit issues and the latter required ping to specify the address family, from Yonghong. 3) Remove a dead check in sockmap's sock_map_alloc(), from Jann. 4) Add generated files to BPF kselftests gitignore that were previously missed, from Anders. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22net: sched: ife: handle malformed tlv lengthAlexander Aring
There is currently no handling to check on a invalid tlv length. This patch adds such handling to avoid killing the kernel with a malformed ife packet. Signed-off-by: Alexander Aring <aring@mojatatu.com> Reviewed-by: Yotam Gigi <yotam.gi@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22Merge tag 'drm-fixes-for-v4.17-rc2' of ↵Linus Torvalds
git://people.freedesktop.org/~airlied/linux Pull drm fixes from Dave Airlie: "Exynos, i915, vc4, amdgpu fixes. i915: - an oops fix - two race fixes - some gvt fixes amdgpu: - dark screen fix - clk/voltage fix - vega12 smu fix vc4: - memory leak fix exynos just drops some code" * tag 'drm-fixes-for-v4.17-rc2' of git://people.freedesktop.org/~airlied/linux: (23 commits) drm/amd/powerplay: header file interface to SMU update drm/amd/pp: Fix bug voltage can't be OD separately on VI drm/amd/display: Don't program bypass on linear regamma LUT drm/i915: Fix LSPCON TMDS output buffer enabling from low-power state drm/i915/audio: Fix audio detection issue on GLK drm/i915: Call i915_perf_fini() on init_hw error unwind drm/i915/bios: filter out invalid DDC pins from VBT child devices drm/i915/pmu: Inspect runtime PM state more carefully while estimating RC6 drm/i915: Do no use kfree() to free a kmem_cache_alloc() return value drm/exynos: exynos_drm_fb -> drm_framebuffer drm/exynos: Move dma_addr out of exynos_drm_fb drm/exynos: Move GEM BOs to drm_framebuffer drm: Fix HDCP downstream dev count read drm/vc4: Fix memory leak during BO teardown drm/i915/execlists: Clear user-active flag on preemption completion drm/i915/gvt: Add drm_format_mod update drm/i915/gvt: Disable primary/sprite/cursor plane at virtual display initialization drm/i915/gvt: Delete redundant error message in fb_decode.c drm/i915/gvt: Cancel dma map when resetting ggtt entries drm/i915/gvt: Missed to cancel dma map for ggtt entries ...
2018-04-22llc: delete timers synchronously in llc_sk_free()Cong Wang
The connection timers of an llc sock could be still flying after we delete them in llc_sk_free(), and even possibly after we free the sock. We could just wait synchronously here in case of troubles. Note, I leave other call paths as they are, since they may not have to wait, at least we can change them to synchronously when needed. Also, move the code to net/llc/llc_conn.c, which is apparently a better place. Reported-by: <syzbot+f922284c18ea23a8e457@syzkaller.appspotmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "A small set of timer fixes: - Evaluate the -ETIME condition correctly in the imx tpm driver - Fix the evaluation order of a condition in posix cpu timers - Use pr_cont() in the clockevents code to prevent ugly message splitting - Remove __current_kernel_time() which is now unused to prevent that new users show up. - Remove a stale forward declaration" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource/imx-tpm: Correct -ETIME return condition check posix-cpu-timers: Ensure set_process_cpu_timer is always evaluated timekeeping: Remove __current_kernel_time() timers: Remove stale struct tvec_base forward declaration clockevents: Fix kernel messages split across multiple lines
2018-04-22Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A larger set of updates for perf. Kernel: - Handle the SBOX uncore monitoring correctly on Broadwell CPUs which do not have SBOX. - Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE]. The percentage of preempting and non-preempting context switches help understanding the nature of workloads (CPU or IO bound) that are running on a machine. This adds the kernel facility and userspace changes needed to show this information in 'perf script' and 'perf report -D' (Alexey Budankov) - Remove a WARN_ON() in the trace/kprobes code which is pointless because the return error code is already telling the caller what's wrong. - Revert a fugly workaround for clang BPF targets. - Fix sample_max_stack maximum check and do not proceed when an error has been detect, return them to avoid misidentifying errors (Jiri Olsa) - Add SPDX idenitifiers and get rid of GPL boilderplate. Tools: - Synchronize kernel ABI headers, v4.17-rc1 (Ingo Molnar) - Support MAP_FIXED_NOREPLACE, noticed when updating the tools/include/ copies (Arnaldo Carvalho de Melo) - Add '\n' at the end of parse-options error messages (Ravi Bangoria) - Add s390 support for detailed/verbose PMU event description (Thomas Richter) - perf annotate fixes and improvements: * Allow showing offsets in more than just jump targets, use the new 'O' hotkey in the TUI, config ~/.perfconfig annotate.offset_level for it and for --stdio2 (Arnaldo Carvalho de Melo) * Use the resolved variable names from objdump disassembled lines to make them more compact, just like was already done for some instructions, like "mov", this eventually will be done more generally, but lets now add some more to the existing mechanism (Arnaldo Carvalho de Melo) - perf record fixes: * Change warning for missing topology sysfs entry to debug, as not all architectures have those files, s390 being one of those (Thomas Richter) * Remove old error messages about things that unlikely to be the root cause in modern systems (Andi Kleen) - perf sched fixes: * Fix -g/--call-graph documentation (Takuya Yamamoto) - perf stat: * Enable 1ms interval for printing event counters values in (Alexey Budankov) - perf test fixes: * Run dwarf unwind on arm32 (Kim Phillips) * Remove unused ptrace.h include from LLVM test, sidesteping older clang's lack of support for some asm constructs (Arnaldo Carvalho de Melo) * Fixup BPF test using epoll_pwait syscall function probe, to cope with the syscall routines renames performed in this development cycle (Arnaldo Carvalho de Melo) - perf version fixes: * Do not print info about HAVE_LIBAUDIT_SUPPORT in 'perf version --build-options' when HAVE_SYSCALL_TABLE_SUPPORT is true, as libaudit won't be used in that case, print info about syscall_table support instead (Jin Yao) - Build system fixes: * Use HAVE_..._SUPPORT used consistently (Jin Yao) * Restore READ_ONCE() C++ compatibility in tools/include (Mark Rutland) * Give hints about package names needed to build jvmti (Arnaldo Carvalho de Melo)" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits) perf/x86/intel/uncore: Fix SBOX support for Broadwell CPUs perf/x86/intel/uncore: Revert "Remove SBOX support for Broadwell server" coresight: Move to SPDX identifier perf test BPF: Fixup BPF test using epoll_pwait syscall function probe perf tests mmap: Show which tracepoint is failing perf tools: Add '\n' at the end of parse-options error messages perf record: Remove suggestion to enable APIC perf record: Remove misleading error suggestion perf hists browser: Clarify top/report browser help perf mem: Allow all record/report options perf trace: Support MAP_FIXED_NOREPLACE perf: Remove superfluous allocation error check perf: Fix sample_max_stack maximum check perf: Return proper values for user stack errors perf list: Add s390 support for detailed/verbose PMU event description perf script: Extend misc field decoding with switch out event type perf report: Extend raw dump (-D) out with switch out event type perf/core: Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE] tools/headers: Synchronize kernel ABI headers, v4.17-rc1 trace_kprobe: Remove warning message "Could not insert probe at..." ...
2018-04-21Merge tag 'random_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random Pull /dev/random fixes from Ted Ts'o: "Fix some bugs in the /dev/random driver which causes getrandom(2) to unblock earlier than designed. Thanks to Jann Horn from Google's Project Zero for pointing this out to me" * tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: random: add new ioctl RNDRESEEDCRNG random: crng_reseed() should lock the crng instance that it is modifying random: set up the NUMA crng instances after the CRNG is fully initialized random: use a different mixing algorithm for add_device_randomness() random: fix crng_ready() test
2018-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts were simple overlapping changes in microchip driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21net/ipv6: Make from in rt6_info rcu protectedDavid Ahern
When a dst entry is created from a fib entry, the 'from' in rt6_info is set to the fib entry. The 'from' reference is used most notably for cookie checking - making sure stale dst entries are updated if the fib entry is changed. When a fib entry is deleted, the pcpu routes on it are walked releasing the fib6_info reference. This is needed for the fib6_info cleanup to happen and to make sure all device references are released in a timely manner. There is a race window when a FIB entry is deleted and the 'from' on the pcpu route is dropped and the pcpu route hits a cookie check. Handle this race using rcu on from. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21net/ipv6: Move rcu locking to callers of fib6_get_cookie_safeDavid Ahern
A later patch protects 'from' in rt6_info and this simplifies the locking needed by it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21net/ipv6: Rename rt6_get_cookie_safeDavid Ahern
rt6_get_cookie_safe takes a fib6_info and checks the sernum of the node. Update the name to reflect its purpose. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21net/ipv6: Clean up rt expires helpersDavid Ahern
rt6_clean_expires and rt6_set_expires are no longer used. Removed them. rt6_update_expires has 1 caller in route.c, so move it from the header. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-04-21 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Initial work on BPF Type Format (BTF) is added, which is a meta data format which describes the data types of BPF programs / maps. BTF has its roots from CTF (Compact C-Type format) with a number of changes to it. First use case is to provide a generic pretty print capability for BPF maps inspection, later work will also add BTF to bpftool. pahole support to convert dwarf to BTF will be upstreamed as well (https://github.com/iamkafai/pahole/tree/btf), from Martin. 2) Add a new xdp_bpf_adjust_tail() BPF helper for XDP that allows for changing the data_end pointer. Only shrinking is currently supported which helps for crafting ICMP control messages. Minor changes in drivers have been added where needed so they recalc the packet's length also when data_end was adjusted, from Nikita. 3) Improve bpftool to make it easier to feed hex bytes via cmdline for map operations, from Quentin. 4) Add support for various missing BPF prog types and attach types that have been added to kernel recently but neither to bpftool nor libbpf yet. Doc and bash completion updates have been added as well for bpftool, from Andrey. 5) Proper fix for avoiding to leak info stored in frame data on page reuse for the two bpf_xdp_adjust_{head,meta} helpers by disallowing to move the pointers into struct xdp_frame area, from Jesper. 6) Follow-up compile fix from BTF in order to include stdbool.h in libbpf, from Björn. 7) Few fixes in BPF sample code, that is, a typo on the netdevice in a comment and fixup proper dump of XDP action code in the tracepoint exception, from Wang and Jesper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20kasan: add no_sanitize attribute for clang buildsAndrey Konovalov
KASAN uses the __no_sanitize_address macro to disable instrumentation of particular functions. Right now it's defined only for GCC build, which causes false positives when clang is used. This patch adds a definition for clang. Note, that clang's revision 329612 or higher is required. [andreyknvl@google.com: remove redundant #ifdef CONFIG_KASAN check] Link: http://lkml.kernel.org/r/c79aa31a2a2790f6131ed607c58b0dd45dd62a6c.1523967959.git.andreyknvl@google.com Link: http://lkml.kernel.org/r/4ad725cc903f8534f8c8a60f0daade5e3d674f8d.1523554166.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Paul Lawrence <paullawrence@google.com> Cc: Sandipan Das <sandipan@linux.vnet.ibm.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20writeback: safer lock nestingGreg Thelen
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if the page's memcg is undergoing move accounting, which occurs when a process leaves its memcg for a new one that has memory.move_charge_at_immigrate set. unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the given inode is switching writeback domains. Switches occur when enough writes are issued from a new domain. This existing pattern is thus suspicious: lock_page_memcg(page); unlocked_inode_to_wb_begin(inode, &locked); ... unlocked_inode_to_wb_end(inode, locked); unlock_page_memcg(page); If both inode switch and process memcg migration are both in-flight then unlocked_inode_to_wb_end() will unconditionally enable interrupts while still holding the lock_page_memcg() irq spinlock. This suggests the possibility of deadlock if an interrupt occurs before unlock_page_memcg(). truncate __cancel_dirty_page lock_page_memcg unlocked_inode_to_wb_begin unlocked_inode_to_wb_end <interrupts mistakenly enabled> <interrupt> end_page_writeback test_clear_page_writeback lock_page_memcg <deadlock> unlock_page_memcg Due to configuration limitations this deadlock is not currently possible because we don't mix cgroup writeback (a cgroupv2 feature) and memory.move_charge_at_immigrate (a cgroupv1 feature). If the kernel is hacked to always claim inode switching and memcg moving_account, then this script triggers lockup in less than a minute: cd /mnt/cgroup/memory mkdir a b echo 1 > a/memory.move_charge_at_immigrate echo 1 > b/memory.move_charge_at_immigrate ( echo $BASHPID > a/cgroup.procs while true; do dd if=/dev/zero of=/mnt/big bs=1M count=256 done ) & while true; do sync done & sleep 1h & SLEEP=$! while true; do echo $SLEEP > a/cgroup.procs echo $SLEEP > b/cgroup.procs done The deadlock does not seem possible, so it's debatable if there's any reason to modify the kernel. I suggest we should to prevent future surprises. And Wang Long said "this deadlock occurs three times in our environment", so there's more reason to apply this, even to stable. Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch see "[PATCH for-4.4] writeback: safer lock nesting" https://lkml.org/lkml/2018/4/11/146 Wang Long said "this deadlock occurs three times in our environment" [gthelen@google.com: v4] Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com [akpm@linux-foundation.org: comment tweaks, struct initialization simplification] Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613 Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates") Signed-off-by: Greg Thelen <gthelen@google.com> Reported-by: Wang Long <wanglong19@meituan.com> Acked-by: Wang Long <wanglong19@meituan.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: <stable@vger.kernel.org> [v4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20fork: unconditionally clear stack on forkKees Cook
One of the classes of kernel stack content leaks[1] is exposing the contents of prior heap or stack contents when a new process stack is allocated. Normally, those stacks are not zeroed, and the old contents remain in place. In the face of stack content exposure flaws, those contents can leak to userspace. Fixing this will make the kernel no longer vulnerable to these flaws, as the stack will be wiped each time a stack is assigned to a new process. There's not a meaningful change in runtime performance; it almost looks like it provides a benefit. Performing back-to-back kernel builds before: Run times: 157.86 157.09 158.90 160.94 160.80 Mean: 159.12 Std Dev: 1.54 and after: Run times: 159.31 157.34 156.71 158.15 160.81 Mean: 158.46 Std Dev: 1.46 Instead of making this a build or runtime config, Andy Lutomirski recommended this just be enabled by default. [1] A noisy search for many kinds of stack content leaks can be seen here: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak I did some more with perf and cycle counts on running 100,000 execs of /bin/true. before: Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841 Mean: 221015379122.60 Std Dev: 4662486552.47 after: Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348 Mean: 217745009865.40 Std Dev: 5935559279.99 It continues to look like it's faster, though the deviation is rather wide, but I'm not sure what I could do that would be less noisy. I'm open to ideas! Link: http://lkml.kernel.org/r/20180221021659.GA37073@beast Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Unbalanced refcounting in TIPC, from Jon Maloy. 2) Only allow TCP_MD5SIG to be set on sockets in close or listen state. Once the connection is established it makes no sense to change this. From Eric Dumazet. 3) Missing attribute validation in neigh_dump_table(), also from Eric Dumazet. 4) Fix address comparisons in SCTP, from Xin Long. 5) Neigh proxy table clearing can deadlock, from Wolfgang Bumiller. 6) Fix tunnel refcounting in l2tp, from Guillaume Nault. 7) Fix double list insert in team driver, from Paolo Abeni. 8) af_vsock.ko module was accidently made unremovable, from Stefan Hajnoczi. 9) Fix reference to freed llc_sap object in llc stack, from Cong Wang. 10) Don't assume netdevice struct is DMA'able memory in virtio_net driver, from Michael S. Tsirkin. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits) net/smc: fix shutdown in state SMC_LISTEN bnxt_en: Fix memory fault in bnxt_ethtool_init() virtio_net: sparse annotation fix virtio_net: fix adding vids on big-endian virtio_net: split out ctrl buffer net: hns: Avoid action name truncation docs: ip-sysctl.txt: fix name of some ipv6 variables vmxnet3: fix incorrect dereference when rxvlan is disabled llc: hold llc_sap before release_sock() MAINTAINERS: Direct networking documentation changes to netdev atm: iphase: fix spelling mistake: "Tansmit" -> "Transmit" net: qmi_wwan: add Wistron Neweb D19Q1 net: caif: fix spelling mistake "UKNOWN" -> "UNKNOWN" net: stmmac: Disable ACS Feature for GMAC >= 4 net: mvpp2: Fix DMA address mask size net: change the comment of dev_mc_init net: qualcomm: rmnet: Fix warning seen with fill_info tun: fix vlan packet truncation tipc: fix infinite loop when dumping link monitor summary tipc: fix use-after-free in tipc_nametbl_stop ...
2018-04-20Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Assorted fixes. Some of that is only a matter with fault injection (broken handling of small allocation failure in various mount-related places), but the last one is a root-triggerable stack overflow, and combined with userns it gets really nasty ;-/" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Don't leak MNT_INTERNAL away from internal mounts mm,vmscan: Allow preallocating memory for register_shrinker(). rpc_pipefs: fix double-dput() orangefs_kill_sb(): deal with allocation failures jffs2_kill_sb(): deal with failed allocations hypfs_kill_super(): deal with failed allocations
2018-04-20Merge tag 'for_v4.17-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs - isofs memory leak fix - two fsnotify fixes of event mask handling - udf fix of UTF-16 handling - couple other smaller cleanups * tag 'for_v4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Fix leak of UTF-16 surrogates into encoded strings fs: ext2: Adding new return type vm_fault_t isofs: fix potential memory leak in mount option parsing MAINTAINERS: add an entry for FSNOTIFY infrastructure fsnotify: fix typo in a comment about mark->g_list fsnotify: fix ignore mask logic in send_to_group() isofs compress: Remove VLA usage fs: quota: Replace GFP_ATOMIC with GFP_KERNEL in dquot_init fanotify: fix logic of events on child
2018-04-20Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid Pull HID updates from Jiri Kosina: - suspend/resume handling fix for Raydium I2C-connected touchscreen from Aaron Ma - protocol fixup for certain BT-connected Wacoms from Aaron Armstrong Skomra - battery level reporting fix on BT-connected mice from Dmitry Torokhov - hidraw race condition fix from Rodrigo Rivas Costa * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: HID: i2c-hid: fix inverted return value from i2c_hid_command() HID: i2c-hid: Fix resume issue on Raydium touchscreen device HID: wacom: bluetooth: send exit report for recent Bluetooth devices HID: hidraw: Fix crash on HIDIOCGFEATURE with a destroyed device HID: input: fix battery level reporting on BT mice
2018-04-20Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching Pull livepatching fix from Jiri Kosina: "Shadow variable API list_head initialization fix from Petr Mladek" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching: livepatch: Allow to call a custom callback when freeing shadow variables livepatch: Initialize shadow variables safely by a custom callback
2018-04-20lan78xx: Read LED states from Device TreePhil Elwell
Add support for DT property "microchip,led-modes", a vector of zero to four cells (u32s) in the range 0-15, each of which sets the mode for one of the LEDs. Some possible values are: 0=link/activity 1=link1000/activity 2=link100/activity 3=link10/activity 4=link100/1000/activity 5=link10/1000/activity 6=link10/100/activity 14=off 15=on These values are given symbolic constants in a dt-bindings header. Also use the presence of the DT property to indicate that the LEDs should be enabled - necessary in the event that no valid OTP or EEPROM is available. Signed-off-by: Phil Elwell <phil@raspberrypi.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20Merge tag 'for-linus-4.17-rc2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - some fixes of kmalloc() flags - one fix of the xenbus driver - an update of the pv sound driver interface needed for a driver which will go through the sound tree * tag 'for-linus-4.17-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: xenbus_dev_frontend: Really return response string xen/sndif: Sync up with the canonical definition in Xen xen: xen-pciback: Replace GFP_ATOMIC with GFP_KERNEL in pcistub_reg_add xen: xen-pciback: Replace GFP_ATOMIC with GFP_KERNEL in xen_pcibk_config_quirks_init xen: xen-pciback: Replace GFP_ATOMIC with GFP_KERNEL in pcistub_device_alloc xen: xen-pciback: Replace GFP_ATOMIC with GFP_KERNEL in pcistub_init_device xen: xen-pciback: Replace GFP_ATOMIC with GFP_KERNEL in pcistub_probe
2018-04-20qed* : Add new TLV to request PF to update MAC in bulletin boardShahed Shaikh
There may be a need for VF driver to request PF to explicitly update its bulletin with a MAC address. e.g. When user assigns a MAC address to VF while VF is still down, and PF's bulletin board contains different MAC address, in this case, when VF's interface is brought up, it gets loaded with MAC address from bulletin board which is not desirable. To handle this corner case, we need a new TLV to request PF to update its bulletin board with suggested MAC. This request will be honored only for trusted VFs. Signed-off-by: Shahed Shaikh <shahed.shaikh@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20tipc: implement configuration of UDP media MTUGhantaKrishnamurthy MohanKrishna
In previous commit, we changed the default emulated MTU for UDP bearers to 14k. This commit adds the functionality to set/change the default value by configuring new MTU for UDP media. UDP bearer(s) have to be disabled and enabled back for the new MTU to take effect. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20tipc: set default MTU for UDP mediaGhantaKrishnamurthy MohanKrishna
Currently, all bearers are configured with MTU value same as the underlying L2 device. However, in case of bearers with media type UDP, higher throughput is possible with a fixed and higher emulated MTU value than adapting to the underlying L2 MTU. In this commit, we introduce a parameter mtu in struct tipc_media and a default value is set for UDP. A default value of 14k was determined by experimentation and found to have a higher throughput than 16k. MTU for UDP bearers are assigned the above set value of media MTU. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19net-next: ax88796: add interrupt status callback to platform dataMichael Karcher
To be able to tell the ax88796 driver whether it is sensible to enter the 8390 interrupt handler, an "is this interrupt caused by the 88796" callback has been added to the ax_plat_data structure (with NULL being compatible to the previous behaviour). Signed-off-by: Michael Karcher <kernel@mkarcher.dialup.fu-berlin.de> Signed-off-by: Michael Schmitz <schmitzmic@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19net-next: ax88796: Add block_input/output hooks to ax_plat_dataMichael Karcher
Add platform specific hooks for block transfer reads/writes of packet buffer data, superseding the default provided ax_block_input/output. Currently used for m68k Amiga XSurf100. Signed-off-by: Michael Karcher <kernel@mkarcher.dialup.fu-berlin.de> Signed-off-by: Michael Schmitz <schmitzmic@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19net: phy: mdio-gpio: Remove redundant platform data headerAndrew Lunn
The platform data header file is now unused. Remove it, but add an extra include which it brought in. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19net: phy: mdio-gpio: Add #defines for the GPIO index'sAndrew Lunn
The GPIOs are described in device tree using a list, without names. Add defines to indicate what each index in the list means. These defines should also be used by platform devices passing GPIOs via a GPIO lookup table. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19net: phy: mdio-gpio: Swap to using gpio descriptorsAndrew Lunn
This simplifies the code, removing the need to handle active low flags, etc. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19net: phy: mdio-gpio: Remove support for IRQs in platform dataAndrew Lunn
No current devices use IRQs in platform data, so remove support for it. The MDIO core will also initialise the new bus such that all addresses are polled, so remove the unneeded re-initialisation. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19net: phy: mdio-gpio: remove support for phy maskAndrew Lunn
This is not needed any more by devices using platform data, so remove it. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19net: phy: mdio-gpio: remove support for ignoring turn aroundAndrew Lunn
This is not needed any more by devices using platform data, so remove it. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19net: phy: mdio-bitbang: Remove reset supportAndrew Lunn
The mdio-gpio driver was the only user of the interface reset option. Since it no longer uses it, remove it from the bit banging code. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19net: phy: mdio-gpio: Remove reset functionAndrew Lunn
The platform data can contain a function to call to reset the bit banging interface. It is not used, so remove it. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>