linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2017-08-20	media: rc: simplify ir_raw_event_store_edge()	Sean Young
	Since commit 12749b198fa4 ("[media] rc: saa7134: add trailing space for timely decoding"), the workaround of inserting reset events is no longer needed. Note that the initial reset is not needed either; other rc-core drivers that don't use ir_raw_event_store_edge() never call this at all. Verified on a HVR-1150 and Raspberry Pi. Fixes: 3f5c4c73322e ("[media] rc: fix ghost keypresses with certain hw") Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-08-20	media: rc: add zx-irdec remote control driver	Shawn Guo
	It adds the remote control driver and corresponding keymap file for IRDEC block found on ZTE ZX family SoCs. Signed-off-by: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-08-20	media: rc: ir-nec-decoder: move scancode composing code into a shared function	Shawn Guo
	The NEC scancode composing and protocol type detection in ir_nec_decode() is generic enough to be a shared function. Let's create an inline function in rc-core.h, so that other remote control drivers can reuse this function to save some code. Signed-off-by: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-08-20	media: rc-core: rename input_name to device_name	Sean Young
	When an ir-spi is registered, you get this message. rc rc0: Unspecified device as /devices/platform/soc/3f215080.spi/spi_master/spi32766/spi32766.128/rc/rc0 "Unspecified device" refers to input_name, which makes no sense for IR TX only devices. So, rename to device_name. Also make driver_name const char* so that no casts are needed anywhere. Now ir-spi reports: rc rc0: IR SPI as /devices/platform/soc/3f215080.spi/spi_master/spi32766/spi32766.128/rc/rc0 Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-08-20	media: Convert to using %pOF instead of full_name	Rob Herring
	Now that we have a custom printf format specifier, convert users of full_name to use %pOF instead. This is preparation to remove storing of the full path string for each node. Signed-off-by: Rob Herring <robh@kernel.org> Acked-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Acked-by: Lad, Prabhakar <prabhakar.csengg@gmail.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Andrzej Hajda <a.hajda@samsung.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Songjun Wu <songjun.wu@microchip.com> Cc: Kukjin Kim <kgene@kernel.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Javier Martinez Canillas <javier@osg.samsung.com> Cc: Minghsiu Tsai <minghsiu.tsai@mediatek.com> Cc: Houlong Wei <houlong.wei@mediatek.com> Cc: Andrew-CT Chen <andrew-ct.chen@mediatek.com> Cc: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Cc: Hyun Kwon <hyun.kwon@xilinx.com> Cc: Michal Simek <michal.simek@xilinx.com> Cc: "Sören Brinkmann" <soren.brinkmann@xilinx.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-mediatek@lists.infradead.org Cc: linux-renesas-soc@vger.kernel.org Reviewed-by: Sylwester Nawrocki <s.nawrocki@samsung.com> Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-08-20	media: cec-pin: fix irq handling	Hans Verkuil
	The free_irq() function could be called from interrupt context, which is invalid. Move this to the thread. In the interrupt handler we just request that the thread disables the irq. This is done through an atomic so we don't need to add any spinlocks. Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-08-20	media: cec: rename pin events/function	Hans Verkuil
	The CEC_EVENT_PIN_LOW/HIGH defines and the cec_queue_pin_event() function did not specify that these were about CEC pin events. Since in the future there will also be HPD pin events it is wise to rename the event defines and function to CEC_EVENT_PIN_CEC_LOW/HIGH and cec_queue_pin_cec_event() now before these become part of the ABI. Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-08-20	media: v4l2-ctrls.h: better document the arguments for v4l2_ctrl_fill	Mauro Carvalho Chehab
	The arguments for this function are pointers. Make it clear at its documentation. Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-08-20	net/mlx5e: Add outbound PCI buffer overflow counter	Eran Ben Elisha
	Add outbound_pci_buffer_overflow to ethtool output for monitoring the number of packets that were dropped due to lack of PCIe buffers on receive path from NIC port toward the host(s). This counter is valid only in case that tx_overflow_buffer_pkt is supported in MCAM enhanced features. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-20	net/mlx5: Add RX buffer fullness counters infrastructure	Gal Pressman
	Add capability bit in PCAM register and counters to PPCNT register. Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-20	net/mlx5: Add PCIe outbound stalls counters infrastructure	Gal Pressman
	Add capability bit in MCAM register and counters to MPCNT register. Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-19	bpf: linux/bpf.h needs linux/numa.h	David S. Miller
	Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-19	bpf: Allow selecting numa node during map creation	Martin KaFai Lau
	The current map creation API does not allow to provide the numa-node preference. The memory usually comes from where the map-creation-process is running. The performance is not ideal if the bpf_prog is known to always run in a numa node different from the map-creation-process. One of the use case is sharding on CPU to different LRU maps (i.e. an array of LRU maps). Here is the test result of map_perf_test on the INNER_LRU_HASH_PREALLOC test if we force the lru map used by CPU0 to be allocated from a remote numa node: [ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ] ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000 5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec 4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec 3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec 6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec 2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec 1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec 7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec 0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec #<<< After specifying numa node: ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000 5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec 3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec 1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec 6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec 2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec 4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec 7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec 0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<< This patch adds one field, numa_node, to the bpf_attr. Since numa node 0 is a valid node, a new flag BPF_F_NUMA_NODE is also added. The numa_node field is honored if and only if the BPF_F_NUMA_NODE flag is set. Numa node selection is not supported for percpu map. This patch does not change all the kmalloc. F.e. 'htab = kzalloc()' is not changed since the object is small enough to stay in the cache. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-19	netfilter: rt: add support to fetch path mss	Florian Westphal
	to be used in combination with tcp option set support to mimic iptables TCPMSS --clamp-mss-to-pmtu. v2: Eric Dumazet points out dst must be initialized. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-19	netfilter: exthdr: tcp option set support	Florian Westphal
	This allows setting 2 and 4 byte quantities in the tcp option space. Main purpose is to allow native replacement for xt_TCPMSS to work around pmtu blackholes. Writes to kind and len are now allowed at the moment, it does not seem useful to do this as it causes corruption of the tcp option space. We can always lift this restriction later if a use-case appears. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-18	net: drop unused attribute argument from sysfs queue funcs	stephen hemminger
	The show and store functions don't need/use the attribute. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18	net: constify net_ns_type_operations	stephen hemminger
	This can be const. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18	net: constify netdev_class_file	stephen hemminger
	These functions are wrapper arount class_create_file which can take a const attribute. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18	net: ethtool: Add macro to clear a link mode setting	Lendacky, Thomas
	There are currently macros to set and test an ETHTOOL_LINK_MODE_ setting, but not to clear one. Add a macro to clear an ETHTOOL_LINK_MODE_ setting. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18	xdp: adjust xdp redirect tracepoint to include return error code	Jesper Dangaard Brouer
	The return error code need to be included in the tracepoint xdp:xdp_redirect, else its not possible to distinguish successful or failed XDP_REDIRECT transmits. XDP have no queuing mechanism. Thus, it is fairly easily to overrun a NIC transmit queue. The eBPF program invoking helpers (bpf_redirect or bpf_redirect_map) to redirect a packet doesn't get any feedback whether the packet was actually transmitted. Info on failed transmits in the tracepoint xdp:xdp_redirect, is interesting as this opens for providing a feedback-loop to the receiving XDP program. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18	net: inet: diag: expose sockets cgroup classid	Levin, Alexander (Sasha Levin)
	This is useful for directly looking up a task based on class id rather than having to scan through all open file descriptors. Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18	mm, oom: fix potential data corruption when oom_reaper races with writer	Michal Hocko
	Wenwei Tao has noticed that our current assumption that the oom victim is dying and never doing any visible changes after it dies, and so the oom_reaper can tear it down, is not entirely true. __task_will_free_mem consider a task dying when SIGNAL_GROUP_EXIT is set but do_group_exit sends SIGKILL to all threads _after_ the flag is set. So there is a race window when some threads won't have fatal_signal_pending while the oom_reaper could start unmapping the address space. Moreover some paths might not check for fatal signals before each PF/g-u-p/copy_from_user. We already have a protection for oom_reaper vs. PF races by checking MMF_UNSTABLE. This has been, however, checked only for kernel threads (use_mm users) which can outlive the oom victim. A simple fix would be to extend the current check in handle_mm_fault for all tasks but that wouldn't be sufficient because the current check assumes that a kernel thread would bail out after EFAULT from get_user*/copy_from_user and never re-read the same address which would succeed because the PF path has established page tables already. This seems to be the case for the only existing use_mm user currently (virtio driver) but it is rather fragile in general. This is even more fragile in general for more complex paths such as generic_perform_write which can re-read the same address more times (e.g. iov_iter_copy_from_user_atomic to fail and then iov_iter_fault_in_readable on retry). Therefore we have to implement MMF_UNSTABLE protection in a robust way and never make a potentially corrupted content visible. That requires to hook deeper into the PF path and check for the flag _every time_ before a pte for anonymous memory is established (that means all !VM_SHARED mappings). The corruption can be triggered artificially (http://lkml.kernel.org/r/201708040646.v746kkhC024636@www262.sakura.ne.jp) but there doesn't seem to be any real life bug report. The race window should be quite tight to trigger most of the time. Link: http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org Fixes: aac453635549 ("mm, oom: introduce oom reaper") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Wenwei Tao <wenwei.tww@alibaba-inc.com> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18	mm: discard memblock data later	Pavel Tatashin
	There is existing use after free bug when deferred struct pages are enabled: The memblock_add() allocates memory for the memory array if more than 128 entries are needed. See comment in e820__memblock_setup(): * The bootstrap memblock region count maximum is 128 entries * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries * than that - so allow memblock resizing. This memblock memory is freed here: free_low_memory_core_early() We access the freed memblock.memory later in boot when deferred pages are initialized in this path: deferred_init_memmap() for_each_mem_pfn_range() __next_mem_pfn_range() type = &memblock.memory; One possible explanation for why this use-after-free hasn't been hit before is that the limit of INIT_MEMBLOCK_REGIONS has never been exceeded at least on systems where deferred struct pages were enabled. Tested by reducing INIT_MEMBLOCK_REGIONS down to 4 from the current 128, and verifying in qemu that this code is getting excuted and that the freed pages are sane. Link: http://lkml.kernel.org/r/1502485554-318703-2-git-send-email-pasha.tatashin@oracle.com Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd") Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18	wait: add wait_event_killable_timeout()	Luis R. Rodriguez
	These are the few pending fixes I have queued up for v4.13-final. One is a a generic regression fix for recursive loops on kmod and the other one is a trivial print out correction. During the v4.13 development we assumed that recursive kmod loops were no longer possible. Clearly that is not true. The regression fix makes use of a new killable wait. We use a killable wait to be paranoid in how signals might be sent to modprobe and only accept a proper SIGKILL. The signal will only be available to userspace to issue iff a thread has already entered a wait state, and that happens only if we've already throttled after 50 kmod threads have been hit. Note that although it may seem excessive to trigger a failure afer 5 seconds if all kmod thread remain busy, prior to the series of changes that went into v4.13 we would actually always fatally fail any request which came in if the limit was already reached. The new waiting implemented in v4.13 actually gives us more breathing room -- the wait for 5 seconds is a wait for any kmod thread to finish. We give up and fail iff no kmod thread has finished and they're all running straight for 5 consecutive seconds. If 50 kmod threads are running consecutively for 5 seconds something else must be really bad. Recursive loops with kmod are bad but they're also hard to implement properly as a selftest without currently fooling current userspace tools like kmod [1]. For instance kmod will complain when you run depmod if it finds a recursive loop with symbol dependency between modules as such this type of recursive loop cannot go upstream as the modules_install target will fail after running depmod. These tests already exist on userspace kmod upstream though (refer to the testsuite/module-playground/mod-loop-*.c files). The same is not true if request_module() is used though, or worst if aliases are used. Likewise the issue with 64-bit kernels booting 32-bit userspace without a binfmt handler built-in is also currently not detected and proactively avoided by userspace kmod tools, or kconfig for all architectures. Although we could complain in the kernel when some of these individual recursive issues creep up, proactively avoiding these situations in userspace at build time is what we should keep striving for. Lastly, since recursive loops could happen with kmod it may mean recursive loops may also be possible with other kernel usermode helpers, this should be investigated and long term if we can come up with a more sensible generic solution even better! [0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20170809-kmod-for-v4.13-final [1] https://git.kernel.org/pub/scm/utils/kernel/kmod/kmod.git This patch (of 3): This wait is similar to wait_event_interruptible_timeout() but only accepts SIGKILL interrupt signal. Other signals are ignored. Link: http://lkml.kernel.org/r/20170809234635.13443-2-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Jessica Yu <jeyu@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michal Marek <mmarek@suse.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Matt Redfearn <matt.redfearn@imgtec.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Daniel Mentz <danielmentz@google.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Matt Redfearn <matt.redfearn@imgetc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18	mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback()	Johannes Weiner
	Jaegeuk and Brad report a NULL pointer crash when writeback ending tries to update the memcg stats: BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0 IP: test_clear_page_writeback+0x12e/0x2c0 [...] RIP: 0010:test_clear_page_writeback+0x12e/0x2c0 Call Trace: <IRQ> end_page_writeback+0x47/0x70 f2fs_write_end_io+0x76/0x180 [f2fs] bio_endio+0x9f/0x120 blk_update_request+0xa8/0x2f0 scsi_end_request+0x39/0x1d0 scsi_io_completion+0x211/0x690 scsi_finish_command+0xd9/0x120 scsi_softirq_done+0x127/0x150 __blk_mq_complete_request_remote+0x13/0x20 flush_smp_call_function_queue+0x56/0x110 generic_smp_call_function_single_interrupt+0x13/0x30 smp_call_function_single_interrupt+0x27/0x40 call_function_single_interrupt+0x89/0x90 RIP: 0010:native_safe_halt+0x6/0x10 (gdb) l (test_clear_page_writeback+0x12e) 0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619). 614 mod_node_page_state(page_pgdat(page), idx, val); 615 if (mem_cgroup_disabled() \|\| !page->mem_cgroup) 616 return; 617 mod_memcg_state(page->mem_cgroup, idx, val); 618 pn = page->mem_cgroup->nodeinfo[page_to_nid(page)]; 619 this_cpu_add(pn->lruvec_stat->count[idx], val); 620 } 621 622 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t pgdat, int order, 623 gfp_t gfp_mask, The issue is that writeback doesn't hold a page reference and the page might get freed after PG_writeback is cleared (and the mapping is unlocked) in test_clear_page_writeback(). The stat functions looking up the page's node or zone are safe, as those attributes are static across allocation and free cycles. But page->mem_cgroup is not, and it will get cleared if we race with truncation or migration. It appears this race window has been around for a while, but less likely to trigger when the memcg stats were updated first thing after PG_writeback is cleared. Recent changes reshuffled this code to update the global node stats before the memcg ones, though, stretching the race window out to an extent where people can reproduce the problem. Update test_clear_page_writeback() to look up and pin page->mem_cgroup before clearing PG_writeback, then not use that pointer afterward. It is a partial revert of 62cccb8c8e7a ("mm: simplify lock_page_memcg()") but leaves the pageref-holding callsites that aren't affected alone. Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org Fixes: 62cccb8c8e7a ("mm: simplify lock_page_memcg()") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Jaegeuk Kim <jaegeuk@kernel.org> Tested-by: Jaegeuk Kim <jaegeuk@kernel.org> Reported-by: Bradley Bolen <bradleybolen@gmail.com> Tested-by: Brad Bolen <bradleybolen@gmail.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> [4.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18	ipv4: convert dst_metrics.refcnt from atomic_t to refcount_t	Eric Dumazet
	refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18	datagram: When peeking datagrams with offset < 0 don't skip empty skbs	Matthew Dawson
	Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove headers from UDP packets before queueing"), when udp packets are being peeked the requested extra offset is always 0 as there is no need to skip the udp header. However, when the offset is 0 and the next skb is of length 0, it is only returned once. The behaviour can be seen with the following python script: from socket import *; f=socket(AF_INET6, SOCK_DGRAM \| SOCK_NONBLOCK, 0); g=socket(AF_INET6, SOCK_DGRAM \| SOCK_NONBLOCK, 0); f.bind(('::', 0)); addr=('::1', f.getsockname()[1]); g.sendto(b'', addr) g.sendto(b'b', addr) print(f.recvfrom(10, MSG_PEEK)); print(f.recvfrom(10, MSG_PEEK)); Where the expected output should be the empty string twice. Instead, make sk_peek_offset return negative values, and pass those values to __skb_try_recv_datagram/__skb_try_recv_from_queue. If the passed offset to __skb_try_recv_from_queue is negative, the checked skb is never skipped. __skb_try_recv_from_queue will then ensure the offset is reset back to 0 if a peek is requested without an offset, unless no packets are found. Also simplify the if condition in __skb_try_recv_from_queue. If _off is greater then 0, and off is greater then or equal to skb->len, then (_off \|\| skb->len) must always be true assuming skb->len >= 0 is always true. Also remove a redundant check around a call to sk_peek_offset in af_unix.c, as it double checked if MSG_PEEK was set in the flags. V2: - Moved the negative fixup into __skb_try_recv_from_queue, and remove now redundant checks - Fix peeking in udp{,v6}_recvmsg to report the right value when the offset is 0 V3: - Marked new branch in __skb_try_recv_from_queue as unlikely. Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18	Add OPA extended LID support	Hiatt, Don
	This patch series primarily increases sizes of variables that hold lid values from 16 to 32 bits. Additionally, it adds a check in the IB mad stack to verify a properly formatted MAD when OPA extended LIDs are used. Signed-off-by: Don Hiatt <don.hiatt@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-18	Merge branch 'misc' into k.o/for-next	Doug Ledford
	Conflicts: drivers/infiniband/core/iwcm.c - The rdma_netlink patches in HEAD and the iwarp cm workqueue fix (don't use WQ_MEM_RECLAIM, we aren't safe for that context) touched the same code. Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-18	infiniband: avoid overflow warning	Arnd Bergmann
	A sockaddr_in structure on the stack getting passed into rdma_ip2gid triggers this warning, since we memcpy into a larger sockaddr_in6 structure: In function 'memcpy', inlined from 'rdma_ip2gid' at include/rdma/ib_addr.h:175:3, inlined from 'addr_event.isra.4.constprop' at drivers/infiniband/core/roce_gid_mgmt.c:693:2, inlined from 'inetaddr_event' at drivers/infiniband/core/roce_gid_mgmt.c:716:9: include/linux/string.h:305:4: error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter The warning seems appropriate here, but the code is also clearly correct, so we really just want to shut up this instance of the output. The best way I found so far is to avoid the memcpy() call and instead replace it with a struct assignment. Fixes: 6974f0c4555e ("include/linux/string.h: add the option of fortified string.h functions") Cc: Daniel Micay <danielmicay@gmail.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-18	PCI/IB: add support for pci driver attribute groups	Greg Kroah-Hartman
	Some drivers (specifically the nes IB driver), want to create a lot of sysfs driver attributes. Instead of open-coding the creation and removal of these files (and getting it wrong btw), it's a better idea to let the driver core handle all of this logic for us. So add a new field to the pci driver structure, **groups, that allows pci drivers to specify an attribute group list it wishes to have created when it is registered with the driver core. Big bonus is now the driver doesn't race with userspace when the sysfs files are created vs. when the kobject is announced, so any script/tool that actually wanted to use these files will not have to poll waiting for them to show up. Cc: Faisal Latif <faisal.latif@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-18	cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup	Waiman Long
	A new mount option "cpuset_v2_mode" is added to the v1 cgroupfs filesystem to enable cpuset controller to use v2 behavior in a v1 cgroup. This mount option applies only to cpuset controller and have no effect on other controllers. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-08-18	Merge airlied/drm-next into drm-misc-next	Sean Paul
	Archit requested this backmerge to facilitate merging some patches depending on changes between -rc2 & -rc5 Signed-off-by: Sean Paul <seanpaul@chromium.org>
2017-08-18	blk-mq: Make blk_mq_reinit_tagset() calls easier to read	Bart Van Assche
	Since blk_mq_ops.reinit_request is only called from inside blk_mq_reinit_tagset(), make this function pointer an argument of blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops. This patch does not change any functionality but makes blk_mq_reinit_tagset() calls easier to read and to analyze. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: James Smart <james.smart@broadcom.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-18	kernel/watchdog: Prevent false positives with turbo modes	Thomas Gleixner
	The hardlockup detector on x86 uses a performance counter based on unhalted CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the performance counter period, so the hrtimer should fire 2-3 times before the performance counter NMI fires. The NMI code checks whether the hrtimer fired since the last invocation. If not, it assumess a hard lockup. The calculation of those periods is based on the nominal CPU frequency. Turbo modes increase the CPU clock frequency and therefore shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x nominal frequency) the perf/NMI period is shorter than the hrtimer period which leads to false positives. A simple fix would be to shorten the hrtimer period, but that comes with the side effect of more frequent hrtimer and softlockup thread wakeups, which is not desired. Implement a low pass filter, which checks the perf/NMI period against kernel time. If the perf/NMI fires before 4/5 of the watchdog period has elapsed then the event is ignored and postponed to the next perf/NMI. That solves the problem and avoids the overhead of shorter hrtimer periods and more frequent softlockup thread wakeups. Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector") Reported-and-tested-by: Kan Liang <Kan.liang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dzickus@redhat.com Cc: prarit@redhat.com Cc: ak@linux.intel.com Cc: babu.moger@oracle.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: acme@redhat.com Cc: stable@vger.kernel.org Cc: atomlin@redhat.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
2017-08-18	Merge branch 'irq/for-gpio' into irq/core	Thomas Gleixner
	Merge the flow handlers and irq domain extensions which are in a separate branch so they can be consumed by the gpio folks.
2017-08-18	irqdomain: Add irq_domain_{push,pop}_irq() functions	David Daney
	For an already existing irqdomain hierarchy, as might be obtained via a call to pci_enable_msix_range(), a PCI driver wishing to add an additional irqdomain to the hierarchy needs to be able to insert the irqdomain to that already initialized hierarchy. Calling irq_domain_create_hierarchy() allows the new irqdomain to be created, but no existing code allows for initializing the associated irq_data. Add a couple of helper functions (irq_domain_push_irq() and irq_domain_pop_irq()) to initialize the irq_data for the new irqdomain added to an existing hierarchy. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexandre Courbot <gnurou@gmail.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-gpio@vger.kernel.org Link: http://lkml.kernel.org/r/1503017616-3252-6-git-send-email-david.daney@cavium.com
2017-08-18	genirq: Add handle_fasteoi_{level,edge}_irq flow handlers	David Daney
	Follow-on patch for gpio-thunderx uses a irqdomain hierarchy which requires slightly different flow handlers, add them to chip.c which contains most of the other flow handlers. Make these conditionally compiled based on CONFIG_IRQ_FASTEOI_HIERARCHY_HANDLERS. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexandre Courbot <gnurou@gmail.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-gpio@vger.kernel.org Link: http://lkml.kernel.org/r/1503017616-3252-3-git-send-email-david.daney@cavium.com
2017-08-18	genirq: Restrict effective affinity to interrupts actually using it	Marc Zyngier
	Just because CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK is selected doesn't mean that all the interrupts are using the effective affinity mask. For a number of them, this mask is likely to be empty. In order to deal with this, let's restrict the use of the effective affinity mask to these interrupts that have a non empty effective affinity. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Lunn <andrew@lunn.ch> Cc: James Hogan <james.hogan@imgtec.com> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Chris Zankel <chris@zankel.net> Cc: Kevin Cernekee <cernekee@gmail.com> Cc: Wei Xu <xuwei5@hisilicon.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Gregory Clement <gregory.clement@free-electrons.com> Cc: Matt Redfearn <matt.redfearn@imgtec.com> Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com> Link: http://lkml.kernel.org/r/20170818083925.10108-2-marc.zyngier@arm.com
2017-08-18	Merge branch 'x86/asm' into locking/core	Ingo Molnar
	We need the ASM_UNREACHABLE() macro for a dependent patch. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-18	ACPI / PM: Check low power idle constraints for debug only	Srinivas Pandruvada
	For SoC to achieve its lowest power platform idle state a set of hardware preconditions must be met. These preconditions or constraints can be obtained by issuing a device specific method (_DSM) with function "1". Refer to the document provided in the link below. Here during initialization (from attach() callback of LPS0 device), invoke function 1 to get the device constraints. Each enabled constraint is stored in a table. The devices in this table are used to check whether they were in required minimum state, while entering suspend. This check is done from platform freeze wake() callback, only when /sys/power/pm_debug_messages attribute is non zero. If any constraint is not met and device is ACPI power managed then it prints the device information to kernel logs. Also if debug is enabled in acpi/sleep.c, the constraint table and state of each device on wake is dumped in kernel logs. Since pm_debug_messages_on setting is used as condition to check constraints outside kernel/power/main.c, pm_debug_messages_on is changed to a global variable. Link: http://www.uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-17	Revert "pstore: Honor dmesg_restrict sysctl on dmesg dumps"	Kees Cook
	This reverts commit 68c4a4f8abc60c9440ede9cd123d48b78325f7a3, with various conflict clean-ups. The capability check required too much privilege compared to simple DAC controls. A system builder was forced to have crash handler processes run with CAP_SYSLOG which would give it the ability to read (and wipe) the _current_ dmesg, which is much more access than being given access only to the historical log stored in pstorefs. With the prior commit to make the root directory 0750, the files are protected by default but a system builder can now opt to give access to a specific group (via chgrp on the pstorefs root directory) without being forced to also give away CAP_SYSLOG. Suggested-by: Nick Kralevich <nnk@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.cz> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
2017-08-17	quota: Reduce contention on dq_data_lock	Jan Kara
	dq_data_lock is currently used to protect all modifications of quota accounting information, consistency of quota accounting on the inode, and dquot pointers from inode. As a result contention on the lock can be pretty heavy. Reduce the contention on the lock by protecting quota accounting information by a new dquot->dq_dqb_lock and consistency of quota accounting with inode usage by inode->i_lock. This change reduces time to create 500000 files on ext4 on ramdisk by 50 different processes in separate directories by 6% when user quota is turned on. When those 50 processes belong to 50 different users, the improvement is about 9%. Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17	fs: Provide __inode_get_bytes()	Jan Kara
	Provide helper __inode_get_bytes() which assumes i_lock is already acquired. Quota code will need this to be able to use i_lock to protect consistency of quota accounting information and inode usage. Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17	quota: Inline functions into their callsites	Jan Kara
	inode_add_rsv_space() and inode_sub_rsv_space() had only one callsite. Inline them there directly. inode_claim_rsv_space() and inode_reclaim_rsv_space() had two callsites so inline them there as well. This will simplify further locking changes. Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17	quota: Allow disabling tracking of dirty dquots in a list	Jan Kara
	Filesystems that are journalling quotas generally don't need tracking of dirty dquots in a list since forcing a transaction commit flushes all quotas anyway. Allow filesystem to say it doesn't want dquots to be tracked as it reduces contention on the dq_list_lock. Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17	quota: Remove dq_wait_unused from dquot	Jan Kara
	Currently every dquot carries a wait_queue_head_t used only when we are turning quotas off to wait for last users to drop dquot references. Since such rare case is not performance sensitive in any means, just use a global waitqueue for this and save space in struct dquot. Also convert the logic to use wait_event() instead of open-coding it. Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17	drm/ttm: make ttm_mem_type_manager_func debug more useful	Christian König
	Provide the drm printer directly instead of just the callback. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-08-18	Merge tag 'omapdrm-4.14' of ↵	Dave Airlie
	git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux into drm-next omapdrm changes for v4.14 * HDMI hot plug IRQ support (instead of polling) * Big driver cleanup from Laurent (no functional changes) * OMAP5 DSI support (only the pinmuxing was missing) * tag 'omapdrm-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux: (60 commits) drm/omap: Potential NULL deref in omap_crtc_duplicate_state() drm/omap: remove no-op cleanup code drm/omap: rename omapdrm device back drm: omapdrm: Remove omapdrm platform data ARM: OMAP2+: Don't register omapdss device for omapdrm ARM: OMAP2+: Remove unused omapdrm platform device drm: omapdrm: Remove the omapdss driver drm: omapdrm: Register omapdrm platform device in omapdss driver drm: omapdrm: hdmi: Don't allocate PHY features dynamically drm: omapdrm: hdmi: Configure the PHY from the HDMI core version drm: omapdrm: hdmi: Configure the PLL from the HDMI core version drm: omapdrm: hdmi: Pass HDMI core version as integer to HDMI audio drm: omapdrm: hdmi: Replace OMAP SoC model check with HDMI xmit version drm: omapdrm: hdmi: Rename functions and structures to use hdmi_ prefix drm/omap: add OMAP5 DSIPHY lane-enable support drm/omap: use regmap_update_bit() when muxing DSI pads drm: omapdrm: Remove dss_features.h drm: omapdrm: Move supported outputs feature to dss driver drm: omapdrm: Move DSS_FCK feature to dss driver drm: omapdrm: Move PCD, LINEWIDTH and DOWNSCALE features to dispc driver ...
2017-08-17	ACPICA: Make it possible to enable runtime GPEs earlier	Rafael J. Wysocki
	Runtime GPEs have corresponding _Lxx/_Exx methods and are enabled automatically during the initialization of the ACPI subsystem through acpi_update_all_gpes() with the assumption that acpi_setup_gpe_for_wake() will be called in advance for all of the GPEs pointed to by _PRW objects in the namespace that may be affected by acpi_update_all_gpes(). That is, acpi_ev_initialize_gpe_block() can only be called for a GPE block after acpi_setup_gpe_for_wake() has been called for all of the _PRW (wakeup) GPEs in it. The platform firmware on some systems, however, expects GPEs to be enabled before the enumeration of devices which is when acpi_setup_gpe_for_wake() is called and that goes against the above assumption. For this reason, introduce a new flag to be set by acpi_ev_initialize_gpe_block() when automatically enabling a GPE to indicate to acpi_setup_gpe_for_wake() that it needs to drop the reference to the GPE coming from acpi_ev_initialize_gpe_block() and modify acpi_setup_gpe_for_wake() accordingly. These changes allow acpi_setup_gpe_for_wake() and acpi_ev_initialize_gpe_block() to be invoked in any order. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>