summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2019-11-17IB/umem: remove the dmasync argument to ib_umem_getChristoph Hellwig
The argument is always ignored, so remove it. Link: https://lore.kernel.org/r/20191113073214.9514-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Lots of overlapping changes and parallel additions, stuff like that. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16percpu-refcount: Use normal instead of RCU-sched"Sebastian Andrzej Siewior
This is a revert of commit a4244454df129 ("percpu-refcount: use RCU-sched insted of normal RCU") which claims the only reason for using RCU-sched is "rcu_read_[un]lock() … are slightly more expensive than preempt_disable/enable()" and "As the RCU critical sections are extremely short, using sched-RCU shouldn't have any latency implications." The problem with using RCU-sched here is that it disables preemption and the release callback (called from percpu_ref_put_many()) must not acquire any sleeping locks like spinlock_t. This breaks PREEMPT_RT because some of the users acquire spinlock_t locks in their callbacks. Using rcu_read_lock() on PREEMPTION=n kernels is not any different compared to rcu_read_lock_sched(). On PREEMPTION=y kernels there are already performance issues due to additional preemption points. Looking at the code, the rcu_read_lock() is just an increment and unlock is almost just a decrement unless there is something special to do. Both are functions while disabling preemption is inlined. Doing a small benchmark, the minimal amount of time required was mostly the same. The average time required was higher due to the higher MAX value (which could be preemption). With DEBUG_PREEMPT=y it is rcu_read_lock_sched() that takes a little longer due to the additional debug code. Convert back to normal RCU. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-11-17crypto: ablkcipher - remove deprecated and unused ablkcipher supportArd Biesheuvel
Now that all users of the deprecated ablkcipher interface have been moved to the skcipher interface, ablkcipher is no longer used and can be removed. Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17crypto: lib/chacha20poly1305 - reimplement crypt_from_sg() routineArd Biesheuvel
Reimplement the library routines to perform chacha20poly1305 en/decryption on scatterlists, without [ab]using the [deprecated] blkcipher interface, which is rather heavyweight and does things we don't really need. Instead, we use the sg_miter API in a novel and clever way, to iterate over the scatterlist in-place (i.e., source == destination, which is the only way this library is expected to be used). That way, we don't have to iterate over two scatterlists in parallel. Another optimization is that, instead of relying on the blkcipher walker to present the input in suitable chunks, we recognize that ChaCha is a streamcipher, and so we can simply deal with partial blocks by keeping a block of cipherstream on the stack and use crypto_xor() to mix it with the in/output. Finally, we omit the scatterwalk_and_copy() call if the last element of the scatterlist covers the MAC as well (which is the common case), avoiding the need to walk the scatterlist and kmap() the page twice. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17crypto: chacha20poly1305 - import construction and selftest from ZincArd Biesheuvel
This incorporates the chacha20poly1305 from the Zinc library, retaining the library interface, but replacing the implementation with calls into the code that already existed in the kernel's crypto API. Note that this library API does not implement RFC7539 fully, given that it is limited to 64-bit nonces. (The 96-bit nonce version that was part of the selftest only has been removed, along with the 96-bit nonce test vectors that only tested the selftest but not the actual library itself) Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17crypto: curve25519 - generic C library implementationsJason A. Donenfeld
This contains two formally verified C implementations of the Curve25519 scalar multiplication function, one for 32-bit systems, and one for 64-bit systems whose compiler supports efficient 128-bit integer types. Not only are these implementations formally verified, but they are also the fastest available C implementations. They have been modified to be friendly to kernel space and to be generally less horrendous looking, but still an effort has been made to retain their formally verified characteristic, and so the C might look slightly unidiomatic. The 64-bit version comes from HACL*: https://github.com/project-everest/hacl-star The 32-bit version comes from Fiat: https://github.com/mit-plv/fiat-crypto Information: https://cr.yp.to/ecdh.html Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> [ardb: - move from lib/zinc to lib/crypto - replace .c #includes with Kconfig based object selection - drop simd handling and simplify support for per-arch versions ] Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17crypto: blake2s - implement generic shash driverArd Biesheuvel
Wire up our newly added Blake2s implementation via the shash API. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17crypto: blake2s - generic C library implementation and selftestJason A. Donenfeld
The C implementation was originally based on Samuel Neves' public domain reference implementation but has since been heavily modified for the kernel. We're able to do compile-time optimizations by moving some scaffolding around the final function into the header file. Information: https://blake2.net/ Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Samuel Neves <sneves@dei.uc.pt> Co-developed-by: Samuel Neves <sneves@dei.uc.pt> [ardb: - move from lib/zinc to lib/crypto - remove simd handling - rewrote selftest for better coverage - use fixed digest length for blake2s_hmac() and rename to blake2s256_hmac() ] Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17crypto: x86/poly1305 - depend on generic library not generic shashArd Biesheuvel
Remove the dependency on the generic Poly1305 driver. Instead, depend on the generic library so that we only reuse code without pulling in the generic skcipher implementation as well. While at it, remove the logic that prefers the non-SIMD path for short inputs - this is no longer necessary after recent FPU handling changes on x86. Since this removes the last remaining user of the routines exported by the generic shash driver, unexport them and make them static. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17crypto: poly1305 - expose init/update/final library interfaceArd Biesheuvel
Expose the existing generic Poly1305 code via a init/update/final library interface so that callers are not required to go through the crypto API's shash abstraction to access it. At the same time, make some preparations so that the library implementation can be superseded by an accelerated arch-specific version in the future. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17crypto: x86/poly1305 - unify Poly1305 state struct with generic codeArd Biesheuvel
In preparation of exposing a Poly1305 library interface directly from the accelerated x86 driver, align the state descriptor of the x86 code with the one used by the generic driver. This is needed to make the library interface unified between all implementations. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17crypto: poly1305 - move core routines into a separate libraryArd Biesheuvel
Move the core Poly1305 routines shared between the generic Poly1305 shash driver and the Adiantum and NHPoly1305 drivers into a separate library so that using just this pieces does not pull in the crypto API pieces of the generic Poly1305 routine. In a subsequent patch, we will augment this generic library with init/update/final routines so that Poyl1305 algorithm can be used directly without the need for using the crypto API's shash abstraction. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17crypto: chacha - unexport chacha_generic routinesArd Biesheuvel
Now that all users of generic ChaCha code have moved to the core library, there is no longer a need for the generic ChaCha skcpiher driver to export parts of it implementation for reuse by other drivers. So drop the exports, and make the symbols static. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17crypto: x86/chacha - expose SIMD ChaCha routine as library functionArd Biesheuvel
Wire the existing x86 SIMD ChaCha code into the new ChaCha library interface, so that users of the library interface will get the accelerated version when available. Given that calls into the library API will always go through the routines in this module if it is enabled, switch to static keys to select the optimal implementation available (which may be none at all, in which case we defer to the generic implementation for all invocations). Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-17crypto: chacha - move existing library code into lib/cryptoArd Biesheuvel
Currently, our generic ChaCha implementation consists of a permute function in lib/chacha.c that operates on the 64-byte ChaCha state directly [and which is always included into the core kernel since it is used by the /dev/random driver], and the crypto API plumbing to expose it as a skcipher. In order to support in-kernel users that need the ChaCha streamcipher but have no need [or tolerance] for going through the abstractions of the crypto API, let's expose the streamcipher bits via a library API as well, in a way that permits the implementation to be superseded by an architecture specific one if provided. So move the streamcipher code into a separate module in lib/crypto, and expose the init() and crypt() routines to users of the library. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix memory leak in xfrm_state code, from Steffen Klassert. 2) Fix races between devlink reload operations and device setup/cleanup, from Jiri Pirko. 3) Null deref in NFC code, from Stephan Gerhold. 4) Refcount fixes in SMC, from Ursula Braun. 5) Memory leak in slcan open error paths, from Jouni Hogander. 6) Fix ETS bandwidth validation in hns3, from Yonglong Liu. 7) Info leak on short USB request answers in ax88172a driver, from Oliver Neukum. 8) Release mem region properly in ep93xx_eth, from Chuhong Yuan. 9) PTP config timestamp flags validation, from Richard Cochran. 10) Dangling pointers after SKB data realloc in seg6, from Andrea Mayer. 11) Missing free_netdev() in gemini driver, from Chuhong Yuan. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (56 commits) ipmr: Fix skb headroom in ipmr_get_route(). net: hns3: cleanup of stray struct hns3_link_mode_mapping net/smc: fix fastopen for non-blocking connect() rds: ib: update WR sizes when bringing up connection net: gemini: add missed free_netdev net: dsa: tag_8021q: Fix dsa_8021q_restore_pvid for an absent pvid seg6: fix skb transport_header after decap_and_validate() seg6: fix srh pointer in get_srh() net: stmmac: Use the correct style for SPDX License Identifier octeontx2-af: Use the correct style for SPDX License Identifier ptp: Extend the test program to check the external time stamp flags. mlx5: Reject requests to enable time stamping on both edges. igb: Reject requests that fail to enable time stamping on both edges. dp83640: Reject requests to enable time stamping on both edges. mv88e6xxx: Reject requests to enable time stamping on both edges. ptp: Introduce strict checking of external time stamp options. renesas: reject unsupported external timestamp flags mlx5: reject unsupported external timestamp flags igb: reject unsupported external timestamp flags dp83640: reject unsupported external timestamp flags ...
2019-11-16page_pool: do not release pool until inflight == 0.Jonathan Lemon
The page pool keeps track of the number of pages in flight, and it isn't safe to remove the pool until all pages are returned. Disallow removing the pool until all pages are back, so the pool is always available for page producers. Make the page pool responsible for its own delayed destruction instead of relying on XDP, so the page pool can be used without the xdp memory model. When all pages are returned, free the pool and notify xdp if the pool is registered with the xdp memory system. Have the callback perform a table walk since some drivers (cpsw) may share the pool among multiple xdp_rxq_info. Note that the increment of pages_state_release_cnt may result in inflight == 0, resulting in the pool being released. Fixes: d956a048cd3f ("xdp: force mem allocator removal and periodic warning") Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16usb: typec: tcpm: Remove tcpc_config configuration mechanismHans de Goede
All configuration can and should be done through fwnodes instead of through the tcpc_config struct and there are no existing users left of struct tcpc_config, so lets remove it. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Link: https://lore.kernel.org/r/20191114111840.40876-1-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-16pinctrl/msm: Setup GPIO chip in hierarchyLina Iyer
Some GPIOs are marked as wakeup capable and are routed to another interrupt controller that is an always-domain and can detect interrupts even when most of the SoC is powered off. The wakeup interrupt controller wakes up the GIC and replays the interrupt at the GIC. Setup the TLMM irqchip in hierarchy with the wakeup interrupt controller and ensure the wakeup GPIOs are handled correctly. Co-developed-by: Maulik Shah <mkshah@codeaurora.org> Signed-off-by: Lina Iyer <ilina@codeaurora.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Link: https://lore.kernel.org/r/1573855915-9841-9-git-send-email-ilina@codeaurora.org ---- Changes in v2: - Address review comments - Fix Co-developed-by tag Changes in v1: - Address minor review comments - Remove redundant call to set irq handler - Move irq_domain_qcom_handle_wakeup() to this patch Changes in RFC v2: - Rebase on top of GPIO hierarchy support in linux-next - Set the chained irq handler for summary line
2019-11-16irqchip/qcom-pdc: Add irqdomain for wakeup capable GPIOsLina Iyer
Introduce a new domain for wakeup capable GPIOs. The domain can be requested using the bus token DOMAIN_BUS_WAKEUP. In the following patches, we will specify PDC as the wakeup-parent for the TLMM GPIO irqchip. Requesting a wakeup GPIO will setup the GPIO and the corresponding PDC interrupt as its parent. Co-developed-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Lina Iyer <ilina@codeaurora.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Link: https://lore.kernel.org/r/1573855915-9841-5-git-send-email-ilina@codeaurora.org
2019-11-16genirq: Introduce irq_chip_get/set_parent_state callsMaulik Shah
On certain QTI chipsets some GPIOs are direct-connect interrupts to the GIC to be used as regular interrupt lines. When the GPIOs are not used for interrupt generation the interrupt line is disabled. But disabling the interrupt at GIC does not prevent the interrupt to be reported as pending at GIC_ISPEND. Later, when drivers call enable_irq() on the interrupt, an unwanted interrupt occurs. Introduce get and set methods for irqchip's parent to clear it's pending irq state. This then can be invoked by the GPIO interrupt controller on the parents in it hierarchy to clear the interrupt before enabling the interrupt. Signed-off-by: Maulik Shah <mkshah@codeaurora.org> Signed-off-by: Lina Iyer <ilina@codeaurora.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Link: https://lore.kernel.org/r/1573855915-9841-7-git-send-email-ilina@codeaurora.org [updated commit text and minor code fixes]
2019-11-16irqdomain: Add bus token DOMAIN_BUS_WAKEUPLina Iyer
A single controller can handle normal interrupts and wake-up interrupts independently, with a different numbering space. It is thus crucial to allow the driver for such a controller discriminate between the two. A simple way to do so is to tag the wake-up irqdomain with a "bus token" that indicates the wake-up domain. This slightly abuses the notion of bus, but also radically simplifies the design of such a driver. Between two evils, we choose the least damaging. Suggested-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Lina Iyer <ilina@codeaurora.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Link: https://lore.kernel.org/r/1573855915-9841-2-git-send-email-ilina@codeaurora.org
2019-11-15mm/memory_hotplug: fix try_offline_node()David Hildenbrand
try_offline_node() is pretty much broken right now: - The node span is updated when onlining memory, not when adding it. We ignore memory that was mever onlined. Bad. - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily trigger a kernel panic. Bad for memory that is offline but also bad for subsection hotadd with ZONE_DEVICE, whereby the memmap of the first PFN of a section might contain garbage. - Sections belonging to mixed nodes are not properly considered. As memory blocks might belong to multiple nodes, we would have to walk all pageblocks (or at least subsections) within present sections. However, we don't have a way to identify whether a memmap that is not online was initialized (relevant for ZONE_DEVICE). This makes things more complicated. Luckily, we can piggy pack on the node span and the nid stored in memory blocks. Currently, the node span is grown when calling move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when removing memory, before calling try_offline_node(). Sysfs links are created via link_mem_sections(), e.g., during boot or when adding memory. If the node still spans memory or if any memory block belongs to the nid, we don't set the node offline. As memory blocks that span multiple nodes cannot get offlined, the nid stored in memory blocks is reliable enough (for such online memory blocks, the node still spans the memory). Introduce for_each_memory_block() to efficiently walk all memory blocks. Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span when removing ZONE_DEVICE memory to fix similar issues (access of garbage memmaps) - until we have a reliable way to identify whether these memmaps were properly initialized. This implies later, that once a node had ZONE_DEVICE memory, we won't be able to set a node offline - which should be acceptable. Since commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") memory that is added is not assoziated with a zone/node (memmap not initialized). The introducing commit 60a5a19e7419 ("memory-hotplug: remove sysfs file of node") already missed that we could have multiple nodes for a section and that the zone/node span is updated when onlining pages, not when adding them. I tested this by hotplugging two DIMMs to a memory-less and cpu-less NUMA node. The node is properly onlined when adding the DIMMs. When removing the DIMMs, the node is properly offlined. Masayoshi Mizuma reported: : Without this patch, memory hotplug fails as panic: : : BUG: kernel NULL pointer dereference, address: 0000000000000000 : ... : Call Trace: : remove_memory_block_devices+0x81/0xc0 : try_remove_memory+0xb4/0x130 : __remove_memory+0xa/0x20 : acpi_memory_device_remove+0x84/0x100 : acpi_bus_trim+0x57/0x90 : acpi_bus_trim+0x2e/0x90 : acpi_device_hotplug+0x2b2/0x4d0 : acpi_hotplug_work_fn+0x1a/0x30 : process_one_work+0x171/0x380 : worker_thread+0x49/0x3f0 : kthread+0xf8/0x130 : ret_from_fork+0x35/0x40 [david@redhat.com: v3] Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com Fixes: 60a5a19e7419 ("memory-hotplug: remove sysfs file of node") Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visiable after d0dc12e86b319 Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Nayna Jain <nayna@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15fork: extend clone3() to support setting a PIDAdrian Reber
The main motivation to add set_tid to clone3() is CRIU. To restore a process with the same PID/TID CRIU currently uses /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to ns_last_pid and then (quickly) does a clone(). This works most of the time, but it is racy. It is also slow as it requires multiple syscalls. Extending clone3() to support *set_tid makes it possible restore a process using CRIU without accessing /proc/sys/kernel/ns_last_pid and race free (as long as the desired PID/TID is available). This clone3() extension places the same restrictions (CAP_SYS_ADMIN) on clone3() with *set_tid as they are currently in place for ns_last_pid. The original version of this change was using a single value for set_tid. At the 2019 LPC, after presenting set_tid, it was, however, decided to change set_tid to an array to enable setting the PID of a process in multiple PID namespaces at the same time. If a process is created in a PID namespace it is possible to influence the PID inside and outside of the PID namespace. Details also in the corresponding selftest. To create a process with the following PIDs: PID NS level Requested PID 0 (host) 31496 1 42 2 1 For that example the two newly introduced parameters to struct clone_args (set_tid and set_tid_size) would need to be: set_tid[0] = 1; set_tid[1] = 42; set_tid[2] = 31496; set_tid_size = 3; If only the PIDs of the two innermost nested PID namespaces should be defined it would look like this: set_tid[0] = 1; set_tid[1] = 42; set_tid_size = 2; The PID of the newly created process would then be the next available free PID in the PID namespace level 0 (host) and 42 in the PID namespace at level 1 and the PID of the process in the innermost PID namespace would be 1. The set_tid array is used to specify the PID of a process starting from the innermost nested PID namespaces up to set_tid_size PID namespaces. set_tid_size cannot be larger then the current PID namespace level. Signed-off-by: Adrian Reber <areber@redhat.com> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Acked-by: Andrei Vagin <avagin@gmail.com> Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-11-15bpf: Support attaching tracing BPF program to other BPF programsAlexei Starovoitov
Allow FENTRY/FEXIT BPF programs to attach to other BPF programs of any type including their subprograms. This feature allows snooping on input and output packets in XDP, TC programs including their return values. In order to do that the verifier needs to track types not only of vmlinux, but types of other BPF programs as well. The verifier also needs to translate uapi/linux/bpf.h types used by networking programs into kernel internal BTF types used by FENTRY/FEXIT BPF programs. In some cases LLVM optimizations can remove arguments from BPF subprograms without adjusting BTF info that LLVM backend knows. When BTF info disagrees with actual types that the verifiers sees the BPF trampoline has to fallback to conservative and treat all arguments as u64. The FENTRY/FEXIT program can still attach to such subprograms, but it won't be able to recognize pointer types like 'struct sk_buff *' and it won't be able to pass them to bpf_skb_output() for dumping packets to user space. The FENTRY/FEXIT program would need to use bpf_probe_read_kernel() instead. The BPF_PROG_LOAD command is extended with attach_prog_fd field. When it's set to zero the attach_btf_id is one vmlinux BTF type ids. When attach_prog_fd points to previously loaded BPF program the attach_btf_id is BTF type id of main function or one of its subprograms. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-18-ast@kernel.org
2019-11-15bpf: Compare BTF types of functions arguments with actual typesAlexei Starovoitov
Make the verifier check that BTF types of function arguments match actual types passed into top-level BPF program and into BPF-to-BPF calls. If types match such BPF programs and sub-programs will have full support of BPF trampoline. If types mismatch the trampoline has to be conservative. It has to save/restore five program arguments and assume 64-bit scalars. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-17-ast@kernel.org
2019-11-15bpf: Annotate context typesAlexei Starovoitov
Annotate BPF program context types with program-side type and kernel-side type. This type information is used by the verifier. btf_get_prog_ctx_type() is used in the later patches to verify that BTF type of ctx in BPF program matches to kernel expected ctx type. For example, the XDP program type is: BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp, struct xdp_md, struct xdp_buff) That means that XDP program should be written as: int xdp_prog(struct xdp_md *ctx) { ... } Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-16-ast@kernel.org
2019-11-15netfilter: nf_flow_table_offload: add IPv6 supportPablo Neira Ayuso
Add nf_flow_rule_route_ipv6() and use it from the IPv6 and the inet flowtable type definitions. Rename the nf_flow_rule_route() function to nf_flow_rule_route_ipv4(). Adjust maximum number of actions, which now becomes 16 to leave sufficient room for the IPv6 address mangling for NAT. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15bpf: Fix race in btf_resolve_helper_id()Alexei Starovoitov
btf_resolve_helper_id() caching logic is a bit racy, since under root the verifier can verify several programs in parallel. Fix it with READ/WRITE_ONCE. Fix the type as well, since error is also recorded. Fixes: a7658e1a4164 ("bpf: Check types of arguments passed into helpers") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-15-ast@kernel.org
2019-11-15bpf: Introduce BPF trampolineAlexei Starovoitov
Introduce BPF trampoline concept to allow kernel code to call into BPF programs with practically zero overhead. The trampoline generation logic is architecture dependent. It's converting native calling convention into BPF calling convention. BPF ISA is 64-bit (even on 32-bit architectures). The registers R1 to R5 are used to pass arguments into BPF functions. The main BPF program accepts only single argument "ctx" in R1. Whereas CPU native calling convention is different. x86-64 is passing first 6 arguments in registers and the rest on the stack. x86-32 is passing first 3 arguments in registers. sparc64 is passing first 6 in registers. And so on. The trampolines between BPF and kernel already exist. BPF_CALL_x macros in include/linux/filter.h statically compile trampolines from BPF into kernel helpers. They convert up to five u64 arguments into kernel C pointers and integers. On 64-bit architectures this BPF_to_kernel trampolines are nops. On 32-bit architecture they're meaningful. The opposite job kernel_to_BPF trampolines is done by CAST_TO_U64 macros and __bpf_trace_##call() shim functions in include/trace/bpf_probe.h. They convert kernel function arguments into array of u64s that BPF program consumes via R1=ctx pointer. This patch set is doing the same job as __bpf_trace_##call() static trampolines, but dynamically for any kernel function. There are ~22k global kernel functions that are attachable via nop at function entry. The function arguments and types are described in BTF. The job of btf_distill_func_proto() function is to extract useful information from BTF into "function model" that architecture dependent trampoline generators will use to generate assembly code to cast kernel function arguments into array of u64s. For example the kernel function eth_type_trans has two pointers. They will be casted to u64 and stored into stack of generated trampoline. The pointer to that stack space will be passed into BPF program in R1. On x86-64 such generated trampoline will consume 16 bytes of stack and two stores of %rdi and %rsi into stack. The verifier will make sure that only two u64 are accessed read-only by BPF program. The verifier will also recognize the precise type of the pointers being accessed and will not allow typecasting of the pointer to a different type within BPF program. The tracing use case in the datacenter demonstrated that certain key kernel functions have (like tcp_retransmit_skb) have 2 or more kprobes that are always active. Other functions have both kprobe and kretprobe. So it is essential to keep both kernel code and BPF programs executing at maximum speed. Hence generated BPF trampoline is re-generated every time new program is attached or detached to maintain maximum performance. To avoid the high cost of retpoline the attached BPF programs are called directly. __bpf_prog_enter/exit() are used to support per-program execution stats. In the future this logic will be optimized further by adding support for bpf_stats_enabled_key inside generated assembly code. Introduction of preemptible and sleepable BPF programs will completely remove the need to call to __bpf_prog_enter/exit(). Detach of a BPF program from the trampoline should not fail. To avoid memory allocation in detach path the half of the page is used as a reserve and flipped after each attach/detach. 2k bytes is enough to call 40+ BPF programs directly which is enough for BPF tracing use cases. This limit can be increased in the future. BPF_TRACE_FENTRY programs have access to raw kernel function arguments while BPF_TRACE_FEXIT programs have access to kernel return value as well. Often kprobe BPF program remembers function arguments in a map while kretprobe fetches arguments from a map and analyzes them together with return value. BPF_TRACE_FEXIT accelerates this typical use case. Recursion prevention for kprobe BPF programs is done via per-cpu bpf_prog_active counter. In practice that turned out to be a mistake. It caused programs to randomly skip execution. The tracing tools missed results they were looking for. Hence BPF trampoline doesn't provide builtin recursion prevention. It's a job of BPF program itself and will be addressed in the follow up patches. BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases in the future. For example to remove retpoline cost from XDP programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org
2019-11-15bpf: Add bpf_arch_text_poke() helperAlexei Starovoitov
Add bpf_arch_text_poke() helper that is used by BPF trampoline logic to patch nops/calls in kernel text into calls into BPF trampoline and to patch calls/nops inside BPF programs too. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-4-ast@kernel.org
2019-11-15bpf: Support doubleword alignment in bpf_jit_binary_allocIlya Leoshkevich
Currently passing alignment greater than 4 to bpf_jit_binary_alloc does not work: in such cases it silently aligns only to 4 bytes. On s390, in order to load a constant from memory in a large (>512k) BPF program, one must use lgrl instruction, whose memory operand must be aligned on an 8-byte boundary. This patch makes it possible to request 8-byte alignment from bpf_jit_binary_alloc, and also makes it issue a warning when an unsupported alignment is requested. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191115123722.58462-1-iii@linux.ibm.com
2019-11-15i2c: remove helpers for ref-counting clientsWolfram Sang
There are no in-tree users of these helpers anymore, and there shouldn't. Most use cases went away once the driver model started to refcount for us. There have been users like the media subsystem, but they all switched to better refcounting methods meanwhile. Media did this in 2008. Last user (IPMI) left 2018. Remove this cruft. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Jean Delvare <jdelvare@suse.de> Tested-by: Luca Ceresoli <luca@lucaceresoli.net> Reviewed-by: Luca Ceresoli <luca@lucaceresoli.net> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
2019-11-15ptp: Introduce strict checking of external time stamp options.Richard Cochran
User space may request time stamps on rising edges, falling edges, or both. However, the particular mode may or may not be supported in the hardware or in the driver. This patch adds a "strict" flag that tells drivers to ensure that the requested mode will be honored. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15ptp: Validate requests to enable time stamping of external signals.Richard Cochran
Commit 415606588c61 ("PTP: introduce new versions of IOCTLs") introduced a new external time stamp ioctl that validates the flags. This patch extends the validation to ensure that at least one rising or falling edge flag is set when enabling external time stamps. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15net: dsa: ocelot: add tagger for Ocelot/Felix switchesVladimir Oltean
While it is entirely possible that this tagger format is in fact more generic than just these 2 switch families, I don't have that knowledge. The Seville switch in NXP T1040 has a similar frame format, but there are enough differences (e.g. DEST field starts at bit 57 instead of 56) that calling this file tag_vitesse.c is a bit of a stretch at the moment. The frame format has been listed in a comment so that people who add support for further Vitesse switches can rework this tagger while keeping compatibility with Felix. The "ocelot" name was chosen instead of "felix" because even the Ocelot switch can act as a DSA device when it is used in NPI mode, and the Felix tagger format is almost identical. Currently it is only used for the Felix switch embedded in the NXP LS1028A chip. The ABI for this tagger should be considered "not stable" at the moment. The DSA tag is always placed before the Ethernet header and therefore, we are using the long prefix for RX tags to avoid putting the DSA master port in promiscuous mode. Once there will be an API in DSA for drivers to request DSA masters to be in promiscuous mode unconditionally, we will switch to the "no prefix" extraction frame header, which will save 16 padding bytes for each RX frame. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15net: mscc: ocelot: publish ocelot_sys.h to include/soc/msccVladimir Oltean
The Felix DSA driver needs to write to SYS_RAM_INIT_RAM_INIT for its own chip initialization process. Also update the MAINTAINERS file such that the headers exported by the ocelot driver are under the same maintainers' umbrella as the driver itself. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15net: mscc: ocelot: publish structure definitions to include/soc/mscc/ocelot.hVladimir Oltean
We will be registering another switch driver based on ocelot, which lives under drivers/net/dsa. Make sure the Felix DSA front-end has the necessary abstractions to implement a new Ocelot driver instantiation. This includes the function prototypes for implementing DSA callbacks. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15net/smc: introduce bookkeeping of SMCD link groupsUrsula Braun
If the ism module is unloaded return control from exit routine only, if all link groups are freed. If an IB device is thrown away return control from device removal only, if all link groups belonging to this device are freed. A counters for the total number of SMCD link groups per ISM device is introduced. ism module unloading continues only if the total number of SMCD link groups for all ISM devices is zero. ISM device removal continues only it the total number of SMCD link groups per ISM device has decreased to zero. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15net/smc: immediate termination for SMCD link groupsUrsula Braun
SMCD link group termination is called when peer signals its shutdown of its corresponding link group. For regular shutdowns no connections exist anymore. For abnormal shutdowns connections must be killed and their DMBs must be unregistered immediately. That means the SMCR method to delay the link group freeing several seconds does not fit. This patch adds immediate termination of a link group and its SMCD connections and makes sure all SMCD link group related cleanup steps are finished. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15new helper: lookup_positive_unlocked()Al Viro
Most of the callers of lookup_one_len_unlocked() treat negatives are ERR_PTR(-ENOENT). Provide a helper that would do just that. Note that a pinned positive dentry remains positive - it's ->d_inode is stable, etc.; a pinned _negative_ dentry can become positive at any point as long as you are not holding its parent at least shared. So using lookup_one_len_unlocked() needs to be careful; lookup_positive_unlocked() is safer and that's what the callers end up open-coding anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15fs/namei.c: pull positivity check into follow_managed()Al Viro
There are 4 callers; two proceed to check if result is positive and fail with ENOENT if it isn't; one (in handle_lookup_down()) is guaranteed to yield positive and one (in lookup_fast()) is _preceded_ by positivity check. However, follow_managed() on a negative dentry is a (fairly cheap) no-op on anything other than autofs. And negative autofs dentries are never hashed, so lookup_fast() is not going to run into one of those. Moreover, successful follow_managed() on a _positive_ dentry never yields a negative one (and we significantly rely upon that in callers of lookup_fast()). In other words, we can easily transpose the positivity check and the call of follow_managed() in lookup_fast(). And that allows to fold the positivity check *into* follow_managed(), simplifying life for the code downstream of its calls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15pipe: Allow pipes to have kernel-reserved slotsDavid Howells
Split pipe->ring_size into two numbers: (1) pipe->ring_size - indicates the hard size of the pipe ring. (2) pipe->max_usage - indicates the maximum number of pipe ring slots that userspace orchestrated events can fill. This allows for a pipe that is both writable by the general kernel notification facility and by userspace, allowing plenty of ring space for notifications to be added whilst preventing userspace from being able to pin too much unswappable kernel space. Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15jbd2: make jbd2_handle_buffer_credits() handle reserved handlesJan Kara
The helper jbd2_handle_buffer_credits() doesn't correctly handle reserved handles which can lead to crashes. Fix it getting of journal pointer to work for reserved handles as well. Fixes: a9a8344ee171 ("ext4, jbd2: Provide accessor function for handle credits") Reported-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191115102210.29445-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-15y2038: itimer: change implementation to timespec64Arnd Bergmann
There is no 64-bit version of getitimer/setitimer since that is not actually needed. However, the implementation is built around the deprecated 'struct timeval' type. Change the code to use timespec64 internally to reduce the dependencies on timeval and associated helper functions. Minor adjustments in the code are needed to make the native and compat version work the same way, and to keep the range check working after the conversion. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15y2038: move itimer reset into itimer.cArnd Bergmann
Preparing for a change to the itimer internals, stop using the do_setitimer() symbol and instead use a new higher-level interface. The do_getitimer()/do_setitimer functions can now be made static, allowing the compiler to potentially produce better object code. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15y2038: itimer: compat handling to itimer.cArnd Bergmann
The structure is only used in one place, moving it there simplifies the interface and helps with later changes to this code. Rename it to match the other time32 structures in the process. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15y2038: time: avoid timespec usage in settimeofday()Arnd Bergmann
The compat_get_timeval() and timeval_valid() interfaces are deprecated and getting removed along with the definition of struct timeval itself. Change the two implementations of the settimeofday() system call to open-code these helpers and completely avoid references to timeval. The timeval_valid() call is not needed any more here, only a check to avoid overflowing tv_nsec during the multiplication, as there is another range check in do_sys_settimeofday64(). Tested-by: syzbot+dccce9b26ba09ca49966@syzkaller.appspotmail.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15y2038: elfcore: Use __kernel_old_timeval for process timesArnd Bergmann
We store elapsed time for a crashed process in struct elf_prstatus using 'timeval' structures. Once glibc starts using 64-bit time_t, this becomes incompatible with the kernel's idea of timeval since the structure layout no longer matches on 32-bit architectures. This changes the definition of the elf_prstatus structure to use __kernel_old_timeval instead, which is hardcoded to the currently used binary layout. There is no risk of overflow in y2038 though, because the time values are all relative times, and can store up to 68 years of process elapsed time. There is a risk of applications breaking at build time when they use the new kernel headers and expect the type to be exactly 'timeval' rather than a structure that has the same fields as before. Those applications have to be modified to deal with 64-bit time_t anyway. Signed-off-by: Arnd Bergmann <arnd@arndb.de>