summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2020-08-25bpf: Allow local storage to be used from LSM programsKP Singh
Adds support for both bpf_{sk, inode}_storage_{get, delete} to be used in LSM programs. These helpers are not used for tracing programs (currently) as their usage is tied to the life-cycle of the object and should only be used where the owning object won't be freed (when the owning object is passed as an argument to the LSM hook). Thus, they are safer to use in LSM hooks than tracing. Usage of local storage in tracing programs will probably follow a per function based whitelist approach. Since the UAPI helper signature for bpf_sk_storage expect a bpf_sock, it, leads to a compilation warning for LSM programs, it's also updated to accept a void * pointer instead. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200825182919.1118197-7-kpsingh@chromium.org
2020-08-25bpf: Implement bpf_local_storage for inodesKP Singh
Similar to bpf_local_storage for sockets, add local storage for inodes. The life-cycle of storage is managed with the life-cycle of the inode. i.e. the storage is destroyed along with the owning inode. The BPF LSM allocates an __rcu pointer to the bpf_local_storage in the security blob which are now stackable and can co-exist with other LSMs. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200825182919.1118197-6-kpsingh@chromium.org
2020-08-25bpf: Split bpf_local_storage to bpf_sk_storageKP Singh
A purely mechanical change: bpf_sk_storage.c = bpf_sk_storage.c + bpf_local_storage.c bpf_sk_storage.h = bpf_sk_storage.h + bpf_local_storage.h Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200825182919.1118197-5-kpsingh@chromium.org
2020-08-25bpf: Generalize bpf_sk_storageKP Singh
Refactor the functionality in bpf_sk_storage.c so that concept of storage linked to kernel objects can be extended to other objects like inode, task_struct etc. Each new local storage will still be a separate map and provide its own set of helpers. This allows for future object specific extensions and still share a lot of the underlying implementation. This includes the changes suggested by Martin in: https://lore.kernel.org/bpf/20200725013047.4006241-1-kafai@fb.com/ adding new map operations to support bpf_local_storage maps: * storages for different kernel objects to optionally have different memory charging strategy (map_local_storage_charge, map_local_storage_uncharge) * Functionality to extract the storage pointer from a pointer to the owning object (map_owner_storage_ptr) Co-developed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200825182919.1118197-4-kpsingh@chromium.org
2020-08-25bpf: Generalize caching for sk_storage.KP Singh
Provide the a ability to define local storage caches on a per-object type basis. The caches and caching indices for different objects should not be inter-mixed as suggested in: https://lore.kernel.org/bpf/20200630193441.kdwnkestulg5erii@kafai-mbp.dhcp.thefacebook.com/ "Caching a sk-storage at idx=0 of a sk should not stop an inode-storage to be cached at the same idx of a inode." Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200825182919.1118197-3-kpsingh@chromium.org
2020-08-25bpf: Renames in preparation for bpf_local_storageKP Singh
A purely mechanical change to split the renaming from the actual generalization. Flags/consts: SK_STORAGE_CREATE_FLAG_MASK BPF_LOCAL_STORAGE_CREATE_FLAG_MASK BPF_SK_STORAGE_CACHE_SIZE BPF_LOCAL_STORAGE_CACHE_SIZE MAX_VALUE_SIZE BPF_LOCAL_STORAGE_MAX_VALUE_SIZE Structs: bucket bpf_local_storage_map_bucket bpf_sk_storage_map bpf_local_storage_map bpf_sk_storage_data bpf_local_storage_data bpf_sk_storage_elem bpf_local_storage_elem bpf_sk_storage bpf_local_storage The "sk" member in bpf_local_storage is also updated to "owner" in preparation for changing the type to void * in a subsequent patch. Functions: selem_linked_to_sk selem_linked_to_storage selem_alloc bpf_selem_alloc __selem_unlink_sk bpf_selem_unlink_storage_nolock __selem_link_sk bpf_selem_link_storage_nolock selem_unlink_sk __bpf_selem_unlink_storage sk_storage_update bpf_local_storage_update __sk_storage_lookup bpf_local_storage_lookup bpf_sk_storage_map_free bpf_local_storage_map_free bpf_sk_storage_map_alloc bpf_local_storage_map_alloc bpf_sk_storage_map_alloc_check bpf_local_storage_map_alloc_check bpf_sk_storage_map_check_btf bpf_local_storage_map_check_btf Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200825182919.1118197-2-kpsingh@chromium.org
2020-08-25Merge series "ASoC: SOF: trivial code/log/comment improvements" from ↵Mark Brown
Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>: Remove useless variable initialization and allocation, adjust log levels to make support easier, and fix comments. No functional changes. Guennadi Liakhovetski (2): ASoC: SOF: topology: (cosmetic) remove redundant variable initialisations ASoC: SOF: (cosmetic) use the "bool" type where it makes sense Pierre-Louis Bossart (4): ASoC: SOF: IPC: reduce verbosity of IPC pointer updates ASoC: SOF: acpi: add dev_dbg() log for probe completion ASoC: SOF: Intel: add dev_dbg log when driver is not selected ASoC: Intel: use consistent HDAudio spelling in comments/docs Ranjani Sridharan (2): ASoC: SOF: topology: remove unnecessary memory alloc for sdev->private ASoC: SOF: topology: reduce the log level for unhandled widgets include/sound/soc-acpi.h | 2 +- sound/soc/intel/Kconfig | 2 +- sound/soc/intel/skylake/skl.c | 6 +++--- sound/soc/sof/Kconfig | 2 +- sound/soc/sof/intel/Kconfig | 2 +- sound/soc/sof/ipc.c | 16 +++++++++++----- sound/soc/sof/pcm.c | 8 ++++---- sound/soc/sof/sof-acpi-dev.c | 2 ++ sound/soc/sof/sof-pci-dev.c | 6 +++--- sound/soc/sof/sof-priv.h | 10 +++++----- sound/soc/sof/topology.c | 20 ++++---------------- 11 files changed, 36 insertions(+), 40 deletions(-) base-commit: aafdeba5cbc14cecee3797e669473b70a2b3e81e -- 2.25.1
2020-08-25ASoC: Intel: use consistent HDAudio spelling in comments/docsPierre-Louis Bossart
We use HDaudio and HDAudio, pick one to make searches easier. No functionality change Reported-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com> Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Link: https://lore.kernel.org/r/20200824200912.46852-9-pierre-louis.bossart@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2020-08-25power: supply: Support battery temperature device-tree propertiesDmitry Osipenko
The generic battery temperature properties are already supported by the power-supply core. Let's support parsing of the common battery temperature properties from a device-tree. Signed-off-by: Dmitry Osipenko <digetx@gmail.com> Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2020-08-25dt-bindings: power: supply: Add device-tree binding for Summit SMB3xxDavid Heidelberg
Summit SMB3xx series is a Programmable Switching Li+ Battery Charger. This patch adds device-tree binding for Summit SMB345, SMB347 and SMB358 chargers. Signed-off-by: David Heidelberg <david@ixit.cz> Signed-off-by: Dmitry Osipenko <digetx@gmail.com> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2020-08-25Merge tag 'v5.9-rc2' into asoc-5.10Mark Brown
Linux 5.9-rc2
2020-08-25Merge tag 'v5.9-rc2' into drm-misc-fixesMaarten Lankhorst
Backmerge requested by Tomi for a fix to omap inconsistent locking state issue, and because we need at least v5.9-rc2 now. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
2020-08-25clk: sunxi-ng: add support for the Allwinner A100 CCUYangtao Li
Add support for a100 in the sunxi-ng CCU framework. Signed-off-by: Yangtao Li <frank@allwinnertech.com> Signed-off-by: Maxime Ripard <maxime@cerno.tech> Acked-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/1eb41bf6c966a0e54820200650d27a5d4f2ac160.1595572867.git.frank@allwinnertech.com
2020-08-24rcu: Report QS for outermost PREEMPT=n rcu_read_unlock() for strict GPsPaul E. McKenney
The CONFIG_PREEMPT=n instance of rcu_read_unlock is even more aggressively than that of CONFIG_PREEMPT=y in deferring reporting quiescent states to the RCU core. This is just what is wanted in normal use because it reduces overhead, but the resulting delay is not what is wanted for kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y. This commit therefore adds an rcu_read_unlock_strict() function that checks for exceptional conditions, and reports the newly started quiescent state if it is safe to do so, also doing a spin-delay if requested via rcutree.rcu_unlock_delay. This commit also adds a call to rcu_read_unlock_strict() from the CONFIG_PREEMPT=n instance of __rcu_read_unlock(). [ paulmck: Fixed bug located by kernel test robot <lkp@intel.com> ] Reported-by Jann Horn <jannh@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu: Remove unused __rcu_is_watching() functionPaul E. McKenney
The x86/entry work removed all uses of __rcu_is_watching(), therefore this commit removes it entirely. Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <x86@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rculist: Introduce list/hlist_for_each_entry_srcu() macrosMadhuparna Bhowmik
list/hlist_for_each_entry_rcu() provides an optional cond argument to specify the lock held in the updater side. However for SRCU read side, not providing the cond argument results into false positive as whether srcu_read_lock is held or not is not checked implicitly. Therefore, on read side the lockdep expression srcu_read_lock_held(srcu struct) can solve this issue. However, the function still fails to check the cases where srcu protected list is traversed with rcu_read_lock() instead of srcu_read_lock(). Therefore, to remove the false negative, this patch introduces two new list traversal primitives : list_for_each_entry_srcu() and hlist_for_each_entry_srcu(). Both of the functions have non-optional cond argument as it is required for both read and update side, and simply checks if the cond is true. For regular read side the lockdep expression srcu_read_lock_head() can be passed as the cond argument to list/hlist_for_each_entry_srcu(). Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Suraj Upadhyay <usuraj35@gmail.com> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> [ paulmck: Add "true" per kbuild test robot feedback. ] Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu: Fix kerneldoc comments in rcupdate.hTobias Klauser
This commit fixes the kerneldoc comments for rcu_read_unlock_bh(), rcu_read_unlock_sched() and rcu_head_after_call_rcu() so they e.g. get properly linked in the API documentation. Also add parenthesis after function names to match the notation used in other kerneldoc comments in the same file. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24rcu/trace: Print negative GP numbers correctlyJoel Fernandes (Google)
GP numbers start from -300 and gp_seq numbers start of -1200 (for a shift of 2). These negative numbers are printed as unsigned long which not only takes up more text space, but is rather confusing to the reader as they have to constantly expend energy to truncate the number. Just print the negative numbering directly. Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24qed: align adjacent indentIgor Russkikh
Remove extra indent on some of adjacent declarations. Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Alexander Lobakin <alobakin@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24qed: use devlink logic to report errorsIgor Russkikh
Use devlink_health_report to push error indications. We implement this in qede via callback function to make it possible to reuse the same for other drivers sitting on top of qed in future. Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Alexander Lobakin <alobakin@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24qed: health reporter init deinit seqIgor Russkikh
Here we declare health reporter ops (empty for now) and register these in qed probe and remove callbacks. This way we get devlink attached to all kind of qed* PCI device entities: networking or storage offload entity. Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Alexander Lobakin <alobakin@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24qed/qede: make devlink survive recoveryIgor Russkikh
Devlink instance lifecycle was linked to qed_dev object, that caused devlink to be recreated on each recovery. Changing it by making higher level driver (qede) responsible for its life. This way devlink now survives recoveries. qede now stores devlink structure pointer as a part of its device object, devlink private data contains a linkage structure, qed_devlink. Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Alexander Lobakin <alobakin@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24io_uring: allow tcp ancillary data for __sys_recvmsg_sock()Luke Hsiao
For TCP tx zero-copy, the kernel notifies the process of completions by queuing completion notifications on the socket error queue. This patch allows reading these notifications via recvmsg to support TCP tx zero-copy. Ancillary data was originally disallowed due to privilege escalation via io_uring's offloading of sendmsg() onto a kernel thread with kernel credentials (https://crbug.com/project-zero/1975). So, we must ensure that the socket type is one where the ancillary data types that are delivered on recvmsg are plain data (no file descriptors or values that are translated based on the identity of the calling process). This was tested by using io_uring to call recvmsg on the MSG_ERRQUEUE with tx zero-copy enabled. Before this patch, we received -EINVALID from this specific code path. After this patch, we could read tcp tx zero-copy completion notifications from the MSG_ERRQUEUE. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Arjun Roy <arjunroy@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Luke Hsiao <lukehsiao@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24net: Get rid of consume_skb when tracing is offHerbert Xu
The function consume_skb is only meaningful when tracing is enabled. This patch makes it conditional on CONFIG_TRACEPOINTS. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24bitops, kcsan: Partially revert instrumentation for non-atomic bitopsMarco Elver
Previous to the change to distinguish read-write accesses, when CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=y is set, KCSAN would consider the non-atomic bitops as atomic. We want to partially revert to this behaviour, but with one important distinction: report racing modifications, since lost bits due to non-atomicity are certainly possible. Given the operations here only modify a single bit, assuming non-atomicity of the writer is sufficient may be reasonable for certain usage (and follows the permissible nature of the "assume plain writes atomic" rule). In other words: 1. We want non-atomic read-modify-write races to be reported; this is accomplished by kcsan_check_read(), where any concurrent write (atomic or not) will generate a report. 2. We do not want to report races with marked readers, but -do- want to report races with unmarked readers; this is accomplished by the instrument_write() ("assume atomic write" with Kconfig option set). With the above rules, when KCSAN_ASSUME_PLAIN_WRITES_ATOMIC is selected, it is hoped that KCSAN's reporting behaviour is better aligned with current expected permissible usage for non-atomic bitops. Note that, a side-effect of not telling KCSAN that the accesses are read-writes, is that this information is not displayed in the access summary in the report. It is, however, visible in inline-expanded stack traces. For now, it does not make sense to introduce yet another special case to KCSAN's runtime, only to cater to the case here. Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Axtens <dja@axtens.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24locking/atomics: Use read-write instrumentation for atomic RMWsMarco Elver
Use instrument_atomic_read_write() for atomic RMW ops. Cc: Will Deacon <will@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: <linux-arch@vger.kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24asm-generic/bitops: Use instrument_read_write() where appropriateMarco Elver
Use the new instrument_read_write() where appropriate. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24instrumented.h: Introduce read-write instrumentation hooksMarco Elver
Introduce read-write instrumentation hooks, to more precisely denote an operation's behaviour. KCSAN is able to distinguish compound instrumentation, and with the new instrumentation we then benefit from improved reporting. More importantly, read-write compound operations should not implicitly be treated as atomic, if they aren't actually atomic. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24kcsan: Support compounded read-write instrumentationMarco Elver
Add support for compounded read-write instrumentation if supported by the compiler. Adds the necessary instrumentation functions, and a new type which is used to generate a more descriptive report. Furthermore, such compounded memory access instrumentation is excluded from the "assume aligned writes up to word size are atomic" rule, because we cannot assume that the compiler emits code that is atomic for compound ops. LLVM/Clang added support for the feature in: https://github.com/llvm/llvm-project/commit/785d41a261d136b64ab6c15c5d35f2adc5ad53e3 The new instrumentation is emitted for sets of memory accesses in the same basic block to the same address with at least one read appearing before a write. These typically result from compound operations such as ++, --, +=, -=, |=, &=, etc. but also equivalent forms such as "var = var + 1". Where the compiler determines that it is equivalent to emit a call to a single __tsan_read_write instead of separate __tsan_read and __tsan_write, we can then benefit from improved performance and better reporting for such access patterns. The new reports now show that the ops are both reads and writes, for example: read-write to 0xffffffff90548a38 of 8 bytes by task 143 on cpu 3: test_kernel_rmw_array+0x45/0xa0 access_thread+0x71/0xb0 kthread+0x21e/0x240 ret_from_fork+0x22/0x30 read-write to 0xffffffff90548a38 of 8 bytes by task 144 on cpu 2: test_kernel_rmw_array+0x45/0xa0 access_thread+0x71/0xb0 kthread+0x21e/0x240 ret_from_fork+0x22/0x30 Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24tcp: bpf: Optionally store mac header in TCP_SAVE_SYNMartin KaFai Lau
This patch is adapted from Eric's patch in an earlier discussion [1]. The TCP_SAVE_SYN currently only stores the network header and tcp header. This patch allows it to optionally store the mac header also if the setsockopt's optval is 2. It requires one more bit for the "save_syn" bit field in tcp_sock. This patch achieves this by moving the syn_smc bit next to the is_mptcp. The syn_smc is currently used with the TCP experimental option. Since syn_smc is only used when CONFIG_SMC is enabled, this patch also puts the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did with "IS_ENABLED(CONFIG_MPTCP)". The mac_hdrlen is also stored in the "struct saved_syn" to allow a quick offset from the bpf prog if it chooses to start getting from the network header or the tcp header. [1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/ Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com
2020-08-24bpf: tcp: Allow bpf prog to write and parse TCP header optionMartin KaFai Lau
[ Note: The TCP changes here is mainly to implement the bpf pieces into the bpf_skops_*() functions introduced in the earlier patches. ] The earlier effort in BPF-TCP-CC allows the TCP Congestion Control algorithm to be written in BPF. It opens up opportunities to allow a faster turnaround time in testing/releasing new congestion control ideas to production environment. The same flexibility can be extended to writing TCP header option. It is not uncommon that people want to test new TCP header option to improve the TCP performance. Another use case is for data-center that has a more controlled environment and has more flexibility in putting header options for internal only use. For example, we want to test the idea in putting maximum delay ACK in TCP header option which is similar to a draft RFC proposal [1]. This patch introduces the necessary BPF API and use them in the TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse and write TCP header options. It currently supports most of the TCP packet except RST. Supported TCP header option: ─────────────────────────── This patch allows the bpf-prog to write any option kind. Different bpf-progs can write its own option by calling the new helper bpf_store_hdr_opt(). The helper will ensure there is no duplicated option in the header. By allowing bpf-prog to write any option kind, this gives a lot of flexibility to the bpf-prog. Different bpf-prog can write its own option kind. It could also allow the bpf-prog to support a recently standardized option on an older kernel. Sockops Callback Flags: ────────────────────── The bpf program will only be called to parse/write tcp header option if the following newly added callback flags are enabled in tp->bpf_sock_ops_cb_flags: BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG A few words on the PARSE CB flags. When the above PARSE CB flags are turned on, the bpf-prog will be called on packets received at a sk that has at least reached the ESTABLISHED state. The parsing of the SYN-SYNACK-ACK will be discussed in the "3 Way HandShake" section. The default is off for all of the above new CB flags, i.e. the bpf prog will not be called to parse or write bpf hdr option. There are details comment on these new cb flags in the UAPI bpf.h. sock_ops->skb_data and bpf_load_hdr_opt() ───────────────────────────────────────── sock_ops->skb_data and sock_ops->skb_data_end covers the whole TCP header and its options. They are read only. The new bpf_load_hdr_opt() helps to read a particular option "kind" from the skb_data. Please refer to the comment in UAPI bpf.h. It has details on what skb_data contains under different sock_ops->op. 3 Way HandShake ─────────────── The bpf-prog can learn if it is sending SYN or SYNACK by reading the sock_ops->skb_tcp_flags. * Passive side When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB), the received SYN skb will be available to the bpf prog. The bpf prog can use the SYN skb (which may carry the header option sent from the remote bpf prog) to decide what bpf header option should be written to the outgoing SYNACK skb. The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*). More on this later. Also, the bpf prog can learn if it is in syncookie mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE). The bpf prog can store the received SYN pkt by using the existing bpf_setsockopt(TCP_SAVE_SYN). The example in a later patch does it. [ Note that the fullsock here is a listen sk, bpf_sk_storage is not very useful here since the listen sk will be shared by many concurrent connection requests. Extending bpf_sk_storage support to request_sock will add weight to the minisock and it is not necessary better than storing the whole ~100 bytes SYN pkt. ] When the connection is established, the bpf prog will be called in the existing PASSIVE_ESTABLISHED_CB callback. At that time, the bpf prog can get the header option from the saved syn and then apply the needed operation to the newly established socket. The later patch will use the max delay ack specified in the SYN header and set the RTO of this newly established connection as an example. The received ACK (that concludes the 3WHS) will also be available to the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data. It could be useful in syncookie scenario. More on this later. There is an existing getsockopt "TCP_SAVED_SYN" to return the whole saved syn pkt which includes the IP[46] header and the TCP header. A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to start getting from, e.g. starting from TCP header, or from IP[46] header. The new getsockopt(TCP_BPF_SYN*) will also know where it can get the SYN's packet from: - (a) the just received syn (available when the bpf prog is writing SYNACK) and it is the only way to get SYN during syncookie mode. or - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other existing CB). The bpf prog does not need to know where the SYN pkt is coming from. The getsockopt(TCP_BPF_SYN*) will hide this details. Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to bpf_load_hdr_opt() to read a particular header option from the SYN packet. * Fastopen Fastopen should work the same as the regular non fastopen case. This is a test in a later patch. * Syncookie For syncookie, the later example patch asks the active side's bpf prog to resend the header options in ACK. The server can use bpf_load_hdr_opt() to look at the options in this received ACK during PASSIVE_ESTABLISHED_CB. * Active side The bpf prog will get a chance to write the bpf header option in the SYN packet during WRITE_HDR_OPT_CB. The received SYNACK pkt will also be available to the bpf prog during the existing ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data and bpf_load_hdr_opt(). * Turn off header CB flags after 3WHS If the bpf prog does not need to write/parse header options beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags to avoid being called for header options. Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on so that the kernel will only call it when there is option that the kernel cannot handle. [1]: draft-wang-tcpm-low-latency-opt-00 https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
2020-08-24bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8Martin KaFai Lau
A later patch needs to add a few pointers and a few u8 to sock_ops_kern. Hence, this patch saves some spaces by moving some of the existing members from u32 to u8 so that the later patch can still fit everything in a cacheline. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200820190058.2885640-1-kafai@fb.com
2020-08-24bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt()Martin KaFai Lau
The bpf prog needs to parse the SYN header to learn what options have been sent by the peer's bpf-prog before writing its options into SYNACK. This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack(). This syn_skb will eventually be made available (as read-only) to the bpf prog. This will be the only SYN packet available to the bpf prog during syncookie. For other regular cases, the bpf prog can also use the saved_syn. When writing options, the bpf prog will first be called to tell the kernel its required number of bytes. It is done by the new bpf_skops_hdr_opt_len(). The bpf prog will only be called when the new BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags. When the bpf prog returns, the kernel will know how many bytes are needed and then update the "*remaining" arg accordingly. 4 byte alignment will be included in the "*remaining" before this function returns. The 4 byte aligned number of bytes will also be stored into the opts->bpf_opt_len. "bpf_opt_len" is a newly added member to the struct tcp_out_options. Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the header options. The bpf prog is only called if it has reserved spaces before (opts->bpf_opt_len > 0). The bpf prog is the last one getting a chance to reserve header space and writing the header option. These two functions are half implemented to highlight the changes in TCP stack. The actual codes preparing the bpf running context and invoking the bpf prog will be added in the later patch with other necessary bpf pieces. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com
2020-08-24bpf: tcp: Add bpf_skops_parse_hdr()Martin KaFai Lau
The patch adds a function bpf_skops_parse_hdr(). It will call the bpf prog to parse the TCP header received at a tcp_sock that has at least reached the ESTABLISHED state. For the packets received during the 3WHS (SYN, SYNACK and ACK), the received skb will be available to the bpf prog during the callback in bpf_skops_established() introduced in the previous patch and in the bpf_skops_write_hdr_opt() that will be added in the next patch. Calling bpf prog to parse header is controlled by two new flags in tp->bpf_sock_ops_cb_flags: BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG and BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG. When BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG is set, the bpf prog will only be called when there is unknown option in the TCP header. When BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG is set, the bpf prog will be called on all received TCP header. This function is half implemented to highlight the changes in TCP stack. The actual codes preparing the bpf running context and invoking the bpf prog will be added in the later patch with other necessary bpf pieces. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20200820190046.2885054-1-kafai@fb.com
2020-08-24bpf: tcp: Add bpf_skops_established()Martin KaFai Lau
In tcp_init_transfer(), it currently calls the bpf prog to give it a chance to handle the just "ESTABLISHED" event (e.g. do setsockopt on the newly established sk). Right now, it is done by calling the general purpose tcp_call_bpf(). In the later patch, it also needs to pass the just-received skb which concludes the 3 way handshake. E.g. the SYNACK received at the active side. The bpf prog can then learn some specific header options written by the peer's bpf-prog and potentially do setsockopt on the newly established sk. Thus, instead of reusing the general purpose tcp_call_bpf(), a new function bpf_skops_established() is added to allow passing the "skb" to the bpf prog. The actual skb passing from bpf_skops_established() to the bpf prog will happen together in a later patch which has the necessary bpf pieces. A "skb" arg is also added to tcp_init_transfer() such that it can then be passed to bpf_skops_established(). Calling the new bpf_skops_established() instead of tcp_call_bpf() should be a noop in this patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200820190039.2884750-1-kafai@fb.com
2020-08-24tcp: Add saw_unknown to struct tcp_options_receivedMartin KaFai Lau
In a later patch, the bpf prog only wants to be called to handle a header option if that particular header option cannot be handled by the kernel. This unknown option could be written by the peer's bpf-prog. It could also be a new standard option that the running kernel does not support it while a bpf-prog can handle it. This patch adds a "saw_unknown" bit to "struct tcp_options_received" and it uses an existing one byte hole to do that. "saw_unknown" will be set in tcp_parse_options() if it sees an option that the kernel cannot handle. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200820190033.2884430-1-kafai@fb.com
2020-08-24tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockoptMartin KaFai Lau
This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog to set the min rto of a connection. It could be used together with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX). A later selftest patch will communicate the max delay ack in a bpf tcp header option and then the receiving side can use bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200820190027.2884170-1-kafai@fb.com
2020-08-24tcp: bpf: Add TCP_BPF_DELACK_MAX setsockoptMartin KaFai Lau
This change is mostly from an internal patch and adapts it from sysctl config to the bpf_setsockopt setup. The bpf_prog can set the max delay ack by using bpf_setsockopt(TCP_BPF_DELACK_MAX). This max delay ack can be communicated to its peer through bpf header option. The receiving peer can then use this max delay ack and set a potentially lower rto by using bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced in the next patch. Another later selftest patch will also use it like the above to show how to write and parse bpf tcp header option. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com
2020-08-24tcp: Use a struct to represent a saved_synMartin KaFai Lau
The TCP_SAVE_SYN has both the network header and tcp header. The total length of the saved syn packet is currently stored in the first 4 bytes (u32) of an array and the actual packet data is stored after that. A later patch will add a bpf helper that allows to get the tcp header alone from the saved syn without the network header. It will be more convenient to have a direct offset to a specific header instead of re-parsing it. This requires to separately store the network hdrlen. The total header length (i.e. network + tcp) is still needed for the current usage in getsockopt. Although this total length can be obtained by looking into the tcphdr and then get the (th->doff << 2), this patch chooses to directly store the tcp hdrlen in the second four bytes of this newly created "struct saved_syn". By using a new struct, it can give a readable name to each individual header length. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200820190014.2883694-1-kafai@fb.com
2020-08-24RDMA/core: Move the rdma_show_ib_cm_event() macroChuck Lever
Refactor: Make it globally available in the utilities header. Link: https://lore.kernel.org/r/159767239131.2968.9520990257041764685.stgit@klimt.1015granger.net Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-08-25ocxl: Don't return trigger page when allocating an interruptFrederic Barrat
Existing users of ocxl_link_irq_alloc() have been converted to obtain the trigger page of an interrupt through xive directly, we therefore have no need to return the trigger page when allocating an interrupt. It also allows ocxl to use the xive native interface to allocate interrupts, instead of its custom service. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Acked-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200403153838.29224-4-fbarrat@linux.ibm.com
2020-08-24ipv6: ndisc: adjust ndisc_ifinfo_sysctl_change prototypeTobias Klauser
Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") changed ndisc_ifinfo_sysctl_change to take a kernel pointer. Adjust its prototype in net/ndisc.h as well to fix the following sparse warning: net/ipv6/ndisc.c:1838:5: error: symbol 'ndisc_ifinfo_sysctl_change' redeclared with different type (incompatible argument 3 (different address spaces)): net/ipv6/ndisc.c:1838:5: int extern [addressable] [signed] [toplevel] ndisc_ifinfo_sysctl_change( ... ) net/ipv6/ndisc.c: note: in included file (through include/net/ipv6.h): ./include/net/ndisc.h:496:5: note: previously declared as: ./include/net/ndisc.h:496:5: int extern [addressable] [signed] [toplevel] ndisc_ifinfo_sysctl_change( ... ) net/ipv6/ndisc.c: note: in included file (through include/net/ip6_route.h): Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Don't flag SCTP heartbeat as invalid for re-used connections, from Florian Westphal. 2) Bogus overlap report due to rbtree tree rotations, from Stefano Brivio. 3) Detect partial overlap with start end point match, also from Stefano. 4) Skip netlink dump of NFTA_SET_USERDATA is unset. 5) Incorrect nft_list_attributes enumeration definition. 6) Missing zeroing before memcpy to destination register, also from Florian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24libceph: add __maybe_unused to DEFINE_CEPH_FEATUREIlya Dryomov
Avoid -Wunused-const-variable warnings for "make W=1". Reported-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
2020-08-24drm/ttm: drop bus.size from bus placement.Dave Airlie
This is always calculated the same, and only used in a couple of places. Signed-off-by: Dave Airlie <airlied@redhat.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200811074658.58309-2-airlied@gmail.com
2020-08-24soc: ti: pm33xx: Simplify RTC usage to prepare to drop platform dataTony Lindgren
We must re-enable the RTC module clock enabled in RTC+DDR suspend, and pm33xx has been using platform data callbacks for that. Looks like for retention suspend the RTC module clock must not be re-enabled. To remove the legacy platform data callbacks, and eventually be able to drop the RTC legacy platform data, let's manage the RTC module clock and register range directly in pm33xx. Acked-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Tony Lindgren <tony@atomide.com>
2020-08-24Revert "powerpc/64s: Remove PROT_SAO support"Shawn Anastasio
This reverts commit 5c9fa16e8abd342ce04dc830c1ebb2a03abf6c05. Since PROT_SAO can still be useful for certain classes of software, reintroduce it. Concerns about guest migration for LPARs using SAO will be addressed next. Signed-off-by: Shawn Anastasio <shawn@anastas.io> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200821185558.35561-2-shawn@anastas.io
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
2020-08-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: "Nothing earth shattering here, lots of small fixes (f.e. missing RCU protection, bad ref counting, missing memset(), etc.) all over the place: 1) Use get_file_rcu() in task_file iterator, from Yonghong Song. 2) There are two ways to set remote source MAC addresses in macvlan driver, but only one of which validates things properly. Fix this. From Alvin Šipraga. 3) Missing of_node_put() in gianfar probing, from Sumera Priyadarsini. 4) Preserve device wanted feature bits across multiple netlink ethtool requests, from Maxim Mikityanskiy. 5) Fix rcu_sched stall in task and task_file bpf iterators, from Yonghong Song. 6) Avoid reset after device destroy in ena driver, from Shay Agroskin. 7) Missing memset() in netlink policy export reallocation path, from Johannes Berg. 8) Fix info leak in __smc_diag_dump(), from Peilin Ye. 9) Decapsulate ECN properly for ipv6 in ipv4 tunnels, from Mark Tomlinson. 10) Fix number of data stream negotiation in SCTP, from David Laight. 11) Fix double free in connection tracker action module, from Alaa Hleihel. 12) Don't allow empty NHA_GROUP attributes, from Nikolay Aleksandrov" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (46 commits) net: nexthop: don't allow empty NHA_GROUP bpf: Fix two typos in uapi/linux/bpf.h net: dsa: b53: check for timeout tipc: call rcu_read_lock() in tipc_aead_encrypt_done() net/sched: act_ct: Fix skb double-free in tcf_ct_handle_fragments() error flow net: sctp: Fix negotiation of the number of data streams. dt-bindings: net: renesas, ether: Improve schema validation gre6: Fix reception with IP6_TNL_F_RCV_DSCP_COPY hv_netvsc: Fix the queue_mapping in netvsc_vf_xmit() hv_netvsc: Remove "unlikely" from netvsc_select_queue bpf: selftests: global_funcs: Check err_str before strstr bpf: xdp: Fix XDP mode when no mode flags specified selftests/bpf: Remove test_align leftovers tools/resolve_btfids: Fix sections with wrong alignment net/smc: Prevent kernel-infoleak in __smc_diag_dump() sfc: fix build warnings on 32-bit net: phy: mscc: Fix a couple of spelling mistakes "spcified" -> "specified" libbpf: Fix map index used in error message net: gemini: Fix missing free_netdev() in error path of gemini_ethernet_port_probe() net: atlantic: Use readx_poll_timeout() for large timeout ...