summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-03-29mptcp: rework mptcp_sendmsg_frag to accept optional dfragPaolo Abeni
This will simplify mptcp-level retransmission implementation in the next patch. If dfrag is provided by the caller, skip kernel space memory allocation and use data and metadata provided by the dfrag itself. Because a peer could ack data at TCP level but refrain from sending mptcp-level ACKs, we could grow the mptcp socket backlog indefinitely. We should thus block mptcp_sendmsg until the peer has acked some of the sent data. In order to be able to do so, increment the mptcp socket wmem_queued counter on memory allocation and decrement it when releasing the memory on mptcp-level ack reception. Because TCP performns sndbuf auto-tuning up to tcp_wmem_max[2], make this the mptcp sk_sndbuf limit. In the future we could add experiment with autotuning as TCP does in tcp_sndbuf_expand(). v2 -> v3: - remove 'inline' in foo.c files (David S. Miller) Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: allow partial cleaning of rtx head dfragFlorian Westphal
After adding wmem accounting for the mptcp socket we could get into a situation where the mptcp socket can't transmit more data, and mptcp_clean_una doesn't reduce wmem even if snd_una has advanced because it currently will only remove entire dfrags. Allow advancing the dfrag head sequence and reduce wmem, even though this isn't correct (as we can't release the page). Because we will soon block on mptcp sk in case wmem is too large, call sk_stream_write_space() in case we reduced the backlog so userspace task blocked in sendmsg or poll will be woken up. This isn't an issue if the send buffer is large, but it is when SO_SNDBUF is used to reduce it to a lower value. Note we can still get a deadlock for low SO_SNDBUF values in case both sides of the connection write to the socket: both could be blocked due to wmem being too small -- and current mptcp stack will only increment mptcp ack_seq on recv. This doesn't happen with the selftest as it uses poll() and will always call recv if there is data to read. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: implement memory accounting for mptcp rtx queuePaolo Abeni
Charge the data on the rtx queue to the master MPTCP socket, too. Such memory in uncharged when the data is acked/dequeued. Also account mptcp sockets inuse via a protocol specific pcpu counter. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: introduce MPTCP retransmission timerPaolo Abeni
The timer will be used to schedule retransmission. It's frequency is based on the current subflow RTO estimation and is reset on every una_seq update The timer is clearer for good by __mptcp_clear_xmit() Also clean MPTCP rtx queue before each transmission. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: queue data for mptcp level retransmissionPaolo Abeni
Keep the send page fragment on an MPTCP level retransmission queue. The queue entries are allocated inside the page frag allocator, acquiring an additional reference to the page for each list entry. Also switch to a custom page frag refill function, to ensure that the current page fragment can always host an MPTCP rtx queue entry. The MPTCP rtx queue is flushed at disconnect() and close() time Note that now we need to call __mptcp_init_sock() regardless of mptcp enable status, as the destructor will try to walk the rtx_queue. v2 -> v3: - remove 'inline' in foo.c files (David S. Miller) Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: update per unacked sequence on pkt receptionPaolo Abeni
So that we keep per unacked sequence number consistent; since we update per msk data, use an atomic64 cmpxchg() to protect against concurrent updates from multiple subflows. Initialize the snd_una at connect()/accept() time. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: Implement path manager interface commandsPeter Krystad
Fill in more path manager functionality by adding a worker function and modifying the related stub functions to schedule the worker. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: Add handling of outgoing MP_JOIN requestsPeter Krystad
Subflow creation may be initiated by the path manager when the primary connection is fully established and a remote address has been received via ADD_ADDR. Create an in-kernel sock and use kernel_connect() to initiate connection. Passive sockets can't acquire the mptcp socket lock at subflow creation time, so an additional list protected by a new spinlock is used to track the MPJ subflows. Such list is spliced into conn_list tail every time the msk socket lock is acquired, so that it will not interfere with data flow on the original connection. Data flow and connection failover not addressed by this commit. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: Add handling of incoming MP_JOIN requestsPeter Krystad
Process the MP_JOIN option in a SYN packet with the same flow as MP_CAPABLE but when the third ACK is received add the subflow to the MPTCP socket subflow list instead of adding it to the TCP socket accept queue. The subflow is added at the end of the subflow list so it will not interfere with the existing subflows operation and no data is expected to be transmitted on it. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: Add path manager interfacePeter Krystad
Add enough of a path manager interface to allow sending of ADD_ADDR when an incoming MPTCP connection is created. Capable of sending only a single IPv4 ADD_ADDR option. The 'pm_data' element of the connection sock will need to be expanded to handle multiple interfaces and IPv6. Partial processing of the incoming ADD_ADDR is included so the path manager notification of that event happens at the proper time, which involves validating the incoming address information. This is a skeleton interface definition for events generated by MPTCP. Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: Add ADD_ADDR handlingPeter Krystad
Add handling for sending and receiving the ADD_ADDR, ADD_ADDR6, and RM_ADDR suboptions. Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mlx4: fix "initializer element not constant" compiler errorJacob Keller
A recent commit e8937681797c ("devlink: prepare to support region operations") used the region_cr_space_str and region_fw_health_str variables as initializers for the devlink_region_ops structures. This can result in compiler errors: drivers/net/ethernet/mellanox//mlx4/crdump.c:45:10: error: initializer element is not constant .name = region_cr_space_str, ^ drivers/net/ethernet/mellanox//mlx4/crdump.c:45:10: note: (near initialization for ‘region_cr_space_ops.name’) drivers/net/ethernet/mellanox//mlx4/crdump.c:50:10: error: initializer element is not constant .name = region_fw_health_str, The variables were made to be "const char * const", indicating that both the pointer and data were constant. This was enough to resolve this on recent GCC (gcc (GCC) 9.2.1 20190827 (Red Hat 9.2.1-1) for this author). Unfortunately this is not enough for older compilers to realize that the variable can be treated as a constant expression. Fix this by introducing macros for the string and use those instead of the variable name in the region ops structures. Reported-by: tanhuazhong <tanhuazhong@huawei.com> Fixes: e8937681797c ("devlink: prepare to support region operations") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29devlink: don't wrap commands in rST shell blocksJacob Keller
The devlink-region.rst and ice-region.rst documentation files wrapped some lines within shell code blocks due to being longer than 80 lines. It was pointed out during review that wrapping these lines shouldn't be done. Fix these two rST files and remove the line wrapping on these shell command examples. Reported-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: dsa: mt7530: use resolved link config in mac_link_up()René van Dorst
Convert the mt7530 switch driver to use the finalised link parameters in mac_link_up() rather than the parameters in mac_config(). Signed-off-by: René van Dorst <opensource@vdorst.com> Tested-by: Sean Wang <sean.wang@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: dsa: sja1105: show more ethtool statistics counters for P/Q/R/SVladimir Oltean
It looks like the P/Q/R/S series supports some more counters, generically named "Ethernet statistics counter", which we were not printing. Add them. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29sctp: fix possibly using a bad saddr with a given dstMarcelo Ricardo Leitner
Under certain circumstances, depending on the order of addresses on the interfaces, it could be that sctp_v[46]_get_dst() would return a dst with a mismatched struct flowi. For example, if when walking through the bind addresses and the first one is not a match, it saves the dst as a fallback (added in 410f03831c07), but not the flowi. Then if the next one is also not a match, the previous dst will be returned but with the flowi information for the 2nd address, which is wrong. The fix is to use a locally stored flowi that can be used for such attempts, and copy it to the parameter only in case it is a possible match, together with the corresponding dst entry. The patch updates IPv6 code mostly just to be in sync. Even though the issue is also present there, it fallback is not expected to work with IPv6. Fixes: 410f03831c07 ("sctp: add routing output fallback") Reported-by: Jin Meng <meng.a.jin@nokia-sbell.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29s390/qeth: support net namespaces for L3 devicesJulian Wiedmann
Enable the L3 driver's IPv4 address notifier to watch for events on qeth devices that have been moved into a net namespace. We need to program those IPs into the HW just as usual, otherwise inbound traffic won't flow. Fixes: 6133fb1aa137 ("[NETNS]: Disable inetaddr notifiers in namespaces other than initial.") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29sctp: fix refcount bug in sctp_wfreeQiujun Huang
We should iterate over the datamsgs to move all chunks(skbs) to newsk. The following case cause the bug: for the trouble SKB, it was in outq->transmitted list sctp_outq_sack sctp_check_transmitted SKB was moved to outq->sacked list then throw away the sack queue SKB was deleted from outq->sacked (but it was held by datamsg at sctp_datamsg_to_asoc So, sctp_wfree was not called here) then migrate happened sctp_for_each_tx_datachunk( sctp_clear_owner_w); sctp_assoc_migrate(); sctp_for_each_tx_datachunk( sctp_set_owner_w); SKB was not in the outq, and was not changed to newsk finally __sctp_outq_teardown sctp_chunk_put (for another skb) sctp_datamsg_put __kfree_skb(msg->frag_list) sctp_wfree (for SKB) SKB->sk was still oldsk (skb->sk != asoc->base.sk). Reported-and-tested-by: syzbot+cea71eec5d6de256d54d@syzkaller.appspotmail.com Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Acked-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: Fix typo of SKB_SGO_CB_OFFSETCambda Zhu
The SKB_SGO_CB_OFFSET should be SKB_GSO_CB_OFFSET which means the offset of the GSO in skb cb. This patch fixes the typo. Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation") Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ipv4: fix a RCU-list lock in fib_triestat_seq_showQian Cai
fib_triestat_seq_show() calls hlist_for_each_entry_rcu(tb, head, tb_hlist) without rcu_read_lock() will trigger a warning, net/ipv4/fib_trie.c:2579 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by proc01/115277: #0: c0000014507acf00 (&p->lock){+.+.}-{3:3}, at: seq_read+0x58/0x670 Call Trace: dump_stack+0xf4/0x164 (unreliable) lockdep_rcu_suspicious+0x140/0x164 fib_triestat_seq_show+0x750/0x880 seq_read+0x1a0/0x670 proc_reg_read+0x10c/0x1b0 __vfs_read+0x3c/0x70 vfs_read+0xac/0x170 ksys_read+0x7c/0x140 system_call+0x5c/0x68 Fix it by adding a pair of rcu_read_lock/unlock() and use cond_resched_rcu() to avoid the situation where walking of a large number of items may prevent scheduling for a long time. Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29qed: Fix race condition between scheduling and destroying the slowpath workqueueYuval Basson
Calling queue_delayed_work concurrently with destroy_workqueue might race to an unexpected outcome - scheduled task after wq is destroyed or other resources (like ptt_pool) are freed (yields NULL pointer dereference). cancel_delayed_work prevents the race by cancelling the timer triggered for scheduling a new task. Fixes: 59ccf86fe ("qed: Add driver infrastucture for handling mfw requests") Signed-off-by: Denis Bolotin <dbolotin@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Yuval Basson <ybason@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: page pool: allow to pass zero flags to page_pool_init()Denis Kirjanov
page pool API can be useful for non-DMA cases like xen-netfront driver so let's allow to pass zero flags to page pool flags. v2: check DMA direction only if PP_FLAG_DMA_MAP is set Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29selftests: move timestamping selftests to net folderJian Yang
For historical reasons, there are several timestamping selftest targets in selftests/networking/timestamping. Move them to the standard directory for networking tests: selftests/net. Signed-off-by: Jian Yang <jianyang@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ARM: dts: apalis-imx6qdl: use rgmii-id instead of rgmiiPhilippe Schenker
Until now a PHY-fixup in mach-imx set our rgmii timing correctly. For the PHY KSZ9131 there is no PHY-fixup in mach-imx. To support this PHY too, use rgmii-id. For the now used KSZ9031 nothing will change, as rgmii-id is only implemented and supported by the KSZ9131. Signed-off-by: Philippe Schenker <philippe.schenker@toradex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: phy: micrel.c: add rgmii interface delay possibility to ksz9131Philippe Schenker
The KSZ9131 provides DLL controlled delays on RXC and TXC lines. This patch makes use of those delays. The information which delays should be enabled or disabled comes from the interface names, documented in ethernet-controller.yaml: rgmii: Disable RXC and TXC delays rgmii-id: Enable RXC and TXC delays rgmii-txid: Enable only TXC delay, disable RXC delay rgmii-rxid: Enable onlx RXC delay, disable TXC delay Signed-off-by: Philippe Schenker <philippe.schenker@toradex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: macsec: add support for specifying offload upon link creationMark Starovoytov
This patch adds new netlink attribute to allow a user to (optionally) specify the desired offload mode immediately upon MACSec link creation. Separate iproute patch will be required to support this from user space. Signed-off-by: Mark Starovoytov <mstarovoitov@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mac80211: fix authentication with iwlwifi/mvmJohannes Berg
The original patch didn't copy the ieee80211_is_data() condition because on most drivers the management frames don't go through this path. However, they do on iwlwifi/mvm, so we do need to keep the condition here. Cc: stable@vger.kernel.org Fixes: ce2e1ca70307 ("mac80211: Check port authorization in the ieee80211_tx_dequeue() case") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Minor comment conflict in mac80211. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30netfilter: flowtable: add counter support in HW offloadwenxu
Store the conntrack counters to the conntrack entry in the HW flowtable offload. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30netfilter: conntrack: add nf_ct_acct_add()wenxu
Add nf_ct_acct_add function to update the conntrack counter with packets and bytes. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30netfilter: nf_tables: skip set types that do not support for expressionsPablo Neira Ayuso
The bitmap set does not support for expressions, skip it from the estimation step. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30netfilter: nft_dynset: validate set expression definitionPablo Neira Ayuso
If the global set expression definition mismatches the dynset expression, then bail out. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30netfilter: nft_set_bitmap: initialize set element extension in lookupsPablo Neira Ayuso
Otherwise, nft_lookup might dereference an uninitialized pointer to the element extension. Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30netfilter: ctnetlink: be more strict when NF_CONNTRACK_MARK is not setRomain Bellan
When CONFIG_NF_CONNTRACK_MARK is not set, any CTA_MARK or CTA_MARK_MASK in netlink message are not supported. We should return an error when one of them is set, not both Fixes: 9306425b70bf ("netfilter: ctnetlink: must check mark attributes vs NULL") Signed-off-by: Romain Bellan <romain.bellan@wifirst.fr> Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30Merge branch 'bpf-lsm'Daniel Borkmann
KP Singh says: ==================== ** Motivation Google does analysis of rich runtime security data to detect and thwart threats in real-time. Currently, this is done in custom kernel modules but we would like to replace this with something that's upstream and useful to others. The current kernel infrastructure for providing telemetry (Audit, Perf etc.) is disjoint from access enforcement (i.e. LSMs). Augmenting the information provided by audit requires kernel changes to audit, its policy language and user-space components. Furthermore, building a MAC policy based on the newly added telemetry data requires changes to various LSMs and their respective policy languages. This patchset allows BPF programs to be attached to LSM hooks This facilitates a unified and dynamic (not requiring re-compilation of the kernel) audit and MAC policy. ** Why an LSM? Linux Security Modules target security behaviours rather than the kernel's API. For example, it's easy to miss out a newly added system call for executing processes (eg. execve, execveat etc.) but the LSM framework ensures that all process executions trigger the relevant hooks irrespective of how the process was executed. Allowing users to implement LSM hooks at runtime also benefits the LSM eco-system by enabling a quick feedback loop from the security community about the kind of behaviours that the LSM Framework should be targeting. ** How does it work? The patchset introduces a new eBPF (https://docs.cilium.io/en/v1.6/bpf/) program type BPF_PROG_TYPE_LSM which can only be attached to LSM hooks. Loading and attachment of BPF programs requires CAP_SYS_ADMIN. The new LSM registers nop functions (bpf_lsm_<hook_name>) as LSM hook callbacks. Their purpose is to provide a definite point where BPF programs can be attached as BPF_TRAMP_MODIFY_RETURN trampoline programs for hooks that return an int, and BPF_TRAMP_FEXIT trampoline programs for void LSM hooks. Audit logs can be written using a format chosen by the eBPF program to the perf events buffer or to global eBPF variables or maps and can be further processed in user-space. ** BTF Based Design The current design uses BTF: * https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html * https://lwn.net/Articles/803258 which allows verifiable read-only structure accesses by field names rather than fixed offsets. This allows accessing the hook parameters using a dynamically created context which provides a certain degree of ABI stability: // Only declare the structure and fields intended to be used // in the program struct vm_area_struct { unsigned long vm_start; } __attribute__((preserve_access_index)); // Declare the eBPF program mprotect_audit which attaches to // to the file_mprotect LSM hook and accepts three arguments. SEC("lsm/file_mprotect") int BPF_PROG(mprotect_audit, struct vm_area_struct *vma, unsigned long reqprot, unsigned long prot, int ret) { unsigned long vm_start = vma->vm_start; return 0; } By relocating field offsets, BTF makes a large portion of kernel data structures readily accessible across kernel versions without requiring a large corpus of BPF helper functions and requiring recompilation with every kernel version. The BTF type information is also used by the BPF verifier to validate memory accesses within the BPF program and also prevents arbitrary writes to the kernel memory. The limitations of BTF compatibility are described in BPF Co-Re (http://vger.kernel.org/bpfconf2019_talks/bpf-core.pdf, i.e. field renames, #defines and changes to the signature of LSM hooks). This design imposes that the MAC policy (eBPF programs) be updated when the inspected kernel structures change outside of BTF compatibility guarantees. In practice, this is only required when a structure field used by a current policy is removed (or renamed) or when the used LSM hooks change. We expect the maintenance cost of these changes to be acceptable as compared to the design presented in the RFC. (https://lore.kernel.org/bpf/20190910115527.5235-1-kpsingh@chromium.org/). ** Usage Examples A simple example and some documentation is included in the patchset. In order to better illustrate the capabilities of the framework some more advanced prototype (not-ready for review) code has also been published separately: * Logging execution events (including environment variables and arguments) https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_audit_env.c * Detecting deletion of running executables: https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_detect_exec_unlink.c * Detection of writes to /proc/<pid>/mem: https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_audit_env.c We have updated Google's internal telemetry infrastructure and have started deploying this LSM on our Linux Workstations. This gives us more confidence in the real-world applications of such a system. ** Changelog: - v8 -> v9: https://lore.kernel.org/bpf/20200327192854.31150-1-kpsingh@chromium.org/ * Fixed a selftest crash when CONFIG_LSM doesn't have "bpf". * Added James' Ack. * Rebase. - v7 -> v8: https://lore.kernel.org/bpf/20200326142823.26277-1-kpsingh@chromium.org/ * Removed CAP_MAC_ADMIN check from bpf_lsm_verify_prog. LSMs can add it in their own bpf_prog hook. This can be revisited as a separate patch. * Added Andrii and James' Ack/Review tags. * Fixed an indentation issue and missing newlines in selftest error a cases. * Updated a comment as suggested by Alexei. * Updated the documentation to use the newer libbpf API and some other fixes. * Rebase - v6 -> v7: https://lore.kernel.org/bpf/20200325152629.6904-1-kpsingh@chromium.org/ * Removed __weak from the LSM attachment nops per Kees' suggestion. Will send a separate patch (if needed) to update the noinline definition in include/linux/compiler_attributes.h. * waitpid to wait specifically for the forked child in selftests. * Comment format fixes in security/... as suggested by Casey. * Added Acks from Kees and Andrii and Casey's Reviewed-by: tags to the respective patches. * Rebase - v5 -> v6: https://lore.kernel.org/bpf/20200323164415.12943-1-kpsingh@chromium.org/ * Updated LSM_HOOK macro to define a default value and cleaned up the BPF LSM hook declarations. * Added Yonghong's Acks and Kees' Reviewed-by tags. * Simplification of the selftest code. * Rebase and fixes suggested by Andrii and Yonghong and some other minor fixes noticed in internal review. - v4 -> v5: https://lore.kernel.org/bpf/20200220175250.10795-1-kpsingh@chromium.org/ * Removed static keys and special casing of BPF calls from the LSM framework. * Initialized the BPF callbacks (nops) as proper LSM hooks. * Updated to using the newly introduced BPF_TRAMP_MODIFY_RETURN trampolines in https://lkml.org/lkml/2020/3/4/877 * Addressed Andrii's feedback and rebased. - v3 -> v4: * Moved away from allocating a separate security_hook_heads and adding a new special case for arch_prepare_bpf_trampoline to using BPF fexit trampolines called from the right place in the LSM hook and toggled by static keys based on the discussion in: https://lore.kernel.org/bpf/CAG48ez25mW+_oCxgCtbiGMX07g_ph79UOJa07h=o_6B6+Q-u5g@mail.gmail.com/ * Since the code does not deal with security_hook_heads anymore, it goes from "being a BPF LSM" to "BPF program attachment to LSM hooks". * Added a new test case which ensures that the BPF programs' return value is reflected by the LSM hook. - v2 -> v3 does not change the overall design and has some minor fixes: * LSM_ORDER_LAST is introduced to represent the behaviour of the BPF LSM * Fixed the inadvertent clobbering of the LSM Hook error codes * Added GPL license requirement to the commit log * The lsm_hook_idx is now the more conventional 0-based index * Some changes were split into a separate patch ("Load btf_vmlinux only once per object") https://lore.kernel.org/bpf/20200117212825.11755-1-kpsingh@chromium.org/ * Addressed Andrii's feedback on the BTF implementation * Documentation update for using generated vmlinux.h to simplify programs * Rebase - Changes since v1: https://lore.kernel.org/bpf/20191220154208.15895-1-kpsingh@chromium.org * Eliminate the requirement to maintain LSM hooks separately in security/bpf/hooks.h Use BPF trampolines to dynamically allocate security hooks * Drop the use of securityfs as bpftool provides the required introspection capabilities. Update the tests to use the bpf_skeleton and global variables * Use O_CLOEXEC anonymous fds to represent BPF attachment in line with the other BPF programs with the possibility to use bpf program pinning in the future to provide "permanent attachment". * Drop the logic based on prog names for handling re-attachment. * Drop bpf_lsm_event_output from this series and send it as a separate patch. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2020-03-30bpf: lsm: Add DocumentationKP Singh
Document how eBPF programs (BPF_PROG_TYPE_LSM) can be loaded and attached (BPF_LSM_MAC) to the LSM hooks. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Florent Revest <revest@google.com> Reviewed-by: Thomas Garnier <thgarnie@google.com> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-9-kpsingh@chromium.org
2020-03-30bpf: lsm: Add selftests for BPF_PROG_TYPE_LSMKP Singh
* Load/attach a BPF program that hooks to file_mprotect (int) and bprm_committed_creds (void). * Perform an action that triggers the hook. * Verify if the audit event was received using the shared global variables for the process executed. * Verify if the mprotect returns a -EPERM. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Florent Revest <revest@google.com> Reviewed-by: Thomas Garnier <thgarnie@google.com> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-8-kpsingh@chromium.org
2020-03-30tools/libbpf: Add support for BPF_PROG_TYPE_LSMKP Singh
Since BPF_PROG_TYPE_LSM uses the same attaching mechanism as BPF_PROG_TYPE_TRACING, the common logic is refactored into a static function bpf_program__attach_btf_id. A new API call bpf_program__attach_lsm is still added to avoid userspace conflicts if this ever changes in the future. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Florent Revest <revest@google.com> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-7-kpsingh@chromium.org
2020-03-30bpf: lsm: Initialize the BPF LSM hooksKP Singh
* The hooks are initialized using the definitions in include/linux/lsm_hook_defs.h. * The LSM can be enabled / disabled with CONFIG_BPF_LSM. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Florent Revest <revest@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: James Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-6-kpsingh@chromium.org
2020-03-30bpf: lsm: Implement attach, detach and executionKP Singh
JITed BPF programs are dynamically attached to the LSM hooks using BPF trampolines. The trampoline prologue generates code to handle conversion of the signature of the hook to the appropriate BPF context. The allocated trampoline programs are attached to the nop functions initialized as LSM hooks. BPF_PROG_TYPE_LSM programs must have a GPL compatible license and and need CAP_SYS_ADMIN (required for loading eBPF programs). Upon attachment: * A BPF fexit trampoline is used for LSM hooks with a void return type. * A BPF fmod_ret trampoline is used for LSM hooks which return an int. The attached programs can override the return value of the bpf LSM hook to indicate a MAC Policy decision. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Florent Revest <revest@google.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: James Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-5-kpsingh@chromium.org
2020-03-30bpf: lsm: Provide attachment points for BPF LSM programsKP Singh
When CONFIG_BPF_LSM is enabled, nop functions, bpf_lsm_<hook_name>, are generated for each LSM hook. These functions are initialized as LSM hooks in a subsequent patch. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Florent Revest <revest@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: James Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-4-kpsingh@chromium.org
2020-03-30security: Refactor declaration of LSM hooksKP Singh
The information about the different types of LSM hooks is scattered in two locations i.e. union security_list_options and struct security_hook_heads. Rather than duplicating this information even further for BPF_PROG_TYPE_LSM, define all the hooks with the LSM_HOOK macro in lsm_hook_defs.h which is then used to generate all the data structures required by the LSM framework. The LSM hooks are defined as: LSM_HOOK(<return_type>, <default_value>, <hook_name>, args...) with <default_value> acccessible in security.c as: LSM_RET_DEFAULT(<hook_name>) Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Florent Revest <revest@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: James Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-3-kpsingh@chromium.org
2020-03-30bpf: Introduce BPF_PROG_TYPE_LSMKP Singh
Introduce types and configs for bpf programs that can be attached to LSM hooks. The programs can be enabled by the config option CONFIG_BPF_LSM. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Florent Revest <revest@google.com> Reviewed-by: Thomas Garnier <thgarnie@google.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: James Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-2-kpsingh@chromium.org
2020-03-30selftests: Add test for overriding global data value before loadToke Høiland-Jørgensen
This adds a test to exercise the new bpf_map__set_initial_value() function. The test simply overrides the global data section with all zeroes, and checks that the new value makes it into the kernel map on load. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200329132253.232541-2-toke@redhat.com
2020-03-30libbpf: Add setter for initial value for internal mapsToke Høiland-Jørgensen
For internal maps (most notably the maps backing global variables), libbpf uses an internal mmaped area to store the data after opening the object. This data is subsequently copied into the kernel map when the object is loaded. This adds a function to set a new value for that data, which can be used to before it is loaded into the kernel. This is especially relevant for RODATA maps, since those are frozen on load. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200329132253.232541-1-toke@redhat.com
2020-03-30bpf, net: Fix build issue when net ns not configuredDaniel Borkmann
Fix a redefinition of 'net_gen_cookie' error that was overlooked when net ns is not configured. Fixes: f318903c0bf4 ("bpf: Add netns cookie and enable it for bpf cgroup hooks") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2020-03-29Linux 5.6v5.6Linus Torvalds
2020-03-29Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge vm fixes from Andrew Morton: "5 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/sparse: fix kernel crash with pfn_section_valid check mm: fork: fix kernel_stack memcg stats for various stack implementations hugetlb_cgroup: fix illegal access to memory drivers/base/memory.c: indicate all memory blocks as removable mm/swapfile.c: move inode_lock out of claim_swapfile
2020-03-29Merge tag 'timers-urgent-2020-03-29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix for the Hyper-V clocksource driver to make sched clock actually return nanoseconds and not the virtual clock value which increments at 10e7 HZ (100ns)" * tag 'timers-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource/drivers/hyper-v: Make sched clock return nanoseconds correctly
2020-03-29Merge tag 'irq-urgent-2020-03-29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Thomas Gleixner: "A single bugfix to prevent reference leaks in irq affinity notifiers" * tag 'irq-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Fix reference leaks on irq affinity notifiers