summaryrefslogtreecommitdiff
path: root/include/uapi/linux
AgeCommit message (Collapse)Author
2018-06-04Merge branch 'work.aio-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull aio updates from Al Viro: "Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly. The only thing I'm holding back for a day or so is Adam's aio ioprio - his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case), but let it sit in -next for decency sake..." * 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) aio: sanitize the limit checking in io_submit(2) aio: fold do_io_submit() into callers aio: shift copyin of iocb into io_submit_one() aio_read_events_ring(): make a bit more readable aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way aio: take list removal to (some) callers of aio_complete() aio: add missing break for the IOCB_CMD_FDSYNC case random: convert to ->poll_mask timerfd: convert to ->poll_mask eventfd: switch to ->poll_mask pipe: convert to ->poll_mask crypto: af_alg: convert to ->poll_mask net/rxrpc: convert to ->poll_mask net/iucv: convert to ->poll_mask net/phonet: convert to ->poll_mask net/nfc: convert to ->poll_mask net/caif: convert to ->poll_mask net/bluetooth: convert to ->poll_mask net/sctp: convert to ->poll_mask net/tipc: convert to ->poll_mask ...
2018-06-04xsk: new descriptor addressing schemeBjörn Töpel
Currently, AF_XDP only supports a fixed frame-size memory scheme where each frame is referenced via an index (idx). A user passes the frame index to the kernel, and the kernel acts upon the data. Some NICs, however, do not have a fixed frame-size model, instead they have a model where a memory window is passed to the hardware and multiple frames are filled into that window (referred to as the "type-writer" model). By changing the descriptor format from the current frame index addressing scheme, AF_XDP can in the future be extended to support these kinds of NICs. In the index-based model, an idx refers to a frame of size frame_size. Addressing a frame in the UMEM is done by offseting the UMEM starting address by a global offset, idx * frame_size + offset. Communicating via the fill- and completion-rings are done by means of idx. In this commit, the idx is removed in favor of an address (addr), which is a relative address ranging over the UMEM. To convert an idx-based address to the new addr is simply: addr = idx * frame_size + offset. We also stop referring to the UMEM "frame" as a frame. Instead it is simply called a chunk. To transfer ownership of a chunk to the kernel, the addr of the chunk is passed in the fill-ring. Note, that the kernel will mask addr to make it chunk aligned, so there is no need for userspace to do that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050 or 3000 to the fill-ring will refer to the same chunk. On the completion-ring, the addr will match that of the Tx descriptor, passed to the kernel. Changing the descriptor format to use chunks/addr will allow for future changes to move to a type-writer based model, where multiple frames can reside in one chunk. In this model passing one single chunk into the fill-ring, would potentially result in multiple Rx descriptors. This commit changes the uapi of AF_XDP sockets, and updates the documentation. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-03bpf: flowlabel in bpf_fib_lookup should be flowinfoDavid Ahern
As Michal noted the flow struct takes both the flow label and priority. Update the bpf_fib_lookup API to note that it is flowinfo and not just the flow label. Cc: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03bpf: implement bpf_get_current_cgroup_id() helperYonghong Song
bpf has been used extensively for tracing. For example, bcc contains an almost full set of bpf-based tools to trace kernel and user functions/events. Most tracing tools are currently either filtered based on pid or system-wide. Containers have been used quite extensively in industry and cgroup is often used together to provide resource isolation and protection. Several processes may run inside the same container. It is often desirable to get container-level tracing results as well, e.g. syscall count, function count, I/O activity, etc. This patch implements a new helper, bpf_get_current_cgroup_id(), which will return cgroup id based on the cgroup within which the current task is running. The later patch will provide an example to show that userspace can get the same cgroup id so it could configure a filter or policy in the bpf program based on task cgroup id. The helper is currently implemented for tracing. It can be added to other program types as well when needed. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03rpmsg: char: Switch to SPDX license identifierSuman Anna
Use the appropriate SPDX license identifier in the rpmsg char driver source file and drop the previous boilerplate license text. The uapi header file already had the SPDX license identifier added as part of a mass update but the license text removal was deferred for later, and this patch drops the same. Signed-off-by: Suman Anna <s-anna@ti.com> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
2018-06-03bpf: make sure to clear unused fields in tunnel/xfrm state fetchDaniel Borkmann
Since the remaining bits are not filled in struct bpf_tunnel_key resp. struct bpf_xfrm_state and originate from uninitialized stack space, we should make sure to clear them before handing control back to the program. Also add a padding element to struct bpf_xfrm_state for future use similar as we have in struct bpf_tunnel_key and clear it as well. struct bpf_xfrm_state { __u32 reqid; /* 0 4 */ __u32 spi; /* 4 4 */ __u16 family; /* 8 2 */ /* XXX 2 bytes hole, try to pack */ union { __u32 remote_ipv4; /* 4 */ __u32 remote_ipv6[4]; /* 16 */ }; /* 12 16 */ /* size: 28, cachelines: 1, members: 4 */ /* sum members: 26, holes: 1, sum holes: 2 */ /* last cacheline: 28 bytes */ }; Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03bpf: add bpf_skb_cgroup_id helperDaniel Borkmann
Add a new bpf_skb_cgroup_id() helper that allows to retrieve the cgroup id from the skb's socket. This is useful in particular to enable bpf_get_cgroup_classid()-like behavior for cgroup v1 in cgroup v2 by allowing ID based matching on egress. This can in particular be used in combination with applying policy e.g. from map lookups, and also complements the older bpf_skb_under_cgroup() interface. In user space the cgroup id for a given path can be retrieved through the f_handle as demonstrated in [0] recently. [0] https://lkml.org/lkml/2018/5/22/1190 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Filling in the padding slot in the bpf structure as a bug fix in 'ne' overlapped with actually using that padding area for something in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-02PCI/DPC: Disable ERR_NONFATAL handling by DPCOza Pawandeep
PCIe ERR_NONFATAL errors mean a particular transaction is unreliable but the Link is otherwise fully functional (PCIe r4.0, sec 6.2.2). The AER driver handles these by logging the error details and calling driver-supplied pci_error_handlers callbacks. It does not reset downstream devices, does not remove them from the PCI subsystem, does not re-enumerate them, and does not call their driver .remove() or .probe() methods. But DPC driver previously enabled DPC on ERR_NONFATAL, so if the hardware supports DPC, these errors caused a Link reset (performed automatically by the hardware), followed by the DPC driver removing affected devices (which calls their .remove() methods), bringing the Link back up, and re-enumerating (which calls driver .probe() methods). Disable ERR_NONFATAL DPC triggering so these errors will only be handled by AER. This means drivers won't have to deal with different usage of their pci_error_handlers callbacks and .probe() and .remove() methods based on whether the platform has DPC support. Signed-off-by: Oza Pawandeep <poza@codeaurora.org> [bhelgaas: changelog] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-06-03netfilter: nf_tables: add connlimit supportPablo Neira Ayuso
This features which allows you to limit the maximum number of connections per arbitrary key. The connlimit expression is stateful, therefore it can be used from meters to dynamically populate a set, this provides a mapping to the iptables' connlimit match. This patch also comes that allows you define static connlimit policies. This extension depends on the nf_conncount infrastructure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree, the most relevant things in this batch are: 1) Compile masquerade infrastructure into NAT module, from Florian Westphal. Same thing with the redirection support. 2) Abort transaction if early initialization of the commit phase fails. Also from Florian. 3) Get rid of synchronize_rcu() by using rule array in nf_tables, from Florian. 4) Abort nf_tables batch if fatal signal is pending, from Florian. 5) Use .call_rcu nfnetlink from nf_tables to make dumps fully lockless. From Florian Westphal. 6) Support to match transparent sockets from nf_tables, from Máté Eckl. 7) Audit support for nf_tables, from Phil Sutter. 8) Validate chain dependencies from commit phase, fall back to fine grain validation only in case of errors. 9) Attach dst to skbuff from netfilter flowtable packet path, from Jason A. Donenfeld. 10) Use artificial maximum attribute cap to remove VLA from nfnetlink. Patch from Kees Cook. 11) Add extension to allow to forward packets through neighbour layer. 12) Add IPv6 conntrack helper support to IPVS, from Julian Anastasov. 13) Add IPv6 FTP conntrack support to IPVS, from Julian Anastasov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-01bpf: fix uapi hole for 32 bit compat applicationsDaniel Borkmann
In 64 bit, we have a 4 byte hole between ifindex and netns_dev in the case of struct bpf_map_info but also struct bpf_prog_info. In net-next commit b85fab0e67b ("bpf: Add gpl_compatible flag to struct bpf_prog_info") added a bitfield into it to expose some flags related to programs. Thus, add an unnamed __u32 bitfield for both so that alignment keeps the same in both 32 and 64 bit cases, and can be naturally extended from there as in b85fab0e67b. Before: # file test.o test.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped # pahole test.o struct bpf_map_info { __u32 type; /* 0 4 */ __u32 id; /* 4 4 */ __u32 key_size; /* 8 4 */ __u32 value_size; /* 12 4 */ __u32 max_entries; /* 16 4 */ __u32 map_flags; /* 20 4 */ char name[16]; /* 24 16 */ __u32 ifindex; /* 40 4 */ __u64 netns_dev; /* 44 8 */ __u64 netns_ino; /* 52 8 */ /* size: 64, cachelines: 1, members: 10 */ /* padding: 4 */ }; After (same as on 64 bit): # file test.o test.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped # pahole test.o struct bpf_map_info { __u32 type; /* 0 4 */ __u32 id; /* 4 4 */ __u32 key_size; /* 8 4 */ __u32 value_size; /* 12 4 */ __u32 max_entries; /* 16 4 */ __u32 map_flags; /* 20 4 */ char name[16]; /* 24 16 */ __u32 ifindex; /* 40 4 */ /* XXX 4 bytes hole, try to pack */ __u64 netns_dev; /* 48 8 */ __u64 netns_ino; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ /* size: 64, cachelines: 1, members: 10 */ /* sum members: 60, holes: 1, sum holes: 4 */ }; Reported-by: Dmitry V. Levin <ldv@altlinux.org> Reported-by: Eugene Syromiatnikov <esyr@redhat.com> Fixes: 52775b33bb507 ("bpf: offload: report device information about offloaded maps") Fixes: 675fc275a3a2d ("bpf: offload: report device information for offloaded programs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-01netfilter: nft_fwd_netdev: allow to forward packets via neighbour layerPablo Neira Ayuso
This allows us to forward packets from the netdev family via neighbour layer, so you don't need an explicit link-layer destination when using this expression from rules. The ttl/hop_limit field is decremented. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-01netfilter: nf_tables: Add audit support to log statementPhil Sutter
This extends log statement to support the behaviour achieved with AUDIT target in iptables. Audit logging is enabled via a pseudo log level 8. In this case any other settings like log prefix are ignored since audit log format is fixed. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-01netfilter: nf_tables: add support for native socket matchingMáté Eckl
Now it can only match the transparent flag of an ip/ipv6 socket. Signed-off-by: Máté Eckl <ecklm94@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-31rtnetlink: Add more well known protocol valuesDonald Sharp
FRRouting installs routes into the kernel associated with the originating protocol. Add these values to the well known values in rtnetlink.h. Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-31fs: Add aio iopriority supportAdam Manzanares
This is the per-I/O equivalent of the ioprio_set system call. When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field. This patch depends on block: add ioprio_check_cap function. Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-31btrfs: Add unprivileged version of ino_lookup ioctlTomohiro Misono
Add unprivileged version of ino_lookup ioctl BTRFS_IOC_INO_LOOKUP_USER to allow normal users to call "btrfs subvolume list/show" etc. in combination with BTRFS_IOC_GET_SUBVOL_INFO/BTRFS_IOC_GET_SUBVOL_ROOTREF. This can be used like BTRFS_IOC_INO_LOOKUP but the argument is different. This is because it always searches the fs/file tree correspoinding to the fd with which this ioctl is called and also returns the name of bottom subvolume. The main differences from original ino_lookup ioctl are: 1. Read + Exec permission will be checked using inode_permission() during path construction. -EACCES will be returned in case of failure. 2. Path construction will be stopped at the inode number which corresponds to the fd with which this ioctl is called. If constructed path does not exist under fd's inode, -EACCES will be returned. 3. The name of bottom subvolume is also searched and filled. Note that the maximum length of path is shorter 256 (BTRFS_VOL_NAME_MAX+1) bytes than ino_lookup ioctl because of space of subvolume's name. Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ style fixes ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-31btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REFTomohiro Misono
Add unprivileged ioctl BTRFS_IOC_GET_SUBVOL_ROOTREF which returns ROOT_REF information of the subvolume containing this inode except the subvolume name (this is because to prevent potential name leak). The subvolume name will be gained by user version of ino_lookup ioctl (BTRFS_IOC_INO_LOOKUP_USER) which also performs permission check. The min id of root ref's subvolume to be searched is specified by @min_id in struct btrfs_ioctl_get_subvol_rootref_args. After the search ends, @min_id is set to the last searched root ref's subvolid + 1. Also, if there are more root refs than BTRFS_MAX_ROOTREF_BUFFER_NUM, -EOVERFLOW is returned. Therefore the caller can just call this ioctl again without changing the argument to continue search. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ style fixes and struct item renames ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-31btrfs: Add unprivileged ioctl which returns subvolume informationTomohiro Misono
Add new unprivileged ioctl BTRFS_IOC_GET_SUBVOL_INFO which returns the information of subvolume containing this inode. (i.e. returns the information in ROOT_ITEM and ROOT_BACKREF.) Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ minor style fixes, update struct comments ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-31crypto: ccp - Add GET_ID SEV commandJanakarajan Natarajan
The GET_ID command, added as of SEV API v0.16, allows the SEV firmware to be queried about a unique CPU ID. This unique ID can then be used to obtain the public certificate containing the Chip Endorsement Key (CEK) public key signed by the AMD SEV Signing Key (ASK). For more information please refer to "Section 5.12 GET_ID" of https://support.amd.com/TechDocs/55766_SEV-KM%20API_Specification.pdf Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-05-30media: rc: introduce BPF_PROG_LIRC_MODE2Sean Young
Add support for BPF_PROG_LIRC_MODE2. This type of BPF program can call rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report that the last key should be repeated. The bpf program can be attached to using the bpf(BPF_PROG_ATTACH) syscall; the target_fd must be the /dev/lircN device. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-29bpf: Drop mpls from bpf_fib_lookupDavid Ahern
MPLS support will not be submitted this dev cycle, but in working on it I do see a few changes are needed to the API. For now, drop mpls from the API. Since the fields in question are unions, the mpls fields can be added back later without affecting the uapi. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-29bpf: clean up eBPF helpers documentationQuentin Monnet
These are minor edits for the eBPF helpers documentation in include/uapi/linux/bpf.h. The main fix consists in removing "BPF_FIB_LOOKUP_", because it ends with a non-escaped underscore that gets interpreted by rst2man and produces the following message in the resulting manual page: DOCUTILS SYSTEM MESSAGES System Message: ERROR/3 (/tmp/bpf-helpers.rst:, line 1514) Unknown target name: "bpf_fib_lookup". Other edits consist in: - Improving formatting for flag values for "bpf_fib_lookup()" helper. - Emphasising a parameter name in description of the return value for "bpf_get_stack()" helper. - Removing unnecessary blank lines between "Description" and "Return" sections for the few helpers that would use it, for consistency. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-29net: Add IFA_RT_PRIORITY address attributeDavid Ahern
Currently, if two interfaces have addresses in the same connected route, then the order of the prefix route entries is determined by the order in which the addresses are assigned or the links brought up. Add IFA_RT_PRIORITY to allow user to specify the metric of the prefix route associated with an address giving interface managers better control of the order of prefix routes. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-28virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bitSridhar Samudrala
This feature bit can be used by hypervisor to indicate virtio_net device to act as a standby for another device with the same MAC address. VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit. Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-28bpf: Hooks for sys_sendmsgAndrey Ignatov
In addition to already existing BPF hooks for sys_bind and sys_connect, the patch provides new hooks for sys_sendmsg. It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that provides access to socket itlself (properties like family, type, protocol) and user-passed `struct sockaddr *` so that BPF program can override destination IP and port for system calls such as sendto(2) or sendmsg(2) and/or assign source IP to the socket. The hooks are implemented as two new attach types: `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and UDPv6 correspondingly. UDPv4 and UDPv6 separate attach types for same reason as sys_bind and sys_connect hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound. The difference with already existing hooks is sys_sendmsg are implemented only for unconnected UDP. For TCP it doesn't make sense to change user-provided `struct sockaddr *` at sendto(2)/sendmsg(2) time since socket either was already connected and has source/destination set or wasn't connected and call to sendto(2)/sendmsg(2) would lead to ENOTCONN anyway. Connected UDP is already handled by sys_connect hooks that can override source/destination at connect time and use fast-path later, i.e. these hooks don't affect UDP fast-path. Rewriting source IP is implemented differently than that in sys_connect hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to just bind socket to desired local IP address since source IP can be set on per-packet basis by using ancillary data (cmsg(3)). So no matter if socket is bound or not, source IP has to be rewritten on every call to sys_sendmsg. To do so two new fields are added to UAPI `struct bpf_sock_addr`; * `msg_src_ip4` to set source IPv4 for UDPv4; * `msg_src_ip6` to set source IPv6 for UDPv6. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Lots of easy overlapping changes in the confict resolutions here. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-26KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capabilityVitaly Kuznetsov
We need a new capability to indicate support for the newly added HvFlushVirtualAddress{List,Space}{,Ex} hypercalls. Upon seeing this capability, userspace is supposed to announce PV TLB flush features by setting the appropriate CPUID bits (if needed). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-05-26aio: implement IOCB_CMD_POLLChristoph Hellwig
Simple one-shot poll through the io_submit() interface. To poll for a file descriptor the application should submit an iocb of type IOCB_CMD_POLL. It will poll the fd for the events specified in the the first 32 bits of the aio_buf field of the iocb. Unlike poll or epoll without EPOLLONESHOT this interface always works in one shot mode, that is once the iocb is completed, it will have to be resubmitted. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-26uapi: turn __poll_t sparse checks on by defaultChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Let's begin the holiday weekend with some networking fixes: 1) Whoops need to restrict cfg80211 wiphy names even more to 64 bytes. From Eric Biggers. 2) Fix flags being ignored when using kernel_connect() with SCTP, from Xin Long. 3) Use after free in DCCP, from Alexey Kodanev. 4) Need to check rhltable_init() return value in ipmr code, from Eric Dumazet. 5) XDP handling fixes in virtio_net from Jason Wang. 6) Missing RTA_TABLE in rtm_ipv4_policy[], from Roopa Prabhu. 7) Need to use IRQ disabling spinlocks in mlx4_qp_lookup(), from Jack Morgenstein. 8) Prevent out-of-bounds speculation using indexes in BPF, from Daniel Borkmann. 9) Fix regression added by AF_PACKET link layer cure, from Willem de Bruijn. 10) Correct ENIC dma mask, from Govindarajulu Varadarajan. 11) Missing config options for PMTU tests, from Stefano Brivio" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits) ibmvnic: Fix partial success login retries selftests/net: Add missing config options for PMTU tests mlx4_core: allocate ICM memory in page size chunks enic: set DMA mask to 47 bit ppp: remove the PPPIOCDETACH ioctl ipv4: remove warning in ip_recv_error net : sched: cls_api: deal with egdev path only if needed vhost: synchronize IOTLB message with dev cleanup packet: fix reserve calculation net/mlx5: IPSec, Fix a race between concurrent sandbox QP commands net/mlx5e: When RXFCS is set, add FCS data into checksum calculation bpf: properly enforce index mask to prevent out-of-bounds speculation net/mlx4: Fix irq-unsafe spinlock usage net: phy: broadcom: Fix bcm_write_exp() net: phy: broadcom: Fix auxiliary control register reads net: ipv4: add missing RTA_TABLE to rtm_ipv4_policy net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message ibmvnic: Only do H_EOI for mobility events tuntap: correctly set SOCKWQ_ASYNC_NOSPACE virtio-net: fix leaking page for gso packet during mergeable XDP ...
2018-05-25openvswitch: Add conntrack limit netlink definitionYi-Hung Wei
Define netlink messages and attributes to support user kernel communication that uses the conntrack limit feature. Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-25Merge tag 'mlx5e-updates-2018-05-19' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5e-updates-2018-05-19 This series contains updates for mlx5e netdevice driver with one subject, DSCP to priority mapping, in the first patch Huy adds the needed API in dcbnl, the second patch adds the needed mlx5 core capability bits for the feature, and all other patches are mlx5e (netdev) only changes to add support for the feature. From: Huy Nguyen Dscp to priority mapping for Ethernet packet: These patches enable differentiated services code point (dscp) to priority mapping for Ethernet packet. Once this feature is enabled, the packet is routed to the corresponding priority based on its dscp. User can combine this feature with priority flow control (pfc) feature to have priority flow control based on the dscp. Firmware interface: Mellanox firmware provides two control knobs for this feature: QPTS register allow changing the trust state between dscp and pcp mode. The default is pcp mode. Once in dscp mode, firmware will route the packet based on its dscp value if the dscp field exists. QPDPM register allow mapping a specific dscp (0 to 63) to a specific priority (0 to 7). By default, all the dscps are mapped to priority zero. Software interface: This feature is controlled via application priority TLV. IEEE specification P802.1Qcd/D2.1 defines priority selector id 5 for application priority TLV. This APP TLV selector defines DSCP to priority map. This APP TLV can be sent by the switch or can be set locally using software such as lldptool. In mlx5 drivers, we add the support for net dcb's getapp and setapp call back. Mlx5 driver only handles the selector id 5 application entry (dscp application priority application entry). If user sends multiple dscp to priority APP TLV entries on the same dscp, the last sent one will take effect. All the previous sent will be deleted. This attribute combined with pfc attribute allows advanced user to fine tune the qos setting for specific priority queue. For example, user can give dedicated buffer for one or more priorities or user can give large buffer to certain priorities. The dcb buffer configuration will be controlled by lldptool. >> lldptool -T -i eth2 -V BUFFER prio 0,2,5,7,1,2,3,6 maps priorities 0,1,2,3,4,5,6,7 to receive buffer 0,2,5,7,1,2,3,6 >> lldptool -T -i eth2 -V BUFFER size 87296,87296,0,87296,0,0,0,0 sets receive buffer size for buffer 0,1,2,3,4,5,6,7 respectively After discussion on mailing list with Jakub, Jiri, Ido and John, we agreed to choose dcbnl over devlink interface since this feature is intended to set port attributes which are governed by the netdev instance of that port, where devlink API is more suitable for global ASIC configurations. The firmware trust state (in QPTS register) is changed based on the number of dscp to priority application entries. When the first dscp to priority application entry is added by the user, the trust state is changed to dscp. When the last dscp to priority application entry is deleted by the user, the trust state is changed to pcp. When the port is in DSCP trust state, the transmit queue is selected based on the dscp of the skb. When the port is in DSCP trust state and vport inline mode is not NONE, firmware requires mlx5 driver to copy the IP header to the wqe ethernet segment inline header if the skb has it. This is done by changing the transmit queue sq's min inline mode to L3. Note that the min inline mode of sqs that belong to other features such as xdpsq, icosq are not modified. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-25net: bridge: add support for port isolationNikolay Aleksandrov
This patch adds support for a new port flag - BR_ISOLATED. If it is set then isolated ports cannot communicate between each other, but they can still communicate with non-isolated ports. The same can be achieved via ACLs but they can't scale with large number of ports and also the complexity of the rules grows. This feature can be used to achieve isolated vlan functionality (similar to pvlan) as well, though currently it will be port-wide (for all vlans on the port). The new test in should_deliver uses data that is already cache hot and the new boolean is used to avoid an additional source port test in should_deliver. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24ppp: remove the PPPIOCDETACH ioctlEric Biggers
The PPPIOCDETACH ioctl effectively tries to "close" the given ppp file before f_count has reached 0, which is fundamentally a bad idea. It does check 'f_count < 2', which excludes concurrent operations on the file since they would only be possible with a shared fd table, in which case each fdget() would take a file reference. However, it fails to account for the fact that even with 'f_count == 1' the file can still be linked into epoll instances. As reported by syzbot, this can trivially be used to cause a use-after-free. Yet, the only known user of PPPIOCDETACH is pppd versions older than ppp-2.4.2, which was released almost 15 years ago (November 2003). Also, PPPIOCDETACH apparently stopped working reliably at around the same time, when the f_count check was added to the kernel, e.g. see https://lkml.org/lkml/2002/12/31/83. Also, the current 'f_count < 2' check makes PPPIOCDETACH only work in single-threaded applications; it always fails if called from a multithreaded application. All pppd versions released in the last 15 years just close() the file descriptor instead. Therefore, instead of hacking around this bug by exporting epoll internals to modules, and probably missing other related bugs, just remove the PPPIOCDETACH ioctl and see if anyone actually notices. Leave a stub in place that prints a one-time warning and returns EINVAL. Reported-by: syzbot+16363c99d4134717c05b@syzkaller.appspotmail.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Guillaume Nault <g.nault@alphalink.fr> Tested-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2018-05-24 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc). 2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers. 3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit. 4) Jiong Wang adds support for indirect and arithmetic shifts to NFP 5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible. 6) Mathieu Xhonneux adds an End.BPF action to seg6local with BPF helpers allowing to edit/grow/shrink a SRH and apply on a packet generic SRv6 actions. 7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT. 8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events. 9) other misc fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24bpf: introduce bpf subcommand BPF_TASK_FD_QUERYYonghong Song
Currently, suppose a userspace application has loaded a bpf program and attached it to a tracepoint/kprobe/uprobe, and a bpf introspection tool, e.g., bpftool, wants to show which bpf program is attached to which tracepoint/kprobe/uprobe. Such attachment information will be really useful to understand the overall bpf deployment in the system. There is a name field (16 bytes) for each program, which could be used to encode the attachment point. There are some drawbacks for this approaches. First, bpftool user (e.g., an admin) may not really understand the association between the name and the attachment point. Second, if one program is attached to multiple places, encoding a proper name which can imply all these attachments becomes difficult. This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY. Given a pid and fd, if the <pid, fd> is associated with a tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return . prog_id . tracepoint name, or . k[ret]probe funcname + offset or kernel addr, or . u[ret]probe filename + offset to the userspace. The user can use "bpftool prog" to find more information about bpf program itself with prog_id. Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24net/dcb: Add dcbnl buffer attributeHuy Nguyen
In this patch, we add dcbnl buffer attribute to allow user change the NIC's buffer configuration such as priority to buffer mapping and buffer size of individual buffer. This attribute combined with pfc attribute allows advanced user to fine tune the qos setting for specific priority queue. For example, user can give dedicated buffer for one or more priorities or user can give large buffer to certain priorities. The dcb buffer configuration will be controlled by lldptool. lldptool -T -i eth2 -V BUFFER prio 0,2,5,7,1,2,3,6 maps priorities 0,1,2,3,4,5,6,7 to receive buffer 0,2,5,7,1,2,3,6 lldptool -T -i eth2 -V BUFFER size 87296,87296,0,87296,0,0,0,0 sets receive buffer size for buffer 0,1,2,3,4,5,6,7 respectively After discussion on mailing list with Jakub, Jiri, Ido and John, we agreed to choose dcbnl over devlink interface since this feature is intended to set port attributes which are governed by the netdev instance of that port, where devlink API is more suitable for global ASIC configurations. We present an use case scenario where dcbnl buffer attribute configured by advance user helps reduce the latency of messages of different sizes. Scenarios description: On ConnectX-5, we run latency sensitive traffic with small/medium message sizes ranging from 64B to 256KB and bandwidth sensitive traffic with large messages sizes 512KB and 1MB. We group small, medium, and large message sizes to their own pfc enables priorities as follow. Priorities 1 & 2 (64B, 256B and 1KB) Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB) Priorities 5 & 6 (512KB and 1MB) By default, ConnectX-5 maps all pfc enabled priorities to a single lossless fixed buffer size of 50% of total available buffer space. The other 50% is assigned to lossy buffer. Using dcbnl buffer attribute, we create three equal size lossless buffers. Each buffer has 25% of total available buffer space. Thus, the lossy buffer size reduces to 25%. Priority to lossless buffer mappings are set as follow. Priorities 1 & 2 on lossless buffer #1 Priorities 3 & 4 on lossless buffer #2 Priorities 5 & 6 on lossless buffer #3 We observe improvements in latency for small and medium message sizes as follows. Please note that the large message sizes bandwidth performance is reduced but the total bandwidth remains the same. 256B message size (42 % latency reduction) 4K message size (21% latency reduction) 64K message size (16% latency reduction) CC: Ido Schimmel <idosch@idosch.org> CC: Jakub Kicinski <jakub.kicinski@netronome.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Or Gerlitz <gerlitz.or@gmail.com> CC: Parav Pandit <parav@mellanox.com> CC: Aron Silverton <aron.silverton@oracle.com> Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-24tty: fix typo in ASYNCB_FOURPORT commentGiulio Benetti
On #define ASYNCB_FOURPORT there's an ortography error on comment: "Set OU1, OUT2 per AST Fourport settings" Change it into: "Set OUT1, OUT2 per AST Fourport settings" Signed-off-by: Giulio Benetti <giulio.benetti@micronovasrl.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-24ipv6: sr: Add seg6local action End.BPFMathieu Xhonneux
This patch adds the End.BPF action to the LWT seg6local infrastructure. This action works like any other seg6local End action, meaning that an IPv6 header with SRH is needed, whose DA has to be equal to the SID of the action. It will also advance the SRH to the next segment, the BPF program does not have to take care of this. Since the BPF program may not be a source of instability in the kernel, it is important to ensure that the integrity of the packet is maintained before yielding it back to the IPv6 layer. The hook hence keeps track if the SRH has been altered through the helpers, and re-validates its content if needed with seg6_validate_srh. The state kept for validation is stored in a per-CPU buffer. The BPF program is not allowed to directly write into the packet, and only some fields of the SRH can be altered through the helper bpf_lwt_seg6_store_bytes. Performances profiling has shown that the SRH re-validation does not induce a significant overhead. If the altered SRH is deemed as invalid, the packet is dropped. This validation is also done before executing any action through bpf_lwt_seg6_action, and will not be performed again if the SRH is not modified after calling the action. The BPF program may return 3 types of return codes: - BPF_OK: the End.BPF action will look up the next destination through seg6_lookup_nexthop. - BPF_REDIRECT: if an action has been executed through the bpf_lwt_seg6_action helper, the BPF program should return this value, as the skb's destination is already set and the default lookup should not be performed. - BPF_DROP : the packet will be dropped. Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Acked-by: David Lebrun <dlebrun@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24bpf: Add IPv6 Segment Routing helpersMathieu Xhonneux
The BPF seg6local hook should be powerful enough to enable users to implement most of the use-cases one could think of. After some thinking, we figured out that the following actions should be possible on a SRv6 packet, requiring 3 specific helpers : - bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH - bpf_lwt_seg6_adjust_srh: Allow to grow or shrink a SRH (to add/delete TLVs) - bpf_lwt_seg6_action: Apply some SRv6 network programming actions (specifically End.X, End.T, End.B6 and End.B6.Encap) The specifications of these helpers are provided in the patch (see include/uapi/linux/bpf.h). The non-sensitive fields of the SRH are the following : flags, tag and TLVs. The other fields can not be modified, to maintain the SRH integrity. Flags, tag and TLVs can easily be modified as their validity can be checked afterwards via seg6_validate_srh. It is not allowed to modify the segments directly. If one wants to add segments on the path, he should stack a new SRH using the End.B6 action via bpf_lwt_seg6_action. Growing, shrinking or editing TLVs via the helpers will flag the SRH as invalid, and it will have to be re-validated before re-entering the IPv6 layer. This flag is stored in a per-CPU buffer, along with the current header length in bytes. Storing the SRH len in bytes in the control block is mandatory when using bpf_lwt_seg6_adjust_srh. The Header Ext. Length field contains the SRH len rounded to 8 bytes (a padding TLV can be inserted to ensure the 8-bytes boundary). When adding/deleting TLVs within the BPF program, the SRH may temporary be in an invalid state where its length cannot be rounded to 8 bytes without remainder, hence the need to store the length in bytes separately. The caller of the BPF program can then ensure that the SRH's final length is valid using this value. Again, a final SRH modified by a BPF program which doesn’t respect the 8-bytes boundary will be discarded as it will be considered as invalid. Finally, a fourth helper is provided, bpf_lwt_push_encap, which is available from the LWT BPF IN hook, but not from the seg6local BPF one. This helper allows to encapsulate a Segment Routing Header (either with a new outer IPv6 header, or by inlining it directly in the existing IPv6 header) into a non-SRv6 packet. This helper is required if we want to offer the possibility to dynamically encapsulate a SRH for non-SRv6 packet, as the BPF seg6local hook only works on traffic already containing a SRH. This is the BPF equivalent of the seg6 LWT infrastructure, which achieves the same purpose but with a static SRH per route. These helpers require CONFIG_IPV6=y (and not =m). Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Acked-by: David Lebrun <dlebrun@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24bpf: get JITed image lengths of functions via syscallSandipan Das
This adds new two new fields to struct bpf_prog_info. For multi-function programs, these fields can be used to pass a list of the JITed image lengths of each function for a given program to userspace using the bpf system call with the BPF_OBJ_GET_INFO_BY_FD command. This can be used by userspace applications like bpftool to split up the contiguous JITed dump, also obtained via the system call, into more relatable chunks corresponding to each function. Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24bpf: get kernel symbol addresses via syscallSandipan Das
This adds new two new fields to struct bpf_prog_info. For multi-function programs, these fields can be used to pass a list of kernel symbol addresses for all functions in a given program to userspace using the bpf system call with the BPF_OBJ_GET_INFO_BY_FD command. When bpf_jit_kallsyms is enabled, we can get the address of the corresponding kernel symbol for a callee function and resolve the symbol's name. The address is determined by adding the value of the call instruction's imm field to __bpf_call_base. This offset gets assigned to the imm field by the verifier. For some architectures, such as powerpc64, the imm field is not large enough to hold this offset. We resolve this by: [1] Assigning the subprog id to the imm field of a call instruction in the verifier instead of the offset of the callee's symbol's address from __bpf_call_base. [2] Determining the address of a callee's corresponding symbol by using the imm field as an index for the list of kernel symbol addresses now available from the program info. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: 1) Remove obsolete nf_log tracing from nf_tables, from Florian Westphal. 2) Add support for map lookups to numgen, random and hash expressions, from Laura Garcia. 3) Allow to register nat hooks for iptables and nftables at the same time. Patchset from Florian Westpha. 4) Timeout support for rbtree sets. 5) ip6_rpfilter works needs interface for link-local addresses, from Vincent Bernat. 6) Add nf_ct_hook and nf_nat_hook structures and use them. 7) Do not drop packets on packets raceing to insert conntrack entries into hashes, this is particularly a problem in nfqueue setups. 8) Address fallout from xt_osf separation to nf_osf, patches from Florian Westphal and Fernando Mancera. 9) Remove reference to struct nft_af_info, which doesn't exist anymore. From Taehee Yoo. This batch comes with is a conflict between 25fd386e0bc0 ("netfilter: core: add missing __rcu annotation") in your tree and 2c205dd3981f ("netfilter: add struct nf_nat_hook and use it") coming in this batch. This conflict can be solved by leaving the __rcu tag on __netfilter_net_init() - added by 25fd386e0bc0 - and remove all code related to nf_nat_decode_session_hook - which is gone after 2c205dd3981f, as described by: diff --cc net/netfilter/core.c index e0ae4aae96f5,206fb2c4c319..168af54db975 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@@ -611,7 -580,13 +611,8 @@@ const struct nf_conntrack_zone nf_ct_zo EXPORT_SYMBOL_GPL(nf_ct_zone_dflt); #endif /* CONFIG_NF_CONNTRACK */ - static void __net_init __netfilter_net_init(struct nf_hook_entries **e, int max) -#ifdef CONFIG_NF_NAT_NEEDED -void (*nf_nat_decode_session_hook)(struct sk_buff *, struct flowi *); -EXPORT_SYMBOL(nf_nat_decode_session_hook); -#endif - + static void __net_init + __netfilter_net_init(struct nf_hook_entries __rcu **e, int max) { int h; I can also merge your net-next tree into nf-next, solve the conflict and resend the pull request if you prefer so. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-23Merge tag 'mac80211-next-for-davem-2018-05-23' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: For this round, we have various things all over the place, notably * a fix for a race in aggregation, which I want to let bake for a bit longer before sending to stable * some new statistics (ACK RSSI, TXQ) * TXQ configuration * preparations for HE, particularly radiotap * replace confusing "country IE" by "country element" since it's not referring to Ireland Note that I merged net-next to get a fix from mac80211 that got there via net, to apply one patch that would otherwise conflict. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-23ipv4: support sport, dport and ip_proto in RTM_GETROUTERoopa Prabhu
This is a followup to fib rules sport, dport and ipproto match support. Only supports tcp, udp and icmp for ipproto. Used by fib rule self tests. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-23net: add skeleton of bpfilter kernel moduleAlexei Starovoitov
bpfilter.ko consists of bpfilter_kern.c (normal kernel module code) and user mode helper code that is embedded into bpfilter.ko The steps to build bpfilter.ko are the following: - main.c is compiled by HOSTCC into the bpfilter_umh elf executable file - with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file is converted into bpfilter_umh.o object file with _binary_net_bpfilter_bpfilter_umh_start and _end symbols Example: $ nm ./bld_x64/net/bpfilter/bpfilter_umh.o 0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end 0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size 0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start - bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko bpfilter_kern.c is a normal kernel module code that calls the fork_usermode_blob() helper to execute part of its own data as a user mode process. Notice that _binary_net_bpfilter_bpfilter_umh_start - end is placed into .init.rodata section, so it's freed as soon as __init function of bpfilter.ko is finished. As part of __init the bpfilter.ko does first request/reply action via two unix pipe provided by fork_usermode_blob() helper to make sure that umh is healthy. If not it will kill it via pid. Later bpfilter_process_sockopt() will be called from bpfilter hooks in get/setsockopt() to pass iptable commands into umh via bpfilter.ko If admin does 'rmmod bpfilter' the __exit code bpfilter.ko will kill umh as well. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-23bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_infoMartin KaFai Lau
In "struct bpf_map_info", the name "btf_id", "btf_key_id" and "btf_value_id" could cause confusion because the "id" of "btf_id" means the BPF obj id given to the BTF object while "btf_key_id" and "btf_value_id" means the BTF type id within that BTF object. To make it clear, btf_key_id and btf_value_id are renamed to btf_key_type_id and btf_value_type_id. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-23bpf: btf: Remove unused bits from uapi/linux/btf.hMartin KaFai Lau
This patch does the followings: 1. Limit BTF_MAX_TYPES and BTF_MAX_NAME_OFFSET to 64k. We can raise it later. 2. Remove the BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID. They are currently encoded at the highest bit of a u32. It is because the current use case does not require supporting parent type (i.e type_id referring to a type in another BTF file). It also does not support referring to a string in ELF. The BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID checks are replaced by BTF_TYPE_ID_CHECK and BTF_STR_OFFSET_CHECK which are defined in btf.c instead of uapi/linux/btf.h. 3. Limit the BTF_INFO_KIND from 5 bits to 4 bits which is enough. There is unused bits headroom if we ever needed it later. 4. The root bit in BTF_INFO is also removed because it is not used in the current use case. 5. Remove BTF_INT_VARARGS since func type is not supported now. The BTF_INT_ENCODING is limited to 4 bits instead of 8 bits. The above can be added back later because the verifier ensures the unused bits are zeros. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>