summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-03-29efi/libstub/arm64: Avoid image_base value from efi_loaded_imageArd Biesheuvel
Commit: 9f9223778ef3 ("efi/libstub/arm: Make efi_entry() an ordinary PE/COFF entrypoint") did some code refactoring to get rid of the EFI entry point assembler code, and in the process, it got rid of the assignment of image_addr to the value of _text. Instead, it switched to using the image_base field of the efi_loaded_image struct provided by UEFI, which should contain the same value. However, Michael reports that this is not the case: older GRUB builds corrupt this value in some way, and since we can easily switch back to referring to _text to discover this value, let's simply do that. While at it, fix another issue in commit 9f9223778ef3, which may result in the unassigned image_addr to be misidentified as the preferred load offset of the kernel, which is unlikely but will cause a boot crash if it does occur. Finally, let's add a warning if the _text vs. image_base discrepancy is detected, so we can tell more easily how widespread this issue actually is. Reported-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-efi@vger.kernel.org
2020-03-29i3c: convert to use i2c_new_client_device()Wolfram Sang
Move away from the deprecated API. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Link: https://lore.kernel.org/linux-i3c/20200326211002.13241-2-wsa+renesas@sang-engineering.com
2020-03-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix memory leak in vti6, from Torsten Hilbrich. 2) Fix double free in xfrm_policy_timer, from YueHaibing. 3) NL80211_ATTR_CHANNEL_WIDTH attribute is put with wrong type, from Johannes Berg. 4) Wrong allocation failure check in qlcnic driver, from Xu Wang. 5) Get ks8851-ml IO operations right, for real this time, from Marek Vasut. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (22 commits) r8169: fix PHY driver check on platforms w/o module softdeps net: ks8851-ml: Fix IO operations, again mlxsw: spectrum_mr: Fix list iteration in error path qlcnic: Fix bad kzalloc null test mac80211: set IEEE80211_TX_CTRL_PORT_CTRL_PROTO for nl80211 TX mac80211: mark station unauthorized before key removal mac80211: Check port authorization in the ieee80211_tx_dequeue() case cfg80211: Do not warn on same channel at the end of CSA mac80211: drop data frames without key on encrypted links ieee80211: fix HE SPR size calculation nl80211: fix NL80211_ATTR_CHANNEL_WIDTH attribute type xfrm: policy: Fix doulbe free in xfrm_policy_timer bpf: Explicitly memset some bpf info structures declared on the stack bpf: Explicitly memset the bpf_attr structure bpf: Sanitize the bpf_struct_ops tcp-cc name vti6: Fix memory leak of skb if input policy check fails esp: remove the skb from the chain when it's enqueued in cryptd_wq ipv6: xfrm6_tunnel.c: Use built-in RCU list checking xfrm: add the missing verify_sec_ctx_len check in xfrm_add_acquire xfrm: fix uctx len check in verify_sec_ctx_len ...
2020-03-28Merge branch 'ifla_xdp_expected_fd'Alexei Starovoitov
Toke Høiland-Jørgensen says: ==================== This series adds support for atomically replacing the XDP program loaded on an interface. This is achieved by means of a new netlink attribute that can specify the expected previous program to replace on the interface. If set, the kernel will compare this "expected fd" attribute with the program currently loaded on the interface, and reject the operation if it does not match. With this primitive, userspace applications can avoid stepping on each other's toes when simultaneously updating the loaded XDP program. Changelog: v4: - Switch back to passing FD instead of ID (Andrii) - Rename flag to XDP_FLAGS_REPLACE (for consistency with other similar uses) v3: - Pass existing ID instead of FD (Jakub) - Use opts struct for new libbpf function (Andrii) v2: - Fix checkpatch nits and add .strict_start_type to netlink policy (Jakub) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-03-28selftests/bpf: Add tests for attaching XDP programsToke Høiland-Jørgensen
This adds tests for the various replacement operations using IFLA_XDP_EXPECTED_FD. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158515700967.92963.15098921624731968356.stgit@toke.dk
2020-03-28libbpf: Add function to set link XDP fd while specifying old programToke Høiland-Jørgensen
This adds a new function to set the XDP fd while specifying the FD of the program to replace, using the newly added IFLA_XDP_EXPECTED_FD netlink parameter. The new function uses the opts struct mechanism to be extendable in the future. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158515700857.92963.7052131201257841700.stgit@toke.dk
2020-03-28tools: Add EXPECTED_FD-related definitions in if_link.hToke Høiland-Jørgensen
This adds the IFLA_XDP_EXPECTED_FD netlink attribute definition and the XDP_FLAGS_REPLACE flag to if_link.h in tools/include. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158515700747.92963.8615391897417388586.stgit@toke.dk
2020-03-28xdp: Support specifying expected existing program when attaching XDPToke Høiland-Jørgensen
While it is currently possible for userspace to specify that an existing XDP program should not be replaced when attaching to an interface, there is no mechanism to safely replace a specific XDP program with another. This patch adds a new netlink attribute, IFLA_XDP_EXPECTED_FD, which can be set along with IFLA_XDP_FD. If set, the kernel will check that the program currently loaded on the interface matches the expected one, and fail the operation if it does not. This corresponds to a 'cmpxchg' memory operation. Setting the new attribute with a negative value means that no program is expected to be attached, which corresponds to setting the UPDATE_IF_NOEXIST flag. A new companion flag, XDP_FLAGS_REPLACE, is also added to explicitly request checking of the EXPECTED_FD attribute. This is needed for userspace to discover whether the kernel supports the new attribute. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/158515700640.92963.3551295145441017022.stgit@toke.dk
2020-03-28platform/x86: surface3_power: Fix Kconfig section orderingAndy Shevchenko
Kconfig section is misplaced. Put it in the same order as it is done in Makefile for this driver. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2020-03-28platform/x86: surface3_power: Add missed headersAndy Shevchenko
We obviously are users of bits.h and types.h. Add them to the list. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2020-03-28platform/x86: surface3_power: Reformat GUID assignmentAndy Shevchenko
For better readability reformat GUID assignment. While here, add the comment how this GUID looks in a string representation. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2020-03-28platform/x86: surface3_power: Drop useless macro ACPI_PTR()Andy Shevchenko
Driver depends to ACPI, this marco always is evaluated to the parameter, thus useless. Drop it for good. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2020-03-28platform/x86: surface3_power: Prefix POLL_INTERVAL with SURFACE_3Andy Shevchenko
For better namespace maintenance prefix POLL_INTERVAL macro with SURFACE_3. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2020-03-28platform/x86: surface3_power: Simplify mshw0011_adp_psr() to one linerAndy Shevchenko
Refactor mshw0011_adp_psr() to be one liner. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2020-03-28platform/x86: surface3_power: Use dev_err() instead of pr_err()Andy Shevchenko
We have device and we may use it to print messages. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2020-03-28platform/x86: surface3_power: Drop unused structure definitionAndy Shevchenko
As reported by kbuild bot the struct mshw0011_lookup in never used. Drop its definition for good. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2020-03-28Merge branch 'i2c/for-current' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "Three more driver bugfixes, and two doc improvements fixing build warnings while we are here" * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: pca-platform: Use platform_irq_get_optional i2c: st: fix missing struct parameter description i2c: nvidia-gpu: Handle timeout correctly in gpu_i2c_check_status() i2c: fix a doc warning i2c: hix5hd2: add missed clk_disable_unprepare in remove
2020-03-28bpf: Fix build warning regarding missing prototypesJean-Philippe Menil
Fix build warnings when building net/bpf/test_run.o with W=1 due to missing prototype for bpf_fentry_test{1..6}. Instead of declaring prototypes, turn off warnings with __diag_{push,ignore,pop} as pointed out by Alexei. Signed-off-by: Jean-Philippe Menil <jpmenil@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200327204713.28050-1-jpmenil@gmail.com
2020-03-28MIPS: ralink: mt7621: Fix soc_device introductionThomas Bogendoerfer
Depending on selected SMP config options soc_device didn't get initialised at all. With UP config vmlinux didn't link because of missing soc bus. Fixes: 71b9b5e0130d ("MIPS: ralink: mt7621: introduce 'soc_device' initialization") Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Tested-by: René van Dorst <opensource@vdorst.com>
2020-03-28Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Two small fixes: one in drivers (qla2xxx), and one in the core (sd) to try to cope with USB enclosures that silently change reported parameters" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: sd: Fix optimal I/O size for devices that change reported values scsi: qla2xxx: Fix I/Os being passed down when FC device is being deleted
2020-03-28libbpf, xsk: Init all ring members in xsk_umem__create and xsk_socket__createFletcher Dunn
Fix a sharp edge in xsk_umem__create and xsk_socket__create. Almost all of the members of the ring buffer structs are initialized, but the "cached_xxx" variables are not all initialized. The caller is required to zero them. This is needlessly dangerous. The results if you don't do it can be very bad. For example, they can cause xsk_prod_nb_free and xsk_cons_nb_avail to return values greater than the size of the queue. xsk_ring_cons__peek can return an index that does not refer to an item that has been queued. I have confirmed that without this change, my program misbehaves unless I memset the ring buffers to zero before calling the function. Afterwards, my program works without (or with) the memset. Signed-off-by: Fletcher Dunn <fletcherd@valvesoftware.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/85f12913cde94b19bfcb598344701c38@valvesoftware.com
2020-03-28fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_tThomas Gleixner
Bit spinlocks are problematic if PREEMPT_RT is enabled, because they disable preemption, which is undesired for latency reasons and breaks when regular spinlocks are taken within the bit_spinlock locked region because regular spinlocks are converted to 'sleeping spinlocks' on RT. PREEMPT_RT replaced the bit spinlocks with regular spinlocks to avoid this problem. The replacement was done conditionaly at compile time, but Christoph requested to do an unconditional conversion. Jan suggested to move the spinlock into a existing padding hole which avoids a size increase of struct buffer_head on production kernels. As a benefit the lock gains lockdep coverage. [ bigeasy: Remove the wrapper and use always spinlock_t and move it into the padding hole ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Link: https://lkml.kernel.org/r/20191118132824.rclhrbujqh4b4g4d@linutronix.de
2020-03-28thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_tClark Williams
The pkg_temp_lock spinlock is acquired in the thermal vector handler which is truly atomic context even on PREEMPT_RT kernels. The critical sections are tiny, so change it to a raw spinlock. Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20191008110021.2j44ayunal7fkb7i@linutronix.de
2020-03-28Documentation/locking/locktypes: Minor copy editor fixesRandy Dunlap
Minor editorial fixes: - remove 'enabled' from PREEMPT_RT enabled kernels for consistency - add some periods for consistency - add "'" for possessive CPU's - spell out interrupts [ tglx: Picked up Paul's suggestions ] Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lkml.kernel.org/r/ac615f36-0b44-408d-aeab-d76e4241add4@infradead.org
2020-03-28Documentation/locking/locktypes: Further clarifications and wordsmithingThomas Gleixner
The documentation of rw_semaphores is wrong as it claims that the non-owner reader release is not supported by RT. That's just history biased memory distortion. Split the 'Owner semantics' section up and add separate sections for semaphore and rw_semaphore to reflect reality. Aside of that the following updates are done: - Add pseudo code to document the spinlock state preserving mechanism on PREEMPT_RT - Wordsmith the bitspinlock and lock nesting sections Co-developed-by: Paul McKenney <paulmck@kernel.org> Signed-off-by: Paul McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/87wo78y5yy.fsf@nanos.tec.linutronix.de
2020-03-28x86/boot/compressed: Fix debug_puthex() parameter typeJoerg Roedel
In the CONFIG_X86_VERBOSE_BOOTUP=Y case, the debug_puthex() macro just turns into __puthex(), which takes 'unsigned long' as parameter. But in the CONFIG_X86_VERBOSE_BOOTUP=N case, it is a function which takes 'unsigned char *', causing compile warnings when the function is used. Fix the parameter type to get rid of the warnings. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200319091407.1481-11-joro@8bytes.org
2020-03-28Merge branch 'uaccess.futex' of ↵Thomas Gleixner
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into locking/core Pull uaccess futex cleanups for Al Viro: Consolidate access_ok() usage and the futex uaccess function zoo.
2020-03-28Merge branch 'next.uaccess-2' of ↵Thomas Gleixner
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into x86/cleanups Pull uaccess cleanups from Al Viro: Consolidate the user access areas and get rid of uaccess_try(), user_ex() and other warts.
2020-03-28m68knommu: Remove mm.h include from uaccess_no.hThomas Gleixner
In file included from include/linux/huge_mm.h:8, from include/linux/mm.h:567, from arch/m68k/include/asm/uaccess_no.h:8, from arch/m68k/include/asm/uaccess.h:3, from include/linux/uaccess.h:11, from include/linux/sched/task.h:11, from include/linux/sched/signal.h:9, from include/linux/rcuwait.h:6, from include/linux/percpu-rwsem.h:7, from kernel/locking/percpu-rwsem.c:6: include/linux/fs.h:1422:29: error: array type has incomplete element type 'struct percpu_rw_semaphore' 1422 | struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS]; Removing the include of linux/mm.h from the uaccess header solves the problem and various build tests of nommu configurations still work. Fixes: 80fbaf1c3f29 ("rcuwait: Add @state argument to rcuwait_wait_event()") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lkml.kernel.org/r/87fte1qzh0.fsf@nanos.tec.linutronix.de
2020-03-28cpu/hotplug: Ignore pm_wakeup_pending() for disable_nonboot_cpus()Thomas Gleixner
A recent change to freeze_secondary_cpus() which added an early abort if a wakeup is pending missed the fact that the function is also invoked for shutdown, reboot and kexec via disable_nonboot_cpus(). In case of disable_nonboot_cpus() the wakeup event needs to be ignored as the purpose is to terminate the currently running kernel. Add a 'suspend' argument which is only set when the freeze is in context of a suspend operation. If not set then an eventually pending wakeup event is ignored. Fixes: a66d955e910a ("cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending") Reported-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Pavankumar Kondeti <pkondeti@codeaurora.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/874kuaxdiz.fsf@nanos.tec.linutronix.de
2020-03-28Revert "clocksource/drivers/timer-probe: Avoid creating dead devices"Thomas Gleixner
This reverts commit 4f41fe386a94639cd9a1831298d4f85db5662f1e. The change breaks systems on which the DT node of a device is used by multiple drivers. The proposed workaround to clear OF_POPULATED is just a band aid and this needs to be cleaned up at the root of the problem. Revert this for now. Reported-by: Ionela Voinescu <ionela.voinescu@arm.com> Reported-by: Jon Hunter <jonathanh@nvidia.com> Requested-by: Rob Herring <robh+dt@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Saravana Kannan <saravanak@google.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://lore.kernel.org/r/20200324175955.GA16972@arm.com
2020-03-28MAINTAINERS: erofs: update my email addressGao Xiang
This email address will not be available in a few days. Update my own email address to xiang@kernel.org, which should be available all the time. Link: https://lore.kernel.org/r/20200328040036.117974-1-gaoxiang25@huawei.com Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
2020-03-28i2c: pca-platform: Use platform_irq_get_optionalChris Packham
The interrupt is not required so use platform_irq_get_optional() to avoid error messages like i2c-pca-platform 22080000.i2c: IRQ index 0 not found Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2020-03-28i2c: st: fix missing struct parameter descriptionAlain Volmat
Fix a missing struct parameter description to allow warning free W=1 compilation. Signed-off-by: Alain Volmat <avolmat@me.com> Reviewed-by: Patrice Chotard <patrice.chotard@st.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2020-03-27x86: get rid of user_atomic_cmpxchg_inatomic()Al Viro
Only one user left; the thing had been made polymorphic back in 2013 for the sake of MPX. No point keeping it now that MPX is gone. Convert futex_atomic_cmpxchg_inatomic() to user_access_{begin,end}() while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27generic arch_futex_atomic_op_inuser() doesn't need access_ok()Al Viro
uses get_user() and put_user() for memory accesses Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27x86: don't reload after cmpxchg in unsafe_atomic_op2() loopAl Viro
lock cmpxchg leaves the current value in eax; no need to reload it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27x86: convert arch_futex_atomic_op_inuser() to ↵Al Viro
user_access_begin/user_access_end() Lift stac/clac pairs from __futex_atomic_op{1,2} into arch_futex_atomic_op_inuser(), fold them with access_ok() in there. The switch in arch_futex_atomic_op_inuser() is what has required the previous (objtool) commit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27objtool: whitelist __sanitizer_cov_trace_switch()Al Viro
it's not really different from e.g. __sanitizer_cov_trace_cmp4(); as it is, the switches that generate an array of labels get rejected by objtool, while slightly different set of cases that gets compiled into a series of comparisons is accepted. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27[parisc, s390, sparc64] no need for access_ok() in futex handlingAl Viro
access_ok() is always true on those Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27sh: no need of access_ok() in arch_futex_atomic_op_inuser()Al Viro
everything it uses is doing access_ok() already Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27futex: arch_futex_atomic_op_inuser() calling conventions changeAl Viro
Move access_ok() in and pagefault_enable()/pagefault_disable() out. Mechanical conversion only - some instances don't really need a separate access_ok() at all (e.g. the ones only using get_user()/put_user(), or architectures where access_ok() is always true); we'll deal with that in followups. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27Merge branch 'cgroup-helpers'Alexei Starovoitov
Daniel Borkmann says: ==================== This adds various straight-forward helper improvements and additions to BPF cgroup based connect(), sendmsg(), recvmsg() and bind-related hooks which would allow to implement more fine-grained policies and improve current load balancer limitations we're seeing. For details please see individual patches. I've tested them on Kubernetes & Cilium and also added selftests for the small verifier extension. Thanks! ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-03-27bpf: Add selftest cases for ctx_or_null argument typeDaniel Borkmann
Add various tests to make sure the verifier keeps catching them: # ./test_verifier [...] #230/p pass ctx or null check, 1: ctx OK #231/p pass ctx or null check, 2: null OK #232/p pass ctx or null check, 3: 1 OK #233/p pass ctx or null check, 4: ctx - const OK #234/p pass ctx or null check, 5: null (connect) OK #235/p pass ctx or null check, 6: null (bind) OK #236/p pass ctx or null check, 7: ctx (bind) OK #237/p pass ctx or null check, 8: null (bind) OK [...] Summary: 1595 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/c74758d07b1b678036465ef7f068a49e9efd3548.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Enable retrival of pid/tgid/comm from bpf cgroup hooksDaniel Borkmann
We already have the bpf_get_current_uid_gid() helper enabled, and given we now have perf event RB output available for connect(), sendmsg(), recvmsg() and bind-related hooks, add a trivial change to enable bpf_get_current_pid_tgid() and bpf_get_current_comm() as well. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/18744744ed93c06343be8b41edcfd858706f39d7.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor idDaniel Borkmann
Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(), recvmsg() and bind-related hooks in order to retrieve the cgroup v2 context which can then be used as part of the key for BPF map lookups, for example. Given these hooks operate in process context 'current' is always valid and pointing to the app that is performing mentioned syscalls if it's subject to a v2 cgroup. Also with same motivation of commit 7723628101aa ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper") enable retrieval of ancestor from current so the cgroup id can be used for policy lookups which can then forbid connect() / bind(), for example. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Allow to retrieve cgroup v1 classid from v2 hooksDaniel Borkmann
Today, Kubernetes is still operating on cgroups v1, however, it is possible to retrieve the task's classid based on 'current' out of connect(), sendmsg(), recvmsg() and bind-related hooks for orchestrators which attach to the root cgroup v2 hook in a mixed env like in case of Cilium, for example, in order to then correlate certain pod traffic and use it as part of the key for BPF map lookups. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/555e1c69db7376c0947007b4951c260e1074efc3.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Add netns cookie and enable it for bpf cgroup hooksDaniel Borkmann
In Cilium we're mainly using BPF cgroup hooks today in order to implement kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*), ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic between Cilium managed nodes. While this works in its current shape and avoids packet-level NAT for inter Cilium managed node traffic, there is one major limitation we're facing today, that is, lack of netns awareness. In Kubernetes, the concept of Pods (which hold one or multiple containers) has been built around network namespaces, so while we can use the global scope of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing NodePort ports on loopback addresses), we also have the need to differentiate between initial network namespaces and non-initial one. For example, ExternalIP services mandate that non-local service IPs are not to be translated from the host (initial) network namespace as one example. Right now, we have an ugly work-around in place where non-local service IPs for ExternalIP services are not xlated from connect() and friends BPF hooks but instead via less efficient packet-level NAT on the veth tc ingress hook for Pod traffic. On top of determining whether we're in initial or non-initial network namespace we also have a need for a socket-cookie like mechanism for network namespaces scope. Socket cookies have the nice property that they can be combined as part of the key structure e.g. for BPF LRU maps without having to worry that the cookie could be recycled. We are planning to use this for our sessionAffinity implementation for services. Therefore, add a new bpf_get_netns_cookie() helper which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would provide the cookie for the initial network namespace while passing the context instead of NULL would provide the cookie from the application's network namespace. We're using a hole, so no size increase; the assignment happens only once. Therefore this allows for a comparison on initial namespace as well as regular cookie usage as we have today with socket cookies. We could later on enable this helper for other program types as well as we would see need. (*) Both externalTrafficPolicy={Local|Cluster} types [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Enable perf event rb output for bpf cgroup progsDaniel Borkmann
Currently, connect(), sendmsg(), recvmsg() and bind-related hooks are all lacking perf event rb output in order to push notifications or monitoring events up to user space. Back in commit a5a3a828cd00 ("bpf: add perf event notificaton support for sock_ops"), I've worked with Sowmini to enable them for sock_ops where the context part is not used (as opposed to skbs for example where the packet data can be appended). Make the bpf_sockopt_event_output() helper generic and enable it for mentioned hooks. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/69c39daf87e076b31e52473c902e9bfd37559124.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Enable retrieval of socket cookie for bind/post-bind hookDaniel Borkmann
We currently make heavy use of the socket cookie in BPF's connect(), sendmsg() and recvmsg() hooks for load-balancing decisions. However, it is currently not enabled/implemented in BPF {post-}bind hooks where it can later be used in combination for correlation in the tc egress path, for example. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/e9d71f310715332f12d238cc650c1edc5be55119.1585323121.git.daniel@iogearbox.net