Age | Commit message (Collapse) | Author |
|
Rewrite the generic REFCOUNT_FULL implementation so that the saturation
point is moved to INT_MIN / 2. This allows us to defer the sanity checks
until after the atomic operation, which removes many uses of cmpxchg()
in favour of atomic_fetch_{add,sub}().
Some crude perf results obtained from lkdtm show substantially less
overhead, despite the checking:
$ perf stat -r 3 -B -- echo {ATOMIC,REFCOUNT}_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
# arm64
ATOMIC_TIMING: 46.50451 +- 0.00134 seconds time elapsed ( +- 0.00% )
REFCOUNT_TIMING (REFCOUNT_FULL, mainline): 77.57522 +- 0.00982 seconds time elapsed ( +- 0.01% )
REFCOUNT_TIMING (REFCOUNT_FULL, this series): 48.7181 +- 0.0256 seconds time elapsed ( +- 0.05% )
# x86
ATOMIC_TIMING: 31.6225 +- 0.0776 seconds time elapsed ( +- 0.25% )
REFCOUNT_TIMING (!REFCOUNT_FULL, mainline/x86 asm): 31.6689 +- 0.0901 seconds time elapsed ( +- 0.28% )
REFCOUNT_TIMING (REFCOUNT_FULL, mainline): 53.203 +- 0.138 seconds time elapsed ( +- 0.26% )
REFCOUNT_TIMING (REFCOUNT_FULL, this series): 31.7408 +- 0.0486 seconds time elapsed ( +- 0.15% )
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Tested-by: Jan Glauber <jglauber@marvell.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191121115902.2551-6-will@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
<linux/refcount.h> header
In an effort to improve performance of the REFCOUNT_FULL implementation,
move the bulk of its functions into linux/refcount.h. This allows them
to be inlined in the same way as if they had been provided via
CONFIG_ARCH_HAS_REFCOUNT.
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191121115902.2551-5-will@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The full-fat refcount implementation is exposed via a set of functions
suffixed with "_checked()", the idea being that code can choose to use
the more expensive, yet more secure implementation on a case-by-case
basis.
In reality, this hasn't happened, so with a grand total of zero users,
let's remove the checked variants for now by simply dropping the suffix
and predicating the out-of-line functions on CONFIG_REFCOUNT_FULL=y.
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191121115902.2551-4-will@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
In preparation for changing the saturation point of REFCOUNT_FULL to
INT_MIN/2, change the type of integer operands passed into the API
from 'unsigned int' to 'int' so that we can avoid casting during
comparisons when we don't want to fall foul of C integral conversion
rules for signed and unsigned types.
Since the kernel is compiled with '-fno-strict-overflow', we don't need
to worry about the UB introduced by signed overflow here. Furthermore,
we're already making heavy use of the atomic_t API, which operates
exclusively on signed types.
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191121115902.2551-3-will@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The REFCOUNT_FULL implementation uses a different saturation point than
the x86 implementation, which means that the shared refcount code in
lib/refcount.c (e.g. refcount_dec_not_one()) needs to be aware of the
difference.
Rather than duplicate the definitions from the lkdtm driver, instead
move them into <linux/refcount.h> and update all references accordingly.
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191121115902.2551-2-will@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Given that we have BPF_MOD_NOP_TO_{CALL,JUMP}, BPF_MOD_{CALL,JUMP}_TO_NOP
and BPF_MOD_{CALL,JUMP}_TO_{CALL,JUMP} poke types and that we also pass in
old_addr as well as new_addr, it's a bit redundant and unnecessarily
complicates __bpf_arch_text_poke() itself since we can derive the same from
the *_addr that were passed in. Hence simplify and use BPF_MOD_{CALL,JUMP}
as types which also allows to clean up call-sites.
In addition to that, __bpf_arch_text_poke() currently verifies that text
matches expected old_insn before we invoke text_poke_bp(). Also add a check
on new_insn and skip rewrite if it already matches. Reason why this is rather
useful is that it avoids making any special casing in prog_array_map_poke_run()
when old and new prog were NULL and has the benefit that also for this case
we perform a check on text whether it really matches our expectations.
Suggested-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/fcb00a2b0b288d6c73de4ef58116a821c8fe8f2f.1574555798.git.daniel@iogearbox.net
|
|
Add a definition of bpf_jit_blinding_enabled() when CONFIG_BPF_JIT is not set
in order to fix a recent build regression:
[...]
CC kernel/bpf/verifier.o
CC kernel/bpf/inode.o
kernel/bpf/verifier.c: In function ‘fixup_bpf_calls’:
kernel/bpf/verifier.c:9132:25: error: implicit declaration of function ‘bpf_jit_blinding_enabled’; did you mean ‘bpf_jit_kallsyms_enabled’? [-Werror=implicit-function-declaration]
9132 | bool expect_blinding = bpf_jit_blinding_enabled(prog);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| bpf_jit_kallsyms_enabled
CC kernel/bpf/helpers.o
CC kernel/bpf/hashtab.o
[...]
Fixes: d2e4c1e6c294 ("bpf: Constant map key tracking for prog array pokes")
Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Reported-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/40baf8f3507cac4851a310578edfb98ce73b5605.1574541375.git.daniel@iogearbox.net
|
|
Add tracking of constant keys into tail call maps. The signature of
bpf_tail_call_proto is that arg1 is ctx, arg2 map pointer and arg3
is a index key. The direct call approach for tail calls can be enabled
if the verifier asserted that for all branches leading to the tail call
helper invocation, the map pointer and index key were both constant
and the same.
Tracking of map pointers we already do from prior work via c93552c443eb
("bpf: properly enforce index mask to prevent out-of-bounds speculation")
and 09772d92cd5a ("bpf: avoid retpoline for lookup/update/ delete calls
on maps").
Given the tail call map index key is not on stack but directly in the
register, we can add similar tracking approach and later in fixup_bpf_calls()
add a poke descriptor to the progs poke_tab with the relevant information
for the JITing phase.
We internally reuse insn->imm for the rewritten BPF_JMP | BPF_TAIL_CALL
instruction in order to point into the prog's poke_tab, and keep insn->imm
as 0 as indicator that current indirect tail call emission must be used.
Note that publishing to the tracker must happen at the end of fixup_bpf_calls()
since adding elements to the poke_tab reallocates its memory, so we need
to wait until its in final state.
Future work can generalize and add similar approach to optimize plain
array map lookups. Difference there is that we need to look into the key
value that sits on stack. For clarity in bpf_insn_aux_data, map_state
has been renamed into map_ptr_state, so we get map_{ptr,key}_state as
trackers.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/e8db37f6b2ae60402fa40216c96738ee9b316c32.1574452833.git.daniel@iogearbox.net
|
|
This work adds program tracking to prog array maps. This is needed such
that upon prog array updates/deletions we can fix up all programs which
make use of this tail call map. We add ops->map_poke_{un,}track()
helpers to maps to maintain the list of programs and ops->map_poke_run()
for triggering the actual update.
bpf_array_aux is extended to contain the list head and poke_mutex in
order to serialize program patching during updates/deletions.
bpf_free_used_maps() will untrack the program shortly before dropping
the reference to the map. For clearing out the prog array once all urefs
are dropped we need to use schedule_work() to have a sleepable context.
The prog_array_map_poke_run() is triggered during updates/deletions and
walks the maintained prog list. It checks in their poke_tabs whether the
map and key is matching and runs the actual bpf_arch_text_poke() for
patching in the nop or new jmp location. Depending on the type of update,
we use one of BPF_MOD_{NOP_TO_JUMP,JUMP_TO_NOP,JUMP_TO_JUMP}.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/1fb364bb3c565b3e415d5ea348f036ff379e779d.1574452833.git.daniel@iogearbox.net
|
|
Add initial poke table data structures and management to the BPF
prog that can later be used by JITs. Also add an instance of poke
specific data for tail call maps; plan for later work is to extend
this also for BPF static keys.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/1db285ec2ea4207ee0455b3f8e191a4fc58b9ade.1574452833.git.daniel@iogearbox.net
|
|
We're going to extend this with further information which is only
relevant for prog array at this point. Given this info is not used
in critical path, move it into its own structure such that the main
array map structure can be kept on diet.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/b9ddccdb0f6f7026489ee955f16c96381e1e7238.1574452833.git.daniel@iogearbox.net
|
|
We later on are going to need a sleepable context as opposed to plain
RCU callback in order to untrack programs we need to poke at runtime
and tracking as well as image update is performed under mutex.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/09823b1d5262876e9b83a8e75df04cf0467357a4.1574452833.git.daniel@iogearbox.net
|
|
Add BPF_MOD_{NOP_TO_JUMP,JUMP_TO_JUMP,JUMP_TO_NOP} patching for x86
JIT in order to be able to patch direct jumps or nop them out. We need
this facility in order to patch tail call jumps and in later work also
BPF static keys.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/aa4784196a8e5e985af4b30a4fe5336bce6e9643.1574452833.git.daniel@iogearbox.net
|
|
This patch is to fix a data-race reported by syzbot:
BUG: KCSAN: data-race in sctp_assoc_migrate / sctp_hash_obj
write to 0xffff8880b67c0020 of 8 bytes by task 18908 on cpu 1:
sctp_assoc_migrate+0x1a6/0x290 net/sctp/associola.c:1091
sctp_sock_migrate+0x8aa/0x9b0 net/sctp/socket.c:9465
sctp_accept+0x3c8/0x470 net/sctp/socket.c:4916
inet_accept+0x7f/0x360 net/ipv4/af_inet.c:734
__sys_accept4+0x224/0x430 net/socket.c:1754
__do_sys_accept net/socket.c:1795 [inline]
__se_sys_accept net/socket.c:1792 [inline]
__x64_sys_accept+0x4e/0x60 net/socket.c:1792
do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x44/0xa9
read to 0xffff8880b67c0020 of 8 bytes by task 12003 on cpu 0:
sctp_hash_obj+0x4f/0x2d0 net/sctp/input.c:894
rht_key_get_hash include/linux/rhashtable.h:133 [inline]
rht_key_hashfn include/linux/rhashtable.h:159 [inline]
rht_head_hashfn include/linux/rhashtable.h:174 [inline]
head_hashfn lib/rhashtable.c:41 [inline]
rhashtable_rehash_one lib/rhashtable.c:245 [inline]
rhashtable_rehash_chain lib/rhashtable.c:276 [inline]
rhashtable_rehash_table lib/rhashtable.c:316 [inline]
rht_deferred_worker+0x468/0xab0 lib/rhashtable.c:420
process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
worker_thread+0xa0/0x800 kernel/workqueue.c:2415
kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
It was caused by rhashtable access asoc->base.sk when sctp_assoc_migrate
is changing its value. However, what rhashtable wants is netns from asoc
base.sk, and for an asoc, its netns won't change once set. So we can
simply fix it by caching netns since created.
Fixes: d6c0256a60e6 ("sctp: add the rhashtable apis for sctp global transport hashtable")
Reported-by: syzbot+e3b35fe7918ff0ee474e@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2019-11-22
1) Misc Cleanups
2) Software steering support for Geneve
====================
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
Rename the mac_link_state() method to mac_pcs_get_state() to make it
clear that it should be returning the MACs PCS current state, which
is used for inband negotiation rather than just reading back what the
MAC has been configured for. Update the documentation to explicitly
mention that this is for inband.
We drop the return value as well; most of phylink doesn't check the
return value and it is not clear what it should do on error - instead
arrange for state->link to be false.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
These two functions have never been used since they were added.
Link: https://lore.kernel.org/r/20191113134528.21187-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The only two users of this are now converted to use mmu_interval_notifier,
delete all the code and update hmm.rst.
Link: https://lore.kernel.org/r/20191112202231.3856-14-jgg@ziepe.ca
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Replace the internal interval tree based mmu notifier with the new common
mmu_interval_notifier_insert() API. This removes a lot of code and fixes a
deadlock that can be triggered in ODP:
zap_page_range()
mmu_notifier_invalidate_range_start()
[..]
ib_umem_notifier_invalidate_range_start()
down_read(&per_mm->umem_rwsem)
unmap_single_vma()
[..]
__split_huge_page_pmd()
mmu_notifier_invalidate_range_start()
[..]
ib_umem_notifier_invalidate_range_start()
down_read(&per_mm->umem_rwsem) // DEADLOCK
mmu_notifier_invalidate_range_end()
up_read(&per_mm->umem_rwsem)
mmu_notifier_invalidate_range_end()
up_read(&per_mm->umem_rwsem)
The umem_rwsem is held across the range_start/end as the ODP algorithm for
invalidate_range_end cannot tolerate changes to the interval
tree. However, due to the nested invalidation regions the second
down_read() can deadlock if there are competing writers. The new core code
provides an alternative scheme to solve this problem.
Fixes: ca748c39ea3f ("RDMA/umem: Get rid of per_mm->notifier_count")
Link: https://lore.kernel.org/r/20191112202231.3856-6-jgg@ziepe.ca
Tested-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Only the function calls are stubbed out with static inlines that always
fail. This is the standard way to write a header for an optional component
and makes it easier for drivers that only optionally need HMM_MIRROR.
Link: https://lore.kernel.org/r/20191112202231.3856-5-jgg@ziepe.ca
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
hmm_mirror's handling of ranges does not use a sequence count which
results in this bug:
CPU0 CPU1
hmm_range_wait_until_valid(range)
valid == true
hmm_range_fault(range)
hmm_invalidate_range_start()
range->valid = false
hmm_invalidate_range_end()
range->valid = true
hmm_range_valid(range)
valid == true
Where the hmm_range_valid() should not have succeeded.
Adding the required sequence count would make it nearly identical to the
new mmu_interval_notifier. Instead replace the hmm_mirror stuff with
mmu_interval_notifier.
Co-existence of the two APIs is the first step.
Link: https://lore.kernel.org/r/20191112202231.3856-4-jgg@ziepe.ca
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Philip Yang <Philip.Yang@amd.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Of the 13 users of mmu_notifiers, 8 of them use only
invalidate_range_start/end() and immediately intersect the
mmu_notifier_range with some kind of internal list of VAs. 4 use an
interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
of some kind (scif_dma, vhost, gntdev, hmm)
And the remaining 5 either don't use invalidate_range_start() or do some
special thing with it.
It turns out that building a correct scheme with an interval tree is
pretty complicated, particularly if the use case is synchronizing against
another thread doing get_user_pages(). Many of these implementations have
various subtle and difficult to fix races.
This approach puts the interval tree as common code at the top of the mmu
notifier call tree and implements a shareable locking scheme.
It includes:
- An interval tree tracking VA ranges, with per-range callbacks
- A read/write locking scheme for the interval tree that avoids
sleeping in the notifier path (for OOM killer)
- A sequence counter based collision-retry locking scheme to tell
device page fault that a VA range is being concurrently invalidated.
This is based on various ideas:
- hmm accumulates invalidated VA ranges and releases them when all
invalidates are done, via active_invalidate_ranges count.
This approach avoids having to intersect the interval tree twice (as
umem_odp does) at the potential cost of a longer device page fault.
- kvm/umem_odp use a sequence counter to drive the collision retry,
via invalidate_seq
- a deferred work todo list on unlock scheme like RTNL, via deferred_list.
This makes adding/removing interval tree members more deterministic
- seqlock, except this version makes the seqlock idea multi-holder on the
write side by protecting it with active_invalidate_ranges and a spinlock
To minimize MM overhead when only the interval tree is being used, the
entire SRCU and hlist overheads are dropped using some simple
branches. Similarly the interval tree overhead is dropped when in hlist
mode.
The overhead from the mandatory spinlock is broadly the same as most of
existing users which already had a lock (or two) of some sort on the
invalidation path.
Link: https://lore.kernel.org/r/20191112202231.3856-3-jgg@ziepe.ca
Acked-by: Christian König <christian.koenig@amd.com>
Tested-by: Philip Yang <Philip.Yang@amd.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
There are no users of phy_ethtool_sset() in the kernel anymore, and
as of commit 3c1bcc8614db ("net: ethernet: Convert phydev advertize
and supported from u32 to link mode"), the implementation is slightly
buggy - it doesn't correctly check the masked advertising mask as it
used to.
Remove it, and update the phy documentation to refer to its replacement
function.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
This commit reverts commit 91e6015b082b ("bpf: Emit audit messages
upon successful prog load and unload") and its follow up commit
7599a896f2e4 ("audit: Move audit_log_task declaration under
CONFIG_AUDITSYSCALL") as requested by Paul Moore. The change needs
close review on linux-audit, tests etc.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
Adding 2 new functions -
1) struct trace_array *trace_array_get_by_name(const char *name);
Return pointer to a trace array with given name. If it does not exist,
create and return pointer to the new trace array.
2) int trace_array_set_clr_event(struct trace_array *tr,
const char *system ,const char *event, bool enable);
Enable/Disable events to this trace array.
Additionally,
- To handle reference counters, export trace_array_put()
- Due to introduction of the above 2 new functions, we no longer need to
export - ftrace_set_clr_event & trace_array_create APIs.
Link: http://lkml.kernel.org/r/1574276919-11119-2-git-send-email-divya.indi@oracle.com
Signed-off-by: Divya Indi <divya.indi@oracle.com>
Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
It's cleaner to use the BIT() macro instead of raw shift operation.
Link: http://lkml.kernel.org/r/20191121133815.15040-1-info@metux.net
Signed-off-by: Enrico Weigelt, metux IT consult <info@metux.net>
[ Added BIT() for bits 16 and 17 ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Minor conflict in drivers/s390/net/qeth_l2_main.c, kept the lock
from commit c8183f548902 ("s390/qeth: fix potential deadlock on
workqueue flush"), removed the code which was removed by commit
9897d583b015 ("s390/qeth: consolidate some duplicated HW cmd code").
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
|
Add YAML devicetree binding for Bitmain BM1880 SoC.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Rob Herring <robh@kernel.org>
Link: https://lkml.kernel.org/r/20191115162901.17456-4-manivannan.sadhasivam@linaro.org
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
Pull networking fixes from David Miller:
1) Validate tunnel options length in act_tunnel_key, from Xin Long.
2) Fix DMA sync bug in gve driver, from Adi Suresh.
3) TSO kills performance on some r8169 chips due to HW issues, disable
by default in that case, from Corinna Vinschen.
4) Fix clock disable mismatch in fec driver, from Chubong Yuan.
5) Fix interrupt status bits define in hns3 driver, from Huazhong Tan.
6) Fix workqueue deadlocks in qeth driver, from Julian Wiedmann.
7) Don't napi_disable() twice in r8152 driver, from Hayes Wang.
8) Fix SKB extension memory leak, from Florian Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
r8152: avoid to call napi_disable twice
MAINTAINERS: Add myself as maintainer of virtio-vsock
udp: drop skb extensions before marking skb stateless
net: rtnetlink: prevent underflows in do_setvfinfo()
can: m_can_platform: remove unnecessary m_can_class_resume() call
can: m_can_platform: set net_device structure as driver data
hv_netvsc: Fix send_table offset in case of a host bug
hv_netvsc: Fix offset usage in netvsc_send_table()
net-ipv6: IPV6_TRANSPARENT - check NET_RAW prior to NET_ADMIN
sfc: Only cancel the PPS workqueue if it exists
nfc: port100: handle command failure cleanly
net-sysfs: fix netdev_queue_add_kobject() breakage
r8152: Re-order napi_disable in rtl8152_close
net: qca_spi: Move reset_count to struct qcaspi
net: qca_spi: fix receive buffer size check
net/ibmvnic: Ignore H_FUNCTION return from H_EOI to tolerate XIVE mode
Revert "net/ibmvnic: Fix EOI when running in XIVE mode"
net/mlxfw: Verify FSM error code translation doesn't exceed array size
net/mlx5: Update the list of the PCI supported devices
net/mlx5: Fix auto group size calculation
...
|
|
Add definition for flex parser tunneling header for Geneve.
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
|
|
|
|
|
|
RPC tasks on the backchannel never invoke xprt_complete_rqst(), so
there is no way to report their tk_status at completion. Also, any
RPC task that exits via rpc_exit_task() before it is replied to will
also disappear without a trace.
Introduce a trace point that is symmetrical with rpc_task_begin that
captures the termination status of each RPC task.
Sample trace output for callback requests initiated on the server:
kworker/u8:12-448 [003] 127.025240: rpc_task_end: task:50@3 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpc_exit_task
kworker/u8:12-448 [002] 127.567310: rpc_task_end: task:51@3 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpc_exit_task
kworker/u8:12-448 [001] 130.506817: rpc_task_end: task:52@3 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpc_exit_task
Odd, though, that I never see trace_rpc_task_complete, either in the
forward or backchannel. Should it be removed?
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
In commit a82055af5959 ("netfilter: nft_payload: add VLAN offload
support"), VLAN fields in struct flow_dissector_key_vlan were unionized
with the intention of introducing another field that covered the whole TCI
header. However without a wrapping struct the subfields end up sharing the
same bits. As a result, "tc filter add ... flower vlan_id 14" specifies not
only vlan_id, but also vlan_priority.
Fix by wrapping the individual VLAN fields in a struct.
Fixes: a82055af5959 ("netfilter: nft_payload: add VLAN offload support")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Once udp stack has set the UDP_SKB_IS_STATELESS flag, later skb free
assumes all skb head state has been dropped already.
This will leak the extension memory in case the skb has extensions other
than the ipsec secpath, e.g. bridge nf data.
To fix this, set the UDP_SKB_IS_STATELESS flag only if we don't have
extensions or if the extension space can be free'd.
Fixes: 895b5c9f206eb7d25dc1360a ("netfilter: drop bridge nf reset from nf_reset")
Cc: Paolo Abeni <pabeni@redhat.com>
Reported-by: Byron Stanoszek <gandalf@winds.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Provide ability to get node and port GUIDs of VFs to be symmetrical
to already existing set option.
Signed-off-by: Danit Goldberg <danitg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
Introduce a new ndo: ndo_get_vf_guid, to get from the net
device the port and node GUID.
New applications can choose to use this interface to show
GUIDs with iproute2 with commands such as:
- ip link show ib4
ib4: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256
link/infiniband 00:00:0a:2d:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:44:36:8d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 link/infiniband 00:00:0a:2d:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:44:36:8d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff,
spoof checking off, NODE_GUID 22:44:33:00:33:11:00:33, PORT_GUID 10:21:33:12:00:11:22:10, link-state disable, trust off, query_rss off
Signed-off-by: Danit Goldberg <danitg@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
Add the support of the new PCM sync_stop ops in ASoC component.
It's optional and can be NULL unless you need the sync operation.
Acked-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20191121190709.29121-3-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
The previous commit added the ability to throttle stations when they queue
too much airtime in the hardware. This commit enables the functionality by
calculating the expected airtime usage of each packet that is dequeued from
the TXQs in mac80211, and accounting that as pending airtime.
The estimated airtime for each skb is stored in the tx_info, so we can
subtract the same amount from the running total when the skb is freed or
recycled. The throttling mechanism relies on this accounting to be
accurate (i.e., that we are not freeing skbs without subtracting any
airtime they were accounted for), so we put the subtraction into
ieee80211_report_used_skb(). As an optimisation, we also subtract the
airtime on regular TX completion, zeroing out the value stored in the
packet afterwards, to avoid having to do an expensive lookup of the station
from the packet data on every packet.
This patch does *not* include any mechanism to wake a throttled TXQ again,
on the assumption that this will happen anyway as a side effect of whatever
freed the skb (most commonly a TX completion).
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20191119060610.76681-5-kyan@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
In order for the Fq_CoDel algorithm integrated in mac80211 layer to operate
effectively to control excessive queueing latency, the CoDel algorithm
requires an accurate measure of how long packets stays in the queue, AKA
sojourn time. The sojourn time measured at the mac80211 layer doesn't
include queueing latency in the lower layer (firmware/hardware) and CoDel
expects lower layer to have a short queue. However, most 802.11ac chipsets
offload tasks such TX aggregation to firmware or hardware, thus have a deep
lower layer queue.
Without a mechanism to control the lower layer queue size, packets only
stay in mac80211 layer transiently before being sent to firmware queue.
As a result, the sojourn time measured by CoDel in the mac80211 layer is
almost always lower than the CoDel latency target, hence CoDel does little
to control the latency, even when the lower layer queue causes excessive
latency.
The Byte Queue Limits (BQL) mechanism is commonly used to address the
similar issue with wired network interface. However, this method cannot be
applied directly to the wireless network interface. "Bytes" is not a
suitable measure of queue depth in the wireless network, as the data rate
can vary dramatically from station to station in the same network, from a
few Mbps to over Gbps.
This patch implements an Airtime-based Queue Limit (AQL) to make CoDel work
effectively with wireless drivers that utilized firmware/hardware
offloading. AQL allows each txq to release just enough packets to the lower
layer to form 1-2 large aggregations to keep hardware fully utilized and
retains the rest of the frames in mac80211 layer to be controlled by the
CoDel algorithm.
Signed-off-by: Kan Yan <kyan@google.com>
[ Toke: Keep API to set pending airtime internal, fix nits in commit msg ]
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20191119060610.76681-4-kyan@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Felix recently added code to calculate airtime of packets to the mt76
driver. Import this into mac80211 so we can use it for airtime queue limit
calculations.
The airtime.c file is copied verbatim from the mt76 driver, and adjusted to
be usable in mac80211. This involves:
- Switching to mac80211 data structures.
- Adding support for 160 MHz channels and HE mode.
- Moving the symbol and duration calculations around a bit to avoid
rounding with the higher rates and longer symbol times used for HE rates.
The per-rate TX rate calculation is also split out to its own function so
it can be used directly for the AQL calculations later.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20191119060610.76681-3-kyan@google.com
[fix HE_GROUP_IDX() to use 3 * bw, since there are 3 _gi values]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The API will be used by the hv_balloon and hv_vmbus drivers.
Balloon up/down and hot-add of memory must not be active if the user
wants the Linux VM to support hibernation, because they are incompatible
with hibernation according to Hyper-V team, e.g. upon suspend the
balloon VSP doesn't save any info about the ballooned-out pages (if any);
so, after Linux resumes, Linux balloon VSC expects that the VSP will
return the pages if Linux is under memory pressure, but the VSP will
never do that, since the VSP thinks it never stole the pages from the VM.
So, if the user wants Linux VM to support hibernation, Linux must forbid
balloon up/down and hot-add, and the only functionality of the balloon VSC
driver is reporting the VM's memory pressure to the host.
Ideally, when Linux detects that the user wants it to support hibernation,
the balloon VSC should tell the VSP that it does not support ballooning
and hot-add. However, the current version of the VSP requires the VSC
should support these capabilities, otherwise the capability negotiation
fails and the VSC can not load at all, so with the later changes to the
VSC driver, Linux VM still reports to the VSP that the VSC supports these
capabilities, but the VSC ignores the VSP's requests of balloon up/down
and hot add, and reports an error to the VSP, when applicable. BTW, in
the future the balloon VSP driver will allow the VSC to not support the
capabilities of balloon up/down and hot add.
The ACPI S4 state is not a must for hibernation to work, because Linux is
able to hibernate as long as the system can shut down. However in practice
we decide to artificially use the presence of the virtual ACPI S4 state as
an indicator of the user's intent of using hibernation, because Linux VM
must find a way to know if the user wants to use the hibernation feature
or not.
By default, Hyper-V does not enable the virtual ACPI S4 state; on recent
Hyper-V hosts (e.g. RS5, 19H1), the administrator is able to enable the
state for a VM by WMI commands.
Once all the vmbus and VSC patches for the hibernation feature are
accepted, an extra patch will be submitted to forbid hibernation if the
virtual ACPI S4 state is absent, i.e. hv_is_hibernation_supported() is
false.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
Introduce user specified latency in the packet reception path
By exposing the test parameters as part of the debugfs channel
attributes. We will control the testing state via these attributes.
Signed-off-by: Branden Bonaby <brandonbonaby94@gmail.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
Hyper-V has added VMBus protocol versions 5.1 and 5.2 in recent release
versions. Allow Linux guests to negotiate these new protocol versions
on versions of Hyper-V that support them. While on this, also allow
guests to negotiate the VMBus protocol version 4.1 (which was missing).
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
The technique used to get the next VMBus version seems increasisly
clumsy as the number of VMBus versions increases. Performance is
not a concern since this is only done once during system boot; it's
just that we'll end up with more lines of code than is really needed.
As an alternative, introduce a table with the version numbers listed
in order (from the most recent to the oldest). vmbus_connect() loops
through the versions listed in the table until it gets an accepted
connection or gets to the end of the table (invalid version).
Suggested-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
This is alike the previous change, with some additional ipv4 specific
quirk. Even when using the route hint we still have to do perform
additional per packet checks about source address validity: a new
helper is added to wrap them.
Hints are explicitly disabled if the destination is a local broadcast,
that keeps the code simple and local broadcast are a slower path anyway.
UDP flood performances vs recvmmsg() receiver:
vanilla patched delta
Kpps Kpps %
1683 1871 +11
In the worst case scenario - each packet has a different
destination address - the performance delta is within noise
range.
v3 -> v4:
- re-enable hints for forward
v2 -> v3:
- really fix build (sic) and hint usage check
- use fib4_has_custom_rules() helpers (David A.)
- add ip_extract_route_hint() helper (Edward C.)
- use prev skb as hint instead of copying data (Willem)
v1 -> v2:
- fix build issue with !CONFIG_IP_MULTIPLE_TABLES
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|