Age | Commit message (Collapse) | Author |
|
Change type of action reference counter to refcount_t.
Change type of action bind counter to atomic_t.
This type is used to allow decrementing bind counter without testing
for 0 result.
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Implement functions to atomically update and free action cookie
using rcu mechanism.
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add 'clone' action to kernel datapath by using existing functions.
When actions within clone don't modify the current flow, the flow
key is not cloned before executing clone actions.
This is a follow up patch for this incomplete work:
https://patchwork.ozlabs.org/patch/722096/
v1 -> v2:
Refactor as advised by reviewer.
Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Otherwise we end up with attempting to send packets from down devices
or to send oversized packets, which may cause unexpected driver/device
behaviour. Generic XDP has already done this check, so reuse the logic
in native XDP.
Fixes: 814abfabef3c ("xdp: add bpf_redirect helper function")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In commit
'bpf: bpf_compute_data uses incorrect cb structure' (8108a7751512)
we added the routine bpf_compute_data_end_sk_skb() to compute the
correct data_end values, but this has since been lost. In kernel
v4.14 this was correct and the above patch was applied in it
entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree
we lost the piece that renamed bpf_compute_data_pointers to the
new function bpf_compute_data_end_sk_skb. This was done here,
e1ea2f9856b7 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
When it conflicted with the following rename patch,
6aaae2b6c433 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
Finally, after a refactor I thought even the function
bpf_compute_data_end_sk_skb() was no longer needed and it was
erroneously removed.
However, we never reverted the sk_skb_convert_ctx_access() usage of
tcp_skb_cb which had been committed and survived the merge conflict.
Here we fix this by adding back the helper and *_data_end_sk_skb()
usage. Using the bpf_skc_data_end mapping is not correct because it
expects a qdisc_skb_cb object but at the sock layer this is not the
case. Even though it happens to work here because we don't overwrite
any data in-use at the socket layer and the cb structure is cleared
later this has potential to create some subtle issues. But, even
more concretely the filter.c access check uses tcp_skb_cb.
And by some act of chance though,
struct bpf_skb_data_end {
struct qdisc_skb_cb qdisc_cb; /* 0 28 */
/* XXX 4 bytes hole, try to pack */
void * data_meta; /* 32 8 */
void * data_end; /* 40 8 */
/* size: 48, cachelines: 1, members: 3 */
/* sum members: 44, holes: 1, sum holes: 4 */
/* last cacheline: 48 bytes */
};
and then tcp_skb_cb,
struct tcp_skb_cb {
[...]
struct {
__u32 flags; /* 24 4 */
struct sock * sk_redir; /* 32 8 */
void * data_end; /* 40 8 */
} bpf; /* 24 */
};
So when we use offset_of() to track down the byte offset we get 40 in
either case and everything continues to work. Fix this mess and use
correct structures its unclear how long this might actually work for
until someone moves the structs around.
Reported-by: Martin KaFai Lau <kafai@fb.com>
Fixes: e1ea2f9856b7 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Fixes: 6aaae2b6c433 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
We are hitting a regression with the following commit:
commit a93e7b331568227500186a465fee3c2cb5dffd1f
Author: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Date: Mon May 14 13:32:23 2018 +1200
uio: Prevent device destruction while fds are open
The problem is the addition of spin_lock_irqsave in uio_write. This
leads to hitting uio_write -> copy_from_user -> _copy_from_user ->
might_fault and the logs filling up with sleeping warnings.
I also noticed some uio drivers allocate memory, sleep, grab mutexes
from callouts like open() and release and uio is now doing
spin_lock_irqsave while calling them.
Reported-by: Mike Christie <mchristi@redhat.com>
CC: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Reviewed-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
the control action in the common member of struct tcf_tunnel_key must be a
valid value, as it can contain the chain index when 'goto chain' is used.
Ensure that the control action can be read as x->tcfa_action, when x is a
pointer to struct tc_action and x->ops->type is TCA_ACT_TUNNEL_KEY, to
prevent the following command:
# tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \
> $tcflags dst_mac $h2mac action tunnel_key unset goto chain 1
from causing a NULL dereference when a matching packet is received:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
PGD 80000001097ac067 P4D 80000001097ac067 PUD 103b0a067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 3491 Comm: mausezahn Tainted: G E 4.18.0-rc2.auguri+ #421
Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 02/07/2013
RIP: 0010:tcf_action_exec+0xb8/0x100
Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
RSP: 0018:ffff95145ea03c40 EFLAGS: 00010246
RAX: 0000000020000001 RBX: ffff9514499e5800 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
RBP: ffff95145ea03e60 R08: 0000000000000000 R09: ffff95145ea03c9c
R10: ffff95145ea03c78 R11: 0000000000000008 R12: ffff951456a69800
R13: ffff951456a69808 R14: 0000000000000001 R15: ffff95144965ee40
FS: 00007fd67ee11740(0000) GS:ffff95145ea00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001038a2006 CR4: 00000000001606f0
Call Trace:
<IRQ>
fl_classify+0x1ad/0x1c0 [cls_flower]
? __update_load_avg_se.isra.47+0x1ca/0x1d0
? __update_load_avg_se.isra.47+0x1ca/0x1d0
? update_load_avg+0x665/0x690
? update_load_avg+0x665/0x690
? kmem_cache_alloc+0x38/0x1c0
tcf_classify+0x89/0x140
__netif_receive_skb_core+0x5ea/0xb70
? enqueue_entity+0xd0/0x270
? process_backlog+0x97/0x150
process_backlog+0x97/0x150
net_rx_action+0x14b/0x3e0
__do_softirq+0xde/0x2b4
do_softirq_own_stack+0x2a/0x40
</IRQ>
do_softirq.part.18+0x49/0x50
__local_bh_enable_ip+0x49/0x50
__dev_queue_xmit+0x4ab/0x8a0
? wait_woken+0x80/0x80
? packet_sendmsg+0x38f/0x810
? __dev_queue_xmit+0x8a0/0x8a0
packet_sendmsg+0x38f/0x810
sock_sendmsg+0x36/0x40
__sys_sendto+0x10e/0x140
? do_vfs_ioctl+0xa4/0x630
? syscall_trace_enter+0x1df/0x2e0
? __audit_syscall_exit+0x22a/0x290
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fd67e18dc93
Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24
RSP: 002b:00007ffe0189b748 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00000000020ca010 RCX: 00007fd67e18dc93
RDX: 0000000000000062 RSI: 00000000020ca322 RDI: 0000000000000003
RBP: 00007ffe0189b780 R08: 00007ffe0189b760 R09: 0000000000000014
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000062
R13: 00000000020ca322 R14: 00007ffe0189b760 R15: 0000000000000003
Modules linked in: act_tunnel_key act_gact cls_flower sch_ingress vrf veth act_csum(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek coretemp snd_hda_codec_generic kvm_intel kvm irqbypass snd_hda_intel crct10dif_pclmul crc32_pclmul hp_wmi ghash_clmulni_intel pcbc snd_hda_codec aesni_intel sparse_keymap rfkill snd_hda_core snd_hwdep snd_seq crypto_simd iTCO_wdt gpio_ich iTCO_vendor_support wmi_bmof cryptd mei_wdt glue_helper snd_seq_device snd_pcm pcspkr snd_timer snd i2c_i801 lpc_ich sg soundcore wmi mei_me
mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod cdrom i915 video i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_intel libahci serio_raw sfc libata mtd drm ixgbe mdio i2c_core e1000e dca
CR2: 0000000000000000
---[ end trace 1ab8b5b5d4639dfc ]---
RIP: 0010:tcf_action_exec+0xb8/0x100
Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
RSP: 0018:ffff95145ea03c40 EFLAGS: 00010246
RAX: 0000000020000001 RBX: ffff9514499e5800 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
RBP: ffff95145ea03e60 R08: 0000000000000000 R09: ffff95145ea03c9c
R10: ffff95145ea03c78 R11: 0000000000000008 R12: ffff951456a69800
R13: ffff951456a69808 R14: 0000000000000001 R15: ffff95144965ee40
FS: 00007fd67ee11740(0000) GS:ffff95145ea00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001038a2006 CR4: 00000000001606f0
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x11400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
Fixes: d0f6dd8a914f ("net/sched: Introduce act_tunnel_key")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
the control action in the common member of struct tcf_csum must be a valid
value, as it can contain the chain index when 'goto chain' is used. Ensure
that the control action can be read as x->tcfa_action, when x is a pointer
to struct tc_action and x->ops->type is TCA_ACT_CSUM, to prevent the
following command:
# tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \
> $tcflags dst_mac $h2mac action csum ip or tcp or udp or sctp goto chain 1
from triggering a NULL pointer dereference when a matching packet is
received.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
PGD 800000010416b067 P4D 800000010416b067 PUD 1041be067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 3072 Comm: mausezahn Tainted: G E 4.18.0-rc2.auguri+ #421
Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 02/07/2013
RIP: 0010:tcf_action_exec+0xb8/0x100
Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
RSP: 0018:ffffa020dea03c40 EFLAGS: 00010246
RAX: 0000000020000001 RBX: ffffa020d7ccef00 RCX: 0000000000000054
RDX: 0000000000000000 RSI: ffffa020ca5ae000 RDI: ffffa020d7ccef00
RBP: ffffa020dea03e60 R08: 0000000000000000 R09: ffffa020dea03c9c
R10: ffffa020dea03c78 R11: 0000000000000008 R12: ffffa020d3fe4f00
R13: ffffa020d3fe4f08 R14: 0000000000000001 R15: ffffa020d53ca300
FS: 00007f5a46942740(0000) GS:ffffa020dea00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000104218002 CR4: 00000000001606f0
Call Trace:
<IRQ>
fl_classify+0x1ad/0x1c0 [cls_flower]
? arp_rcv+0x121/0x1b0
? __x2apic_send_IPI_dest+0x40/0x40
? smp_reschedule_interrupt+0x1c/0xd0
? reschedule_interrupt+0xf/0x20
? reschedule_interrupt+0xa/0x20
? device_is_rmrr_locked+0xe/0x50
? iommu_should_identity_map+0x49/0xd0
? __intel_map_single+0x30/0x140
? e1000e_update_rdt_wa.isra.52+0x22/0xb0 [e1000e]
? e1000_alloc_rx_buffers+0x233/0x250 [e1000e]
? kmem_cache_alloc+0x38/0x1c0
tcf_classify+0x89/0x140
__netif_receive_skb_core+0x5ea/0xb70
? enqueue_task_fair+0xb6/0x7d0
? process_backlog+0x97/0x150
process_backlog+0x97/0x150
net_rx_action+0x14b/0x3e0
__do_softirq+0xde/0x2b4
do_softirq_own_stack+0x2a/0x40
</IRQ>
do_softirq.part.18+0x49/0x50
__local_bh_enable_ip+0x49/0x50
__dev_queue_xmit+0x4ab/0x8a0
? wait_woken+0x80/0x80
? packet_sendmsg+0x38f/0x810
? __dev_queue_xmit+0x8a0/0x8a0
packet_sendmsg+0x38f/0x810
sock_sendmsg+0x36/0x40
__sys_sendto+0x10e/0x140
? do_vfs_ioctl+0xa4/0x630
? syscall_trace_enter+0x1df/0x2e0
? __audit_syscall_exit+0x22a/0x290
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5a45cbec93
Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24
RSP: 002b:00007ffd0ee6d748 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000001161010 RCX: 00007f5a45cbec93
RDX: 0000000000000062 RSI: 0000000001161322 RDI: 0000000000000003
RBP: 00007ffd0ee6d780 R08: 00007ffd0ee6d760 R09: 0000000000000014
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000062
R13: 0000000001161322 R14: 00007ffd0ee6d760 R15: 0000000000000003
Modules linked in: act_csum act_gact cls_flower sch_ingress vrf veth act_tunnel_key(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_hdmi snd_hda_codec_realtek kvm snd_hda_codec_generic hp_wmi iTCO_wdt sparse_keymap rfkill mei_wdt iTCO_vendor_support wmi_bmof gpio_ich irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_hda_intel crypto_simd cryptd snd_hda_codec glue_helper snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm pcspkr i2c_i801 snd_timer snd sg lpc_ich soundcore wmi mei_me
mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod ahci libahci crc32c_intel i915 ixgbe serio_raw libata video dca i2c_algo_bit sfc drm_kms_helper syscopyarea mtd sysfillrect mdio sysimgblt fb_sys_fops drm e1000e i2c_core
CR2: 0000000000000000
---[ end trace 3c9e9d1a77df4026 ]---
RIP: 0010:tcf_action_exec+0xb8/0x100
Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
RSP: 0018:ffffa020dea03c40 EFLAGS: 00010246
RAX: 0000000020000001 RBX: ffffa020d7ccef00 RCX: 0000000000000054
RDX: 0000000000000000 RSI: ffffa020ca5ae000 RDI: ffffa020d7ccef00
RBP: ffffa020dea03e60 R08: 0000000000000000 R09: ffffa020dea03c9c
R10: ffffa020dea03c78 R11: 0000000000000008 R12: ffffa020d3fe4f00
R13: ffffa020d3fe4f08 R14: 0000000000000001 R15: ffffa020d53ca300
FS: 00007f5a46942740(0000) GS:ffffa020dea00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000104218002 CR4: 00000000001606f0
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x26400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
Fixes: 9c5f69bbd75a ("net/sched: act_csum: don't use spinlock in the fast path")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As support dissecting of QinQ inner and outer vlan headers, user can
add rules to match on QinQ vlan headers.
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Dissect the QinQ packets to get both outer and inner vlan information,
then store to the extended flow keys.
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Change vlan dissector key to save vlan tpid to support both 802.1Q
and 802.1AD ethertype.
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A newly added dummy helper function tries to return a failure from a "void"
function:
In file included from include/net/dsa.h:24,
from arch/arm/plat-orion/common.c:21:
include/net/devlink.h: In function 'devlink_param_value_changed':
include/net/devlink.h:771:9: error: 'return' with a value, in function returning void [-Werror]
return -EOPNOTSUPP;
This fixes it by removing the bogus statement.
Fixes: ea601e170988 ("devlink: Add devlink notifications support for params")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
These two functions return the regular -EINVAL failure in the normal
code path, but return a nonstandard '-1' error otherwise, which gets
interpreted as -EPERM.
Let's change it to -EINVAL for the dummy functions as well.
Fixes: 4d4fd36126d6 ("net: bridge: Publish bridge accessor functions")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly
after modification by __sock_cmsg_send, by calling sock_tx_timestamp.
The IPv4 and IPv6 paths do this conversion differently. In IPv4, the
individual protocols that support tx timestamps call this function
and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is
called in __ip6_append_data.
There is no need to store both tx_flags and ts_flags in the cookie
as one is derived from the other. Convert when setting up the cork
and remove the redundant field. This is similar to IPv6, only have
the conversion happen only once per datagram, in ip(6)_setup_cork.
Also change __ip6_append_data to match __ip_append_data. Only update
tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is
redundant: only valid protocols can have non-zero cork->tx_flags.
After this change the IPv4 and IPv6 logic is the same.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ipcm_cookie includes sockcm_cookie. Do the same for ipcm6_cookie.
This reduces the number of arguments that need to be passed around,
applies ipcm6_init to all cookie fields at once and reduces code
differentiation between ipv4 and ipv6.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba19978 ("ip: limit use of gso_size to udp").
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba19978 ("ip: limit use of gso_size to udp").
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba19978 ("ip: limit use of gso_size to udp").
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The new added "reciprocal_value_adv" implements the advanced version of the
algorithm described in Figure 4.2 of the paper except when
"divisor > (1U << 31)" whose ceil(log2(d)) result will be 32 which then
requires u128 divide on host. The exception case could be easily handled
before calling "reciprocal_value_adv".
The advanced version requires more complex calculation to get the
reciprocal multiplier and other control variables, but then could reduce
the required emulation operations.
It makes no sense to use this advanced version for host divide emulation,
those extra complexities for calculating multiplier etc could completely
waive our saving on emulation operations.
However, it makes sense to use it for JIT divide code generation (for
example eBPF JIT backends) for which we are willing to trade performance of
JITed code with that of host. As shown by the following pseudo code, the
required emulation operations could go down from 6 (the basic version) to 3
or 4.
To use the result of "reciprocal_value_adv", suppose we want to calculate
n/d, the C-style pseudo code will be the following, it could be easily
changed to real code generation for other JIT targets.
struct reciprocal_value_adv rvalue;
u8 pre_shift, exp;
// handle exception case.
if (d >= (1U << 31)) {
result = n >= d;
return;
}
rvalue = reciprocal_value_adv(d, 32)
exp = rvalue.exp;
if (rvalue.is_wide_m && !(d & 1)) {
// floor(log2(d & (2^32 -d)))
pre_shift = fls(d & -d) - 1;
rvalue = reciprocal_value_adv(d >> pre_shift, 32 - pre_shift);
} else {
pre_shift = 0;
}
// code generation starts.
if (imm == 1U << exp) {
result = n >> exp;
} else if (rvalue.is_wide_m) {
// pre_shift must be zero when reached here.
t = (n * rvalue.m) >> 32;
result = n - t;
result >>= 1;
result += t;
result >>= rvalue.sh - 1;
} else {
if (pre_shift)
result = n >> pre_shift;
result = ((u64)result * rvalue.m) >> 32;
result >>= rvalue.sh;
}
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"The usual collection of driver fixlets:
- build cleanup/fix for the sunxi makefile that tried to save size
but failed and prevented dead code elimination from working
- two Davinci clk driver fixes for a typo causing build failures in
different configurations and an error check that checks the wrong
variable.
- undo the DT ABI breaking imx6ul binding header shuffle that got
merged this cycle"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
dt-bindings: clock: imx6ul: Do not change the clock definition order
clk: davinci: fix a typo (which leads to build failures)
clk: davinci: cfgchip: testing the wrong variable
clk: sunxi-ng: replace lib-y with obj-y
|
|
This patch disallows rbtree with single elements, which is causing
problems with the recent timeout support. Before this patch, you
could opt out individual set representations per module, which is
just adding extra complexity.
Fixes: 8d8540c4f5e0("netfilter: nft_set_rbtree: add timeout support")
Reported-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The current implementation of cfg80211_rx_control_port assumed that the
caller could provide a contiguous region of memory for the control port
frame to be sent up to userspace. Unfortunately, many drivers produce
non-linear skbs, especially for data frames. This resulted in userspace
getting notified of control port frames with correct metadata (from
address, port, etc) yet garbage / nonsense contents, resulting in bad
handshakes, disconnections, etc.
mac80211 linearizes skbs containing management frames. But it didn't
seem worthwhile to do this for control port frames. Thus the signature
of cfg80211_rx_control_port was changed to take the skb directly.
nl80211 then takes care of obtaining control port frame data directly
from the (linear | non-linear) skb.
The caller is still responsible for freeing the skb,
cfg80211_rx_control_port does not take ownership of it.
Fixes: 6a671a50f819 ("nl80211: Add CMD_CONTROL_PORT_FRAME API")
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
[fix some kernel-doc formatting, add fixes tag]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This patch fixes a silent out-of-bound read possibility that was present
because of the misuse of this function.
Mostly it was called with a struct udphdr *hp which had only the udphdr
part linearized by the skb_header_pointer, however
nf_tproxy_get_sock_v{4,6} uses it as a tcphdr pointer, so some reads for
tcp specific attributes may be invalid.
Fixes: a583636a83ea ("inet: refactor inet[6]_lookup functions to take skb")
Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes and cleanups from Steven Rostedt:
"While cleaning out my INBOX, I found a few patches that were lost in
the noise. These are minor bug fixes and clean ups. Those include:
- avoid a string overflow
- code that didn't match the comment (but should)
- a small code optimization (use of a conditional)
- quiet printf warnings
- nuke unused code
- fix function graph interrupt annotation"
* tag 'trace-v4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix missing return symbol in function_graph output
ftrace: Nuke clear_ftrace_function
tracing: Use __printf markup to silence compiler
tracing: Optimize trace_buffer_iter() logic
tracing: Make create_filter() code match the comments
tracing: Avoid string overflow
|
|
Essentially the same as the ipv4 equivalents.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
At present the ipv6_renew_options_kern() function ends up calling into
access_ok() which is problematic if done from inside an interrupt as
access_ok() calls WARN_ON_IN_IRQ() on some (all?) architectures
(x86-64 is affected). Example warning/backtrace is shown below:
WARNING: CPU: 1 PID: 3144 at lib/usercopy.c:11 _copy_from_user+0x85/0x90
...
Call Trace:
<IRQ>
ipv6_renew_option+0xb2/0xf0
ipv6_renew_options+0x26a/0x340
ipv6_renew_options_kern+0x2c/0x40
calipso_req_setattr+0x72/0xe0
netlbl_req_setattr+0x126/0x1b0
selinux_netlbl_inet_conn_request+0x80/0x100
selinux_inet_conn_request+0x6d/0xb0
security_inet_conn_request+0x32/0x50
tcp_conn_request+0x35f/0xe00
? __lock_acquire+0x250/0x16c0
? selinux_socket_sock_rcv_skb+0x1ae/0x210
? tcp_rcv_state_process+0x289/0x106b
tcp_rcv_state_process+0x289/0x106b
? tcp_v6_do_rcv+0x1a7/0x3c0
tcp_v6_do_rcv+0x1a7/0x3c0
tcp_v6_rcv+0xc82/0xcf0
ip6_input_finish+0x10d/0x690
ip6_input+0x45/0x1e0
? ip6_rcv_finish+0x1d0/0x1d0
ipv6_rcv+0x32b/0x880
? ip6_make_skb+0x1e0/0x1e0
__netif_receive_skb_core+0x6f2/0xdf0
? process_backlog+0x85/0x250
? process_backlog+0x85/0x250
? process_backlog+0xec/0x250
process_backlog+0xec/0x250
net_rx_action+0x153/0x480
__do_softirq+0xd9/0x4f7
do_softirq_own_stack+0x2a/0x40
</IRQ>
...
While not present in the backtrace, ipv6_renew_option() ends up calling
access_ok() via the following chain:
access_ok()
_copy_from_user()
copy_from_user()
ipv6_renew_option()
The fix presented in this patch is to perform the userspace copy
earlier in the call chain such that it is only called when the option
data is actually coming from userspace; that place is
do_ipv6_setsockopt(). Not only does this solve the problem seen in
the backtrace above, it also allows us to simplify the code quite a
bit by removing ipv6_renew_options_kern() completely. We also take
this opportunity to cleanup ipv6_renew_options()/ipv6_renew_option()
a small amount as well.
This patch is heavily based on a rough patch by Al Viro. I've taken
his original patch, converted a kmemdup() call in do_ipv6_setsockopt()
to a memdup_user() call, made better use of the e_inval jump target in
the same function, and cleaned up the use ipv6_renew_option() by
ipv6_renew_options().
CC: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
enable_sriov - Enables Single-Root Input/Output Virtualization(SR-IOV)
characteristic of the device.
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add 2 first generic parameters to devlink configuration parameters set:
internal_err_reset - When set enables reset device on internal errors.
max_macs - max number of MACs per ETH port.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add devlink_param_notify() function to support devlink param notifications.
Add notification call to devlink param set, register and unregister
functions.
Add devlink_param_value_changed() function to enable the driver notify
devlink on value change. Driver should use this function after value was
changed on any configuration mode part to driverinit.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
"driverinit" configuration mode value is held by devlink to enable
the driver query the value after reload. Two additional functions
added to help the driver get/set the value from/to devlink:
devlink_param_driverinit_value_set() and
devlink_param_driverinit_value_get().
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add param set command to set value for a parameter.
Value can be set to any of the supported configuration modes.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add param get command which gets data per parameter.
Option to dump the parameters data per device.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Define configuration parameters data structure.
Add functions to register and unregister the driver supported
configuration parameters table.
For each parameter registered, the driver should fill all the parameter's
fields. In case the only supported configuration mode is "driverinit"
the parameter's get()/set() functions are not required and should be set
to NULL, for any other configuration mode, these functions are required
and should be set by the driver.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After commit 07d78363dcff ("net: Convert NAPI gro list into a small hash
table.")' there is 8 hash buckets, which allows more flows to be held for
merging. but MAX_GRO_SKBS, the total held skb for merging, is 8 skb still,
limit the hash table performance.
keep MAX_GRO_SKBS as 8 skb, but limit each hash list length to 8 skb, not
the total 8 skb
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
MLX5 IB HCA offers the memory key, dump_fill_mkey to boost
performance by forcing local HCA operations to skip the PCI bus
access,
This patch adds needed hardware definitions.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
mlx5_core_dump_fill_mkey() is going to be used in next
patch in IB and doesn't need to be visible to whole
mlx5_core. Move that command to mlx5_ib.
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
Use the socket error queue for reporting dropped packets if the
socket has enabled that feature through the SO_TXTIME API.
Packets are dropped either on enqueue() if they aren't accepted by the
qdisc or on dequeue() if the system misses their deadline. Those are
reported as different errors so applications can react accordingly.
Userspace can retrieve the errors through the socket error queue and the
corresponding cmsg interfaces. A struct sock_extended_err* is used for
returning the error data, and the packet's timestamp can be retrieved by
adding both ee_data and ee_info fields as e.g.:
((__u64) serr->ee_data << 32) + serr->ee_info
This feature is disabled by default and must be explicitly enabled by
applications. Enabling it can bring some overhead for the Tx cycles
of the application.
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add infra so etf qdisc supports HW offload of time-based transmission.
For hw offload, the time sorted list is still used, so packets are
dequeued always in order of txtime.
Example:
$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
$ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 100000 \
clockid CLOCK_REALTIME
In this example, the Qdisc will use HW offload for the control of the
transmission time through the network adapter. The hrtimer used for
packets scheduling inside the qdisc will use the clockid CLOCK_REALTIME
as reference and packets leave the Qdisc "delta" (100000) nanoseconds
before their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by the hrtimer, the system clock
and the PHC clock must be synchronized for this mode to behave as
expected.
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The ETF (Earliest TxTime First) qdisc uses the information added
earlier in this series (the socket option SO_TXTIME and the new
role of sk_buff->tstamp) to schedule packets transmission based
on absolute time.
For some workloads, just bandwidth enforcement is not enough, and
precise control of the transmission of packets is necessary.
Example:
$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
$ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \
clockid CLOCK_TAI
In this example, the Qdisc will provide SW best-effort for the control
of the transmission time to the network adapter, the time stamp in the
socket will be in reference to the clockid CLOCK_TAI and packets
will leave the qdisc "delta" (100000) nanoseconds before its transmission
time.
The ETF qdisc will buffer packets sorted by their txtime. It will drop
packets on enqueue() if their skbuff clockid does not match the clock
reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
if it expires while being enqueued.
The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
will dequeue a packet as soon as possible and change the skb timestamp
to 'now' during etf_dequeue().
Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid
to be configured, but it's been decided that usage of CLOCK_TAI should
be enforced until we decide to allow for other clockids to be used.
The rationale here is that PTP times are usually in the TAI scale, thus
no other clocks should be necessary. For now, the qdisc will return
EINVAL if any clocks other than CLOCK_TAI are used.
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This adds 'qdisc_watchdog_init_clockid()' that allows a clockid to be
passed, this allows other time references to be used when scheduling
the Qdisc to run.
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a transmit_time field to struct inet_cork, then copy the
timestamp from the CMSG cookie at ip_setup_cork() so we can
safely copy it into the skb later during __ip_make_skb().
For the raw fast path, just perform the copy at raw_send_hdrinc().
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch introduces SO_TXTIME. User space enables this option in
order to pass a desired future transmit time in a CMSG when calling
sendmsg(2). The argument to this socket option is a 8-bytes long struct
provided by the uapi header net_tstamp.h defined as:
struct sock_txtime {
clockid_t clockid;
u32 flags;
};
Note that new fields were added to struct sock by filling a 2-bytes
hole found in the struct. For that reason, neither the struct size or
number of cachelines were altered.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The new action inheritdsfield copies the field DS of
IPv4 and IPv6 packets into skb->priority. This enables
later classification of packets based on the DS field.
v5:
*Update the drop counter for TC_ACT_SHOT
v4:
*Not allow setting flags other than the expected ones.
*Allow dumping the pure flags.
v3:
*Use optional flags, so that it won't break old versions of tc.
*Allow users to set both SKBEDIT_F_PRIORITY and SKBEDIT_F_INHERITDSFIELD flags.
v2:
*Fix the style issue
*Move the code from skbmod to skbedit
Original idea by Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
Reviewed-by: Michel Machado <michel@digirati.com.br>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
NetworkManager likes to manage linklocal prefix routes and does so with
the NLM_F_APPEND flag, breaking attempts to simplify the IPv6 route
code and by extension enable multipath routes with device only nexthops.
Revert f34436a43092 and these followup patches:
6eba08c3626b ("ipv6: Only emit append events for appended routes").
ce45bded6435 ("mlxsw: spectrum_router: Align with new route replace logic")
53b562df8c20 ("mlxsw: spectrum_router: Allow appending to dev-only routes")
Update the fib_tests cases to reflect the old behavior.
Fixes: f34436a43092 ("net/ipv6: Simplify route replace and appending into multipath route")
Signed-off-by: David Ahern <dsahern@gmail.com>
|
|
Also involved adding a way to run a netfilter hook over a list of packets.
Rather than attempting to make netfilter know about lists (which would be
a major project in itself) we just let it call the regular okfn (in this
case ip_rcv_finish()) for any packets it steals, and have it give us back
a list of packets it's synchronously accepted (which normally NF_HOOK
would automatically call okfn() on, but we want to be able to potentially
pass the list to a listified version of okfn().)
The netfilter hooks themselves are indirect calls that still happen per-
packet (see nf_hook_entry_hookfn()), but again, changing that can be left
for future work.
There is potential for out-of-order receives if the netfilter hook ends up
synchronously stealing packets, as they will be processed before any
accepts earlier in the list. However, it was already possible for an
asynchronous accept to cause out-of-order receives, so presumably this is
considered OK.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
First example of a layer splitting the list (rather than merely taking
individual packets off it).
Involves new list.h function, list_cut_before(), like list_cut_position()
but cuts on the other side of the given entry.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Just calls netif_receive_skb() in a loop.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
spp_ipv6_flowlabel and spp_dscp are added in sctp_paddrparams in
this patch so that users could set sctp_sock/asoc/transport dscp
and flowlabel with spp_flags SPP_IPV6_FLOWLABEL or SPP_DSCP by
SCTP_PEER_ADDR_PARAMS , as described section 8.1.12 in RFC6458.
As said in last patch, it uses '| 0x100000' or '|0x1' to mark
flowlabel or dscp is set, so that their values could be set
to 0.
Note that to guarantee that an old app built with old kernel
headers could work on the newer kernel, the param's check in
sctp_g/setsockopt_peer_addr_params() is also improved, which
follows the way that sctp_g/setsockopt_delayed_ack() or some
other sockopts' process that accept two types of params does.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Like some other per transport params, flowlabel and dscp are added
in transport, asoc and sctp_sock. By default, transport sets its
value from asoc's, and asoc does it from sctp_sock. flowlabel
only works for ipv6 transport.
Other than that they need to be passed down in sctp_xmit, flow4/6
also needs to set them before looking up route in get_dst.
Note that it uses '& 0x100000' to check if flowlabel is set and
'& 0x1' (tos 1st bit is unused) to check if dscp is set by users,
so that they could be set to 0 by sockopt in next patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|