summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2020-05-06bpf, sockmap: bpf_tcp_ingress needs to subtract bytes from sg.sizeJohn Fastabend
In bpf_tcp_ingress we used apply_bytes to subtract bytes from sg.size which is used to track total bytes in a message. But this is not correct because apply_bytes is itself modified in the main loop doing the mem_charge. Then at the end of this we have sg.size incorrectly set and out of sync with actual sk values. Then we can get a splat if we try to cork the data later and again try to redirect the msg to ingress. To fix instead of trying to track msg.size do the easy thing and include it as part of the sk_msg_xfer logic so that when the msg is moved the sg.size is always correct. To reproduce the below users will need ingress + cork and hit an error path that will then try to 'free' the skmsg. [ 173.699981] BUG: KASAN: null-ptr-deref in sk_msg_free_elem+0xdd/0x120 [ 173.699987] Read of size 8 at addr 0000000000000008 by task test_sockmap/5317 [ 173.700000] CPU: 2 PID: 5317 Comm: test_sockmap Tainted: G I 5.7.0-rc1+ #43 [ 173.700005] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019 [ 173.700009] Call Trace: [ 173.700021] dump_stack+0x8e/0xcb [ 173.700029] ? sk_msg_free_elem+0xdd/0x120 [ 173.700034] ? sk_msg_free_elem+0xdd/0x120 [ 173.700042] __kasan_report+0x102/0x15f [ 173.700052] ? sk_msg_free_elem+0xdd/0x120 [ 173.700060] kasan_report+0x32/0x50 [ 173.700070] sk_msg_free_elem+0xdd/0x120 [ 173.700080] __sk_msg_free+0x87/0x150 [ 173.700094] tcp_bpf_send_verdict+0x179/0x4f0 [ 173.700109] tcp_bpf_sendpage+0x3ce/0x5d0 Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/158861290407.14306.5327773422227552482.stgit@john-Precision-5820-Tower
2020-05-06bpf, sockmap: msg_pop_data can incorrecty set an sge lengthJohn Fastabend
When sk_msg_pop() is called where the pop operation is working on the end of a sge element and there is no additional trailing data and there _is_ data in front of pop, like the following case, |____________a_____________|__pop__| We have out of order operations where we incorrectly set the pop variable so that instead of zero'ing pop we incorrectly leave it untouched, effectively. This can cause later logic to shift the buffers around believing it should pop extra space. The result is we have 'popped' more data then we expected potentially breaking program logic. It took us a while to hit this case because typically we pop headers which seem to rarely be at the end of a scatterlist elements but we can't rely on this. Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/158861288359.14306.7654891716919968144.stgit@john-Precision-5820-Tower
2020-05-05neigh: send protocol value in neighbor create notificationRoman Mashak
When a new neighbor entry has been added, event is generated but it does not include protocol, because its value is assigned after the event notification routine has run, so move protocol assignment code earlier. Fixes: df9b0e30d44c ("neighbor: Add protocol attribute") Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Roman Mashak <mrv@mojatatu.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05erspan: Add type I version 0 support.William Tu
The Type I ERSPAN frame format is based on the barebones IP + GRE(4-byte) encapsulation on top of the raw mirrored frame. Both type I and II use 0x88BE as protocol type. Unlike type II and III, no sequence number or key is required. To creat a type I erspan tunnel device: $ ip link add dev erspan11 type erspan \ local 172.16.1.100 remote 172.16.1.200 \ erspan_ver 0 Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05net/smc: remove unused inline function smc_curs_readYueHaibing
commit bac6de7b6370 ("net/smc: eliminate cursor read and write calls") left behind this. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05net/smc: log important pnetid and state change eventsKarsten Graul
Print to system log when SMC links are available or go down, link group state changes or pnetids are applied to and removed from devices. The log entries are triggered by either user configuration actions or adapter activation/deactivation events and are not expected to happen often. The entries help SMC users to keep track of the SMC link group status and to detect when actions are needed (like to add replacements for failed adapters). Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05sch_choke: Remove classid from choke_skb_cb.David S. Miller
Suggested by Cong Wang. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05net: sched: choke: Remove unused inline function choke_set_classidYueHaibing
There's no callers in-tree anymore since commit 5952fde10c35 ("net: sched: choke: remove dead filter classify code") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04xsk: Remove unnecessary member in xdp_umemMagnus Karlsson
Remove the unnecessary member of address in struct xdp_umem as it is only used during the umem registration. No need to carry this around as it is not used during run-time nor when unregistering the umem. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Link: https://lore.kernel.org/bpf/1588599232-24897-3-git-send-email-magnus.karlsson@intel.com
2020-05-04xsk: Change two variable names for increased clarityMagnus Karlsson
Change two variables names so that it is clearer what they represent. The first one is xsk_list that in fact only contains the list of AF_XDP sockets with a Tx component. Change this to xsk_tx_list for improved clarity. The second variable is size in the ring structure. One might think that this is the size of the ring, but it is in fact the size of the umem, copied into the ring structure to improve performance. Rename this variable umem_size to avoid any confusion. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Link: https://lore.kernel.org/bpf/1588599232-24897-2-git-send-email-magnus.karlsson@intel.com
2020-05-04net: partially revert dynamic lockdep key changesCong Wang
This patch reverts the folowing commits: commit 064ff66e2bef84f1153087612032b5b9eab005bd "bonding: add missing netdev_update_lockdep_key()" commit 53d374979ef147ab51f5d632dfe20b14aebeccd0 "net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()" commit 1f26c0d3d24125992ab0026b0dab16c08df947c7 "net: fix kernel-doc warning in <linux/netdevice.h>" commit ab92d68fc22f9afab480153bd82a20f6e2533769 "net: core: add generic lockdep keys" but keeps the addr_list_lock_key because we still lock addr_list_lock nestedly on stack devices, unlikely xmit_lock this is safe because we don't take addr_list_lock on any fast path. Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04atm: fix a memory leak of vcc->user_backCong Wang
In lec_arp_clear_vccs() only entry->vcc is freed, but vcc could be installed on entry->recv_vcc too in lec_vcc_added(). This fixes the following memory leak: unreferenced object 0xffff8880d9266b90 (size 16): comm "atm2", pid 425, jiffies 4294907980 (age 23.488s) hex dump (first 16 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b 6b a5 ............kkk. backtrace: [<(____ptrval____)>] kmem_cache_alloc_trace+0x10e/0x151 [<(____ptrval____)>] lane_ioctl+0x4b3/0x569 [<(____ptrval____)>] do_vcc_ioctl+0x1ea/0x236 [<(____ptrval____)>] svc_ioctl+0x17d/0x198 [<(____ptrval____)>] sock_do_ioctl+0x47/0x12f [<(____ptrval____)>] sock_ioctl+0x2f9/0x322 [<(____ptrval____)>] vfs_ioctl+0x1e/0x2b [<(____ptrval____)>] ksys_ioctl+0x61/0x80 [<(____ptrval____)>] __x64_sys_ioctl+0x16/0x19 [<(____ptrval____)>] do_syscall_64+0x57/0x65 [<(____ptrval____)>] entry_SYSCALL_64_after_hwframe+0x49/0xb3 Cc: Gengming Liu <l.dmxcsnsbh@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04atm: fix a UAF in lec_arp_clear_vccs()Cong Wang
Gengming reported a UAF in lec_arp_clear_vccs(), where we add a vcc socket to an entry in a per-device list but free the socket without removing it from the list when vcc->dev is NULL. We need to call lec_vcc_close() to search and remove those entries contain the vcc being destroyed. This can be done by calling vcc->push(vcc, NULL) unconditionally in vcc_destroy_socket(). Another issue discovered by Gengming's reproducer is the vcc->dev may point to the static device lecatm_dev, for which we don't need to register/unregister device, so we can just check for vcc->dev->ops->owner. Reported-by: Gengming Liu <l.dmxcsnsbh@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04devlink: let kernel allocate region snapshot idJakub Kicinski
Currently users have to choose a free snapshot id before calling DEVLINK_CMD_REGION_NEW. This is potentially racy and inconvenient. Make the DEVLINK_ATTR_REGION_SNAPSHOT_ID optional and try to allocate id automatically. Send a message back to the caller with the snapshot info. Example use: $ devlink region new netdevsim/netdevsim1/dummy netdevsim/netdevsim1/dummy: snapshot 1 $ id=$(devlink -j region new netdevsim/netdevsim1/dummy | \ jq '.[][][][]') $ devlink region dump netdevsim/netdevsim1/dummy snapshot $id [...] $ devlink region del netdevsim/netdevsim1/dummy snapshot $id v4: - inline the notification code v3: - send the notification only once snapshot creation completed. v2: - don't wrap the line containing extack; - add a few sentences to the docs. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04devlink: factor out building a snapshot notificationJakub Kicinski
We'll need to send snapshot info back on the socket which requested a snapshot to be created. Factor out constructing a snapshot description from the broadcast notification code. v3: new patch Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net_sched: sch_fq: add horizon attributeEric Dumazet
QUIC servers would like to use SO_TXTIME, without having CAP_NET_ADMIN, to efficiently pace UDP packets. As far as sch_fq is concerned, we need to add safety checks, so that a buggy application does not fill the qdisc with packets having delivery time far in the future. This patch adds a configurable horizon (default: 10 seconds), and a configurable policy when a packet is beyond the horizon at enqueue() time: - either drop the packet (default policy) - or cap its delivery time to the horizon. $ tc -s -d qd sh dev eth0 qdisc fq 8022: root refcnt 257 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 10Kb initial_quantum 51160b low_rate_threshold 550Kbit refill_delay 40.0ms timer_slack 10.000us horizon 10.000s Sent 1234215879 bytes 837099 pkt (dropped 21, overlimits 0 requeues 6) backlog 0b 0p requeues 6 flows 1191 (inactive 1177 throttled 0) gc 0 highprio 0 throttled 692 latency 11.480us pkts_too_long 0 alloc_errors 0 horizon_drops 21 horizon_caps 0 v2: fixed an overflow on 32bit kernels in fq_init(), reported by kbuild test robot <lkp@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net_sched: fix tcm_parent in tc filter dumpCong Wang
When we tell kernel to dump filters from root (ffff:ffff), those filters on ingress (ffff:0000) are matched, but their true parents must be dumped as they are. However, kernel dumps just whatever we tell it, that is either ffff:ffff or ffff:0000: $ nl-cls-list --dev=dummy0 --parent=root cls basic dev dummy0 id none parent root prio 49152 protocol ip match-all cls basic dev dummy0 id :1 parent root prio 49152 protocol ip match-all $ nl-cls-list --dev=dummy0 --parent=ffff: cls basic dev dummy0 id none parent ffff: prio 49152 protocol ip match-all cls basic dev dummy0 id :1 parent ffff: prio 49152 protocol ip match-all This is confusing and misleading, more importantly this is a regression since 4.15, so the old behavior must be restored. And, when tc filters are installed on a tc class, the parent should be the classid, rather than the qdisc handle. Commit edf6711c9840 ("net: sched: remove classid and q fields from tcf_proto") removed the classid we save for filters, we can just restore this classid in tcf_block. Steps to reproduce this: ip li set dev dummy0 up tc qd add dev dummy0 ingress tc filter add dev dummy0 parent ffff: protocol arp basic action pass tc filter show dev dummy0 root Before this patch: filter protocol arp pref 49152 basic filter protocol arp pref 49152 basic handle 0x1 action order 1: gact action pass random type none pass val 0 index 1 ref 1 bind 1 After this patch: filter parent ffff: protocol arp pref 49152 basic filter parent ffff: protocol arp pref 49152 basic handle 0x1 action order 1: gact action pass random type none pass val 0 index 1 ref 1 bind 1 Fixes: a10fa20101ae ("net: sched: propagate q and parent from caller down to tcf_fill_node") Fixes: edf6711c9840 ("net: sched: remove classid and q fields from tcf_proto") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net: sched: fallback to qdisc noqueue if default qdisc setup failJesper Dangaard Brouer
Currently if the default qdisc setup/init fails, the device ends up with qdisc "noop", which causes all TX packets to get dropped. With the introduction of sysctl net/core/default_qdisc it is possible to change the default qdisc to be more advanced, which opens for the possibility that Qdisc_ops->init() can fail. This patch detect these kind of failures, and choose to fallback to qdisc "noqueue", which is so simple that its init call will not fail. This allows the interface to continue functioning. V2: As this also captures memory failures, which are transient, the device is not kept in IFF_NO_QUEUE state. This allows the net_device to retry to default qdisc assignment. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net/smc: save SMC-R peer link_uidKarsten Graul
During SMC-R link establishment the peers exchange the link_uid that is used for debugging purposes. Save the peer link_uid in smc_link so it can be retrieved by the smc_diag netlink interface. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net/smc: create improved SMC-R link_uidKarsten Graul
The link_uid of an SMC-R link is exchanged between SMC peers and its value can be used for debugging purposes. Create a unique link_uid during link initialization and use it in communication with SMC-R peers. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net/smc: improve termination processingKarsten Graul
Add helper smcr_lgr_link_deactivate_all() and eliminate duplicate code. In smc_lgr_free(), clear the smc-r links before smc_lgr_free_bufs() is called so buffers are already prepared for free. The usage of the soft parameter in __smc_lgr_terminate() is no longer needed, smc_lgr_free() can be called directly. smc_lgr_terminate_sched() and smc_smcd_terminate() set lgr->freeing to indicate that the link group will be freed soon to avoid unnecessary schedules of the free worker. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net/smc: add termination reason and handle LLC protocol violationKarsten Graul
Allow to set the reason code for the link group termination, and set meaningful values before termination processing is triggered. This reason code is sent to the peer in the final delete link message. When the LLC request or response layer receives a message type that was not handled, drop a warning and terminate the link group. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net/smc: asymmetric link taggingKarsten Graul
New connections must not be assigned to asymmetric links. Add asymmetric link tagging using new link variable link_is_asym. The new helpers smcr_lgr_set_type() and smcr_lgr_set_type_asym() are called to set the state of the link group, and tag all links accordingly. smcr_lgr_conn_assign_link() respects the link tagging and will not assign new connections to links tagged as asymmetric link. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net/smc: assign link to a new connectionKarsten Graul
For new connections, assign a link from the link group, using some simple load balancing. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net/smc: send DELETE_LINK, ALL message and wait for send to completeKarsten Graul
Add smc_llc_send_message_wait() which uses smc_wr_tx_send_wait() to send an LLC message and waits for the message send to complete. smc_llc_send_link_delete_all() calls the new function to send an DELETE_LINK,ALL LLC message. The RFC states that the sender of this type of message needs to wait for the completion event of the message transmission and can terminate the link afterwards. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net/smc: wait for departure of an IB messageKarsten Graul
Introduce smc_wr_tx_send_wait() to send an IB message and wait for the tx completion event of the message. This makes sure that the message is no longer in-flight when the function returns. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net/smc: handle incoming CDC validation messageKarsten Graul
Call smc_cdc_msg_validate() when a CDC message with the failover validation bit enabled was received. Validate that the sequence number sent with the message is one we already have received. If not, messages were lost and the connection is terminated using a new abort_work. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net/smc: send failover validation messageKarsten Graul
When a connection is switched to a new link then a link validation message must be sent to the peer over the new link, containing the sequence number of the last CDC message that was sent over the old link. The peer will validate if this sequence number is the same or lower then the number he received, and abort the connection if messages were lost. Add smcr_cdc_msg_send_validation() to send the message validation message and call it when a connection was switched in smc_switch_cursor(). Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net/smc: switch connections to alternate linkKarsten Graul
Add smc_switch_conns() to switch all connections from a link that is going down. Find an other link to switch the connections to, and switch each connection to the new link. smc_switch_cursor() updates the cursors of a connection to the state of the last successfully sent CDC message. When there is no link to switch to, terminate the link group. Call smc_switch_conns() when a link is going down. And with the possibility that links of connections can switch adapt CDC and TX functions to detect and handle link switches. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net/smc: save state of last sent CDC messageKarsten Graul
When a link goes down and all connections of this link need to be switched to an other link then the producer cursor and the sequence of the last successfully sent CDC message must be known. Add the two fields to the SMC connection and update it in the tx completion handler. And to allow matching of sequences in error cases reset the seqno to the old value in smc_cdc_msg_send() when the actual send failed. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04devlink: Fix reporter's recovery conditionAya Levin
Devlink health core conditions the reporter's recovery with the expiration of the grace period. This is not relevant for the first recovery. Explicitly demand that the grace period will only apply to recoveries other than the first. Fixes: c8e1da0bf923 ("devlink: Add health report functionality") Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04tipc: fix partial topology connection closureTuong Lien
When an application connects to the TIPC topology server and subscribes to some services, a new connection is created along with some objects - 'tipc_subscription' to store related data correspondingly... However, there is one omission in the connection handling that when the connection or application is orderly shutdown (e.g. via SIGQUIT, etc.), the connection is not closed in kernel, the 'tipc_subscription' objects are not freed too. This results in: - The maximum number of subscriptions (65535) will be reached soon, new subscriptions will be rejected; - TIPC module cannot be removed (unless the objects are somehow forced to release first); The commit fixes the issue by closing the connection if the 'recvmsg()' returns '0' i.e. when the peer is shutdown gracefully. It also includes the other unexpected cases. Acked-by: Jon Maloy <jmaloy@redhat.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04net: dsa: Do not make user port errors fatalFlorian Fainelli
Prior to 1d27732f411d ("net: dsa: setup and teardown ports"), we would not treat failures to set-up an user port as fatal, but after this commit we would, which is a regression for some systems where interfaces may be declared in the Device Tree, but the underlying hardware may not be present (pluggable daughter cards for instance). Fixes: 1d27732f411d ("net: dsa: setup and teardown ports") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net/smc: enqueue local LLC messagesKarsten Graul
As SMC server, when a second link was deleted, trigger the setup of an asymmetric link. Do this by enqueueing a local ADD_LINK message which is processed by the LLC layer as if it were received from peer. Do the same when a new IB port became active and a new link could be created. smc_llc_srv_add_link_local() enqueues a local ADD_LINK message. And smc_llc_srv_delete_link_local() is used the same way to enqueue a local DELETE_LINK message. This is used when an IB port is no longer active. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net/smc: delete link processing as SMC serverKarsten Graul
Add smc_llc_process_srv_delete_link() to process a DELETE_LINK request as SMC server. When the request is to delete ALL links then terminate the whole link group. If not, find the link to delete by its link_id, send the DELETE_LINK request LLC message and wait for the response. No matter if a response was received, clear the deleted link and update the link group state. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net/smc: delete link processing as SMC clientKarsten Graul
Add smc_llc_process_cli_delete_link() to process a DELETE_LINK request as SMC client. When the request is to delete ALL links then terminate the whole link group. If not, find the link to delete by its link_id, send the DELETE_LINK response LLC message and then clear the deleted link. Finally determine and update the link group state. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net/smc: llc_del_link_work and use the LLC flow for delete linkKarsten Graul
Introduce a work that is scheduled when a new DELETE_LINK LLC request is received. The work will call either the SMC client or SMC server DELETE_LINK processing. And use the LLC flow framework to process incoming DELETE_LINK LLC messages, scheduling the llc_del_link_work for those events. With these changes smc_lgr_forget() is only called by one function and can be migrated into smc_lgr_cleanup_early(). Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net/smc: delete an asymmetric link as SMC serverKarsten Graul
When a link group moved from asymmetric to symmetric state then the dangling asymmetric link can be deleted. Add smc_llc_find_asym_link() to find the respective link and add smc_llc_delete_asym_link() to delete it. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net/smc: final part of add link processing as SMC serverKarsten Graul
This patch finalizes the ADD_LINK processing of new links. Send the CONFIRM_LINK request to the peer, receive the response and set link state to ACTIVE. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net/smc: rkey processing for a new link as SMC serverKarsten Graul
Part of SMC server new link establishment is the exchange of rkeys for used buffers. Loop over all used RMB buffers and send ADD_LINK_CONTINUE LLC messages to the peer. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net/smc: first part of add link processing as SMC serverKarsten Graul
First set of functions to process an ADD_LINK LLC request as an SMC server. Find an alternate IB device, determine the new link group type and get the index for the new link. Then initialize the link and send the ADD_LINK LLC message to the peer. Save the contents of the response, ready the link, map all used buffers and register the buffers with the IB device. If any error occurs, stop the processing and clear the link. And call smc_llc_srv_add_link() in af_smc.c to start second link establishment after the initial link of a link group was created. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net/smc: final part of add link processing as SMC clientKarsten Graul
This patch finalizes the ADD_LINK processing of new links. Receive the CONFIRM_LINK request from peer, complete the link initialization, register all used buffers with the IB device and finally send the CONFIRM_LINK response, which completes the ADD_LINK processing. And activate smc_llc_cli_add_link() in af_smc.c. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net/smc: rkey processing for a new link as SMC clientKarsten Graul
Part of the SMC client new link establishment process is the exchange of rkeys for all used buffers. Add new LLC message type ADD_LINK_CONTINUE which is used to exchange rkeys of all current RMB buffers. Add functions to iterate over all used RMB buffers of the link group, and implement the ADD_LINK_CONTINUE processing. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net/smc: first part of add link processing as SMC clientKarsten Graul
First set of functions to process an ADD_LINK LLC request as an SMC client. Find an alternate IB device, determine the new link group type and get the index for the new link. Then ready the link, map the buffers and send an ADD_LINK LLC response. If any error occurs, send a reject LLC message and terminate the processing. Add smc_llc_alloc_alt_link() to find a free link index for a new link, depending on the new link group type. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net_sched: sch_skbprio: add message validation to skbprio_change()Eric Dumazet
Do not assume the attribute has the right size. Fixes: aea5f654e6b7 ("net/sched: add skbprio scheduler") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net_sched: sch_fq: perform a prefetch() earlierEric Dumazet
The prefetch() done in fq_dequeue() can be done a bit earlier after the refactoring of the code done in the prior patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net_sched: sch_fq: do not call fq_peek() twice per packetEric Dumazet
This refactors the code to not call fq_peek() from fq_dequeue_head() since the caller can provide the skb. Also rename fq_dequeue_head() to fq_dequeue_skb() because 'head' is a bit vague, given the skb could come from t_root rb-tree. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net_sched: sch_fq: use bulk freeing in fq_gc()Eric Dumazet
fq_gc() already builds a small array of pointers, so using kmem_cache_free_bulk() needs very little change. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net_sched: sch_fq: change fq_flow size/layoutEric Dumazet
sizeof(struct fq_flow) is 112 bytes on 64bit arches. This means that half of them use two cache lines, but 50% use three cache lines. This patch adds cache line alignment, and makes sure that only the first cache line is touched by fq_enqueue(), which is more expensive that fq_dequeue() in general. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03net_sched: sch_fq: avoid touching f->next from fq_gc()Eric Dumazet
A significant amount of cpu cycles is spent in fq_gc() When fq_gc() does its lookup in the rb-tree, it needs the following fields from struct fq_flow : f->sk (lookup key in the rb-tree) f->fq_node (anchor in the rb-tree) f->next (used to determine if the flow is detached) f->age (used to determine if the flow is candidate for gc) This unfortunately spans two cache lines (assuming 64 bytes cache lines) We can avoid using f->next, if we use the low order bit of f->{age|tail} This low order bit is 0, if f->tail points to an sk_buff. We set the low order bit to 1, if the union contains a jiffies value. Combined with the following patch, this makes sure we only need to bring into cpu caches one cache line per flow. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>