summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2018-07-06Bluetooth: Introduce helpers for LE set scan start and completeJaganath Kanakkassery
Introduce a helper hci_req_start_scan() which starts an LE scan and call it from passive_Scan() and active_scan(). There is not functionality change in this patch. This is basically done to enable extended scanning if the controller supports which will be done in the subsequent patch Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-07-06netfilter: nf_tables: place all set backends in one single modulePablo Neira Ayuso
This patch disallows rbtree with single elements, which is causing problems with the recent timeout support. Before this patch, you could opt out individual set representations per module, which is just adding extra complexity. Fixes: 8d8540c4f5e0("netfilter: nft_set_rbtree: add timeout support") Reported-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-06nl80211/mac80211: allow non-linear skb in rx_control_portDenis Kenzior
The current implementation of cfg80211_rx_control_port assumed that the caller could provide a contiguous region of memory for the control port frame to be sent up to userspace. Unfortunately, many drivers produce non-linear skbs, especially for data frames. This resulted in userspace getting notified of control port frames with correct metadata (from address, port, etc) yet garbage / nonsense contents, resulting in bad handshakes, disconnections, etc. mac80211 linearizes skbs containing management frames. But it didn't seem worthwhile to do this for control port frames. Thus the signature of cfg80211_rx_control_port was changed to take the skb directly. nl80211 then takes care of obtaining control port frame data directly from the (linear | non-linear) skb. The caller is still responsible for freeing the skb, cfg80211_rx_control_port does not take ownership of it. Fixes: 6a671a50f819 ("nl80211: Add CMD_CONTROL_PORT_FRAME API") Signed-off-by: Denis Kenzior <denkenz@gmail.com> [fix some kernel-doc formatting, add fixes tag] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-07-06netfilter: nf_tproxy: fix possible non-linear access to transport headerMáté Eckl
This patch fixes a silent out-of-bound read possibility that was present because of the misuse of this function. Mostly it was called with a struct udphdr *hp which had only the udphdr part linearized by the skb_header_pointer, however nf_tproxy_get_sock_v{4,6} uses it as a tcphdr pointer, so some reads for tcp specific attributes may be invalid. Fixes: a583636a83ea ("inet: refactor inet[6]_lookup functions to take skb") Signed-off-by: Máté Eckl <ecklm94@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-06Bluetooth: Add HCI command for clear Resolv listAnkit Navik
Check for Resolv list supported by controller. So check the supported commmand first before issuing this command i.e.,HCI_OP_LE_CLEAR_RESOLV_LIST Before patch: < HCI Command: LE Read White List... (0x08|0x000f) plen 0 #55 [hci0] 13.338168 > HCI Event: Command Complete (0x0e) plen 5 #56 [hci0] 13.338842 LE Read White List Size (0x08|0x000f) ncmd 1 Status: Success (0x00) Size: 25 < HCI Command: LE Clear White List (0x08|0x0010) plen 0 #57 [hci0] 13.339029 > HCI Event: Command Complete (0x0e) plen 4 #58 [hci0] 13.339939 LE Clear White List (0x08|0x0010) ncmd 1 Status: Success (0x00) < HCI Command: LE Read Resolving L.. (0x08|0x002a) plen 0 #59 [hci0] 13.340152 > HCI Event: Command Complete (0x0e) plen 5 #60 [hci0] 13.340952 LE Read Resolving List Size (0x08|0x002a) ncmd 1 Status: Success (0x00) Size: 25 < HCI Command: LE Read Maximum Dat.. (0x08|0x002f) plen 0 #61 [hci0] 13.341180 > HCI Event: Command Complete (0x0e) plen 12 #62 [hci0] 13.341898 LE Read Maximum Data Length (0x08|0x002f) ncmd 1 Status: Success (0x00) Max TX octets: 251 Max TX time: 17040 Max RX octets: 251 Max RX time: 17040 After patch: < HCI Command: LE Read White List... (0x08|0x000f) plen 0 #55 [hci0] 28.919131 > HCI Event: Command Complete (0x0e) plen 5 #56 [hci0] 28.920016 LE Read White List Size (0x08|0x000f) ncmd 1 Status: Success (0x00) Size: 25 < HCI Command: LE Clear White List (0x08|0x0010) plen 0 #57 [hci0] 28.920164 > HCI Event: Command Complete (0x0e) plen 4 #58 [hci0] 28.920873 LE Clear White List (0x08|0x0010) ncmd 1 Status: Success (0x00) < HCI Command: LE Read Resolving L.. (0x08|0x002a) plen 0 #59 [hci0] 28.921109 > HCI Event: Command Complete (0x0e) plen 5 #60 [hci0] 28.922016 LE Read Resolving List Size (0x08|0x002a) ncmd 1 Status: Success (0x00) Size: 25 < HCI Command: LE Clear Resolving... (0x08|0x0029) plen 0 #61 [hci0] 28.922166 > HCI Event: Command Complete (0x0e) plen 4 #62 [hci0] 28.922872 LE Clear Resolving List (0x08|0x0029) ncmd 1 Status: Success (0x00) < HCI Command: LE Read Maximum Dat.. (0x08|0x002f) plen 0 #63 [hci0] 28.923117 > HCI Event: Command Complete (0x0e) plen 12 #64 [hci0] 28.924030 LE Read Maximum Data Length (0x08|0x002f) ncmd 1 Status: Success (0x00) Max TX octets: 251 Max TX time: 17040 Max RX octets: 251 Max RX time: 17040 Signed-off-by: Ankit Navik <ankit.p.navik@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-07-06Bluetooth: Store Resolv list sizeAnkit Navik
When the controller supports the Read LE Resolv List size feature, the maximum list size are read and now stored. Before patch: < HCI Command: LE Read White List... (0x08|0x000f) plen 0 #55 [hci0] 17.979791 > HCI Event: Command Complete (0x0e) plen 5 #56 [hci0] 17.980629 LE Read White List Size (0x08|0x000f) ncmd 1 Status: Success (0x00) Size: 25 < HCI Command: LE Clear White List (0x08|0x0010) plen 0 #57 [hci0] 17.980786 > HCI Event: Command Complete (0x0e) plen 4 #58 [hci0] 17.981627 LE Clear White List (0x08|0x0010) ncmd 1 Status: Success (0x00) < HCI Command: LE Read Maximum Dat.. (0x08|0x002f) plen 0 #59 [hci0] 17.981786 > HCI Event: Command Complete (0x0e) plen 12 #60 [hci0] 17.982636 LE Read Maximum Data Length (0x08|0x002f) ncmd 1 Status: Success (0x00) Max TX octets: 251 Max TX time: 17040 Max RX octets: 251 Max RX time: 17040 After patch: < HCI Command: LE Read White List... (0x08|0x000f) plen 0 #55 [hci0] 13.338168 > HCI Event: Command Complete (0x0e) plen 5 #56 [hci0] 13.338842 LE Read White List Size (0x08|0x000f) ncmd 1 Status: Success (0x00) Size: 25 < HCI Command: LE Clear White List (0x08|0x0010) plen 0 #57 [hci0] 13.339029 > HCI Event: Command Complete (0x0e) plen 4 #58 [hci0] 13.339939 LE Clear White List (0x08|0x0010) ncmd 1 Status: Success (0x00) < HCI Command: LE Read Resolving L.. (0x08|0x002a) plen 0 #59 [hci0] 13.340152 > HCI Event: Command Complete (0x0e) plen 5 #60 [hci0] 13.340952 LE Read Resolving List Size (0x08|0x002a) ncmd 1 Status: Success (0x00) Size: 25 < HCI Command: LE Read Maximum Dat.. (0x08|0x002f) plen 0 #61 [hci0] 13.341180 > HCI Event: Command Complete (0x0e) plen 12 #62 [hci0] 13.341898 LE Read Maximum Data Length (0x08|0x002f) ncmd 1 Status: Success (0x00) Max TX octets: 251 Max TX time: 17040 Max RX octets: 251 Max RX time: 17040 Signed-off-by: Ankit Navik <ankit.p.navik@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-07-066lowpan: iphc: reset mac_header after decompress to fix panicMichael Scott
After decompression of 6lowpan socket data, an IPv6 header is inserted before the existing socket payload. After this, we reset the network_header value of the skb to account for the difference in payload size from prior to decompression + the addition of the IPv6 header. However, we fail to reset the mac_header value. Leaving the mac_header value untouched here, can cause a calculation error in net/packet/af_packet.c packet_rcv() function when an AF_PACKET socket is opened in SOCK_RAW mode for use on a 6lowpan interface. On line 2088, the data pointer is moved backward by the value returned from skb_mac_header(). If skb->data is adjusted so that it is before the skb->head pointer (which can happen when an old value of mac_header is left in place) the kernel generates a panic in net/core/skbuff.c line 1717. This panic can be generated by BLE 6lowpan interfaces (such as bt0) and 802.15.4 interfaces (such as lowpan0) as they both use the same 6lowpan sources for compression and decompression. Signed-off-by: Michael Scott <michael@opensourcefoundries.com> Acked-by: Alexander Aring <aring@mojatatu.com> Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-07-06ipv4: Return EINVAL when ping_group_range sysctl doesn't map to user nsTyler Hicks
The low and high values of the net.ipv4.ping_group_range sysctl were being silently forced to the default disabled state when a write to the sysctl contained GIDs that didn't map to the associated user namespace. Confusingly, the sysctl's write operation would return success and then a subsequent read of the sysctl would indicate that the low and high values are the overflowgid. This patch changes the behavior by clearly returning an error when the sysctl write operation receives a GID range that doesn't map to the associated user namespace. In such a situation, the previous value of the sysctl is preserved and that range will be returned in a subsequent read of the sysctl. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-06net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()Edward Cree
Essentially the same as the ipv4 equivalents. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-06net: ipv4: fix list processing on L3 slave devicesEdward Cree
If we have an L3 master device, l3mdev_ip_rcv() will steal the skb, but we were returning NET_RX_SUCCESS from ip_rcv_finish_core() which meant that ip_list_rcv_finish() would keep it on the list. Instead let's move the l3mdev_ip_rcv() call into the caller, so that our response to a steal can be different in the single packet path (return NET_RX_SUCCESS) and the list path (forget this packet and continue). Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05leds: triggers: let struct led_trigger::activate() return an error codeUwe Kleine-König
Given that activating a trigger can fail, let the callback return an indication. This prevents to have a trigger active according to the "trigger" sysfs attribute but not functional. All users are changed accordingly to return 0 for now. There is no intended change in behaviour. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
2018-07-05batman-adv: fix checkpatch warning about misspelled "cache"Sven Eckelmann
commit a2d4df9b673c ("spelling.txt: add more spellings to spelling.txt") introduced the spellcheck of "cache" for checkpatch.pl. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-07-05net: core: filter: mark expected switch fall-throughGustavo A. R. Silva
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Warning level 2 was used: -Wimplicit-fallthrough=2 Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05net: decnet: dn_nsp_in: mark expected switch fall-throughGustavo A. R. Silva
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05tipc: mark expected switch fall-throughsGustavo A. R. Silva
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Warning level 2 was used: -Wimplicit-fallthrough=2 Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05net: qrtr: Reset the node and port ID of broadcast messagesArun Kumar Neelakantam
All the control messages broadcast to remote routers are using QRTR_NODE_BCAST instead of using local router NODE ID which cause the packets to be dropped on remote router due to invalid NODE ID. Signed-off-by: Arun Kumar Neelakantam <aneela@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05net: qrtr: Broadcast messages only from control portArun Kumar Neelakantam
The broadcast node id should only be sent with the control port id. Signed-off-by: Arun Kumar Neelakantam <aneela@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05ipv6: make ipv6_renew_options() interrupt/kernel safePaul Moore
At present the ipv6_renew_options_kern() function ends up calling into access_ok() which is problematic if done from inside an interrupt as access_ok() calls WARN_ON_IN_IRQ() on some (all?) architectures (x86-64 is affected). Example warning/backtrace is shown below: WARNING: CPU: 1 PID: 3144 at lib/usercopy.c:11 _copy_from_user+0x85/0x90 ... Call Trace: <IRQ> ipv6_renew_option+0xb2/0xf0 ipv6_renew_options+0x26a/0x340 ipv6_renew_options_kern+0x2c/0x40 calipso_req_setattr+0x72/0xe0 netlbl_req_setattr+0x126/0x1b0 selinux_netlbl_inet_conn_request+0x80/0x100 selinux_inet_conn_request+0x6d/0xb0 security_inet_conn_request+0x32/0x50 tcp_conn_request+0x35f/0xe00 ? __lock_acquire+0x250/0x16c0 ? selinux_socket_sock_rcv_skb+0x1ae/0x210 ? tcp_rcv_state_process+0x289/0x106b tcp_rcv_state_process+0x289/0x106b ? tcp_v6_do_rcv+0x1a7/0x3c0 tcp_v6_do_rcv+0x1a7/0x3c0 tcp_v6_rcv+0xc82/0xcf0 ip6_input_finish+0x10d/0x690 ip6_input+0x45/0x1e0 ? ip6_rcv_finish+0x1d0/0x1d0 ipv6_rcv+0x32b/0x880 ? ip6_make_skb+0x1e0/0x1e0 __netif_receive_skb_core+0x6f2/0xdf0 ? process_backlog+0x85/0x250 ? process_backlog+0x85/0x250 ? process_backlog+0xec/0x250 process_backlog+0xec/0x250 net_rx_action+0x153/0x480 __do_softirq+0xd9/0x4f7 do_softirq_own_stack+0x2a/0x40 </IRQ> ... While not present in the backtrace, ipv6_renew_option() ends up calling access_ok() via the following chain: access_ok() _copy_from_user() copy_from_user() ipv6_renew_option() The fix presented in this patch is to perform the userspace copy earlier in the call chain such that it is only called when the option data is actually coming from userspace; that place is do_ipv6_setsockopt(). Not only does this solve the problem seen in the backtrace above, it also allows us to simplify the code quite a bit by removing ipv6_renew_options_kern() completely. We also take this opportunity to cleanup ipv6_renew_options()/ipv6_renew_option() a small amount as well. This patch is heavily based on a rough patch by Al Viro. I've taken his original patch, converted a kmemdup() call in do_ipv6_setsockopt() to a memdup_user() call, made better use of the e_inval jump target in the same function, and cleaned up the use ipv6_renew_option() by ipv6_renew_options(). CC: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add enable_sriov boolean generic parameterVasundhara Volam
enable_sriov - Enables Single-Root Input/Output Virtualization(SR-IOV) characteristic of the device. Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add generic parameters internal_err_reset and max_macsMoshe Shemesh
Add 2 first generic parameters to devlink configuration parameters set: internal_err_reset - When set enables reset device on internal errors. max_macs - max number of MACs per ETH port. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add devlink notifications support for paramsMoshe Shemesh
Add devlink_param_notify() function to support devlink param notifications. Add notification call to devlink param set, register and unregister functions. Add devlink_param_value_changed() function to enable the driver notify devlink on value change. Driver should use this function after value was changed on any configuration mode part to driverinit. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add support for get/set driverinit valueMoshe Shemesh
"driverinit" configuration mode value is held by devlink to enable the driver query the value after reload. Two additional functions added to help the driver get/set the value from/to devlink: devlink_param_driverinit_value_set() and devlink_param_driverinit_value_get(). Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add param set commandMoshe Shemesh
Add param set command to set value for a parameter. Value can be set to any of the supported configuration modes. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add param get commandMoshe Shemesh
Add param get command which gets data per parameter. Option to dump the parameters data per device. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add devlink_param register and unregisterMoshe Shemesh
Define configuration parameters data structure. Add functions to register and unregister the driver supported configuration parameters table. For each parameter registered, the driver should fill all the parameter's fields. In case the only supported configuration mode is "driverinit" the parameter's get()/set() functions are not required and should be set to NULL, for any other configuration mode, these functions are required and should be set by the driver. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05net: limit each hash list length to MAX_GRO_SKBSLi RongQing
After commit 07d78363dcff ("net: Convert NAPI gro list into a small hash table.")' there is 8 hash buckets, which allows more flows to be held for merging. but MAX_GRO_SKBS, the total held skb for merging, is 8 skb still, limit the hash table performance. keep MAX_GRO_SKBS as 8 skb, but limit each hash list length to 8 skb, not the total 8 skb Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05netfilter: x_tables: set module owner for icmp(6) matchesFlorian Westphal
nft_compat relies on xt_request_find_match to increment refcount of the module that provides the match/target. The (builtin) icmp matches did't set the module owner so it was possible to rmmod ip(6)tables while icmp extensions were still in use. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-05ieee802154: 6lowpan: set IFLA_LINKLubomir Rintel
Otherwise NetworkManager (and iproute alike) is not able to identify the parent IEEE 802.15.4 interface of a 6LoWPAN link. Signed-off-by: Lubomir Rintel <lkundrak@v3.sk> Acked-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2018-07-05net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()Edward Cree
Since callees (ip_rcv_core() and ip_rcv_finish_core()) might free or steal the skb, we can't use the list_cut_before() method; we can't even do a list_del(&skb->list) in the drop case, because skb might have already been freed and reused. So instead, take each skb off the source list before processing, and add it to the sublist afterwards if it wasn't freed or stolen. Fixes: 5fa12739a53d net: ipv4: listify ip_rcv_finish Fixes: 17266ee93984 net: ipv4: listified version of ip_rcv Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net/sched: Make etf report drops on error_queueJesus Sanchez-Palencia
Use the socket error queue for reporting dropped packets if the socket has enabled that feature through the SO_TXTIME API. Packets are dropped either on enqueue() if they aren't accepted by the qdisc or on dequeue() if the system misses their deadline. Those are reported as different errors so applications can react accordingly. Userspace can retrieve the errors through the socket error queue and the corresponding cmsg interfaces. A struct sock_extended_err* is used for returning the error data, and the packet's timestamp can be retrieved by adding both ee_data and ee_info fields as e.g.: ((__u64) serr->ee_data << 32) + serr->ee_info This feature is disabled by default and must be explicitly enabled by applications. Enabling it can bring some overhead for the Tx cycles of the application. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net/sched: Add HW offloading capability to ETFJesus Sanchez-Palencia
Add infra so etf qdisc supports HW offload of time-based transmission. For hw offload, the time sorted list is still used, so packets are dequeued always in order of txtime. Example: $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0 $ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 100000 \ clockid CLOCK_REALTIME In this example, the Qdisc will use HW offload for the control of the transmission time through the network adapter. The hrtimer used for packets scheduling inside the qdisc will use the clockid CLOCK_REALTIME as reference and packets leave the Qdisc "delta" (100000) nanoseconds before their transmission time. Because this will be using HW offload and since dynamic clocks are not supported by the hrtimer, the system clock and the PHC clock must be synchronized for this mode to behave as expected. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net/sched: Introduce the ETF QdiscVinicius Costa Gomes
The ETF (Earliest TxTime First) qdisc uses the information added earlier in this series (the socket option SO_TXTIME and the new role of sk_buff->tstamp) to schedule packets transmission based on absolute time. For some workloads, just bandwidth enforcement is not enough, and precise control of the transmission of packets is necessary. Example: $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0 $ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \ clockid CLOCK_TAI In this example, the Qdisc will provide SW best-effort for the control of the transmission time to the network adapter, the time stamp in the socket will be in reference to the clockid CLOCK_TAI and packets will leave the qdisc "delta" (100000) nanoseconds before its transmission time. The ETF qdisc will buffer packets sorted by their txtime. It will drop packets on enqueue() if their skbuff clockid does not match the clock reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped if it expires while being enqueued. The qdisc also supports the SO_TXTIME deadline mode. For this mode, it will dequeue a packet as soon as possible and change the skb timestamp to 'now' during etf_dequeue(). Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid to be configured, but it's been decided that usage of CLOCK_TAI should be enforced until we decide to allow for other clockids to be used. The rationale here is that PTP times are usually in the TAI scale, thus no other clocks should be necessary. For now, the qdisc will return EINVAL if any clocks other than CLOCK_TAI are used. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net/sched: Allow creating a Qdisc watchdog with other clocksVinicius Costa Gomes
This adds 'qdisc_watchdog_init_clockid()' that allows a clockid to be passed, this allows other time references to be used when scheduling the Qdisc to run. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: packet: Hook into time based transmission.Richard Cochran
For raw layer-2 packets, copy the desired future transmit time from the CMSG cookie into the skb. Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: ipv6: Hook into time based transmissionJesus Sanchez-Palencia
Add a struct sockcm_cookie parameter to ip6_setup_cork() so we can easily re-use the transmit_time field from struct inet_cork for most paths, by copying the timestamp from the CMSG cookie. This is later copied into the skb during __ip6_make_skb(). For the raw fast path, also pass the sockcm_cookie as a parameter so we can just perform the copy at rawv6_send_hdrinc() directly. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: ipv4: Hook into time based transmissionJesus Sanchez-Palencia
Add a transmit_time field to struct inet_cork, then copy the timestamp from the CMSG cookie at ip_setup_cork() so we can safely copy it into the skb later during __ip_make_skb(). For the raw fast path, just perform the copy at raw_send_hdrinc(). Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: Add a new socket option for a future transmit time.Richard Cochran
This patch introduces SO_TXTIME. User space enables this option in order to pass a desired future transmit time in a CMSG when calling sendmsg(2). The argument to this socket option is a 8-bytes long struct provided by the uapi header net_tstamp.h defined as: struct sock_txtime { clockid_t clockid; u32 flags; }; Note that new fields were added to struct sock by filling a 2-bytes hole found in the struct. For that reason, neither the struct size or number of cachelines were altered. Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: Clear skb->tstamp only on the forwarding pathJesus Sanchez-Palencia
This is done in preparation for the upcoming time based transmission patchset. Now that skb->tstamp will be used to hold packet's txtime, we must ensure that it is being cleared when traversing namespaces. Also, doing that from skb_scrub_packet() before the early return would break our feature when tunnels are used. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: sched: act_pedit: fix possible memory leak in tcf_pedit_init()Wei Yongjun
'keys_ex' is malloced by tcf_pedit_keys_ex_parse() in tcf_pedit_init() but not all of the error handle path free it, this may cause memory leak. This patch fix it. Fixes: 71d0ed7079df ("net/act_pedit: Support using offset relative to the conventional network headers") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04sctp: fix the issue that pathmtu may be set lower than MINSEGMENTXin Long
After commit b6c5734db070 ("sctp: fix the handling of ICMP Frag Needed for too small MTUs"), sctp_transport_update_pmtu would refetch pathmtu from the dst and set it to transport's pathmtu without any check. The new pathmtu may be lower than MINSEGMENT if the dst is obsolete and updated by .get_dst() in sctp_transport_update_pmtu. In this case, it could have a smaller MTU as well, and thus we should validate it against MINSEGMENT instead. Syzbot reported a warning in sctp_mtu_payload caused by this. This patch refetches the pathmtu by calling sctp_dst_mtu where it does the check against MINSEGMENT. v1->v2: - refetch the pathmtu by calling sctp_dst_mtu instead as Marcelo's suggestion. Fixes: b6c5734db070 ("sctp: fix the handling of ICMP Frag Needed for too small MTUs") Reported-by: syzbot+f0d9d7cba052f9344b03@syzkaller.appspotmail.com Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net:sched: add action inheritdsfield to skbeditQiaobin Fu
The new action inheritdsfield copies the field DS of IPv4 and IPv6 packets into skb->priority. This enables later classification of packets based on the DS field. v5: *Update the drop counter for TC_ACT_SHOT v4: *Not allow setting flags other than the expected ones. *Allow dumping the pure flags. v3: *Use optional flags, so that it won't break old versions of tc. *Allow users to set both SKBEDIT_F_PRIORITY and SKBEDIT_F_INHERITDSFIELD flags. v2: *Fix the style issue *Move the code from skbmod to skbedit Original idea by Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu> Reviewed-by: Michel Machado <michel@digirati.com.br> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net/ipv6: Revert attempt to simplify route replace and appendDavid Ahern
NetworkManager likes to manage linklocal prefix routes and does so with the NLM_F_APPEND flag, breaking attempts to simplify the IPv6 route code and by extension enable multipath routes with device only nexthops. Revert f34436a43092 and these followup patches: 6eba08c3626b ("ipv6: Only emit append events for appended routes"). ce45bded6435 ("mlxsw: spectrum_router: Align with new route replace logic") 53b562df8c20 ("mlxsw: spectrum_router: Allow appending to dev-only routes") Update the fib_tests cases to reflect the old behavior. Fixes: f34436a43092 ("net/ipv6: Simplify route replace and appending into multipath route") Signed-off-by: David Ahern <dsahern@gmail.com>
2018-07-04gen_stats: Fix netlink stats dumping in the presence of paddingToke Høiland-Jørgensen
The gen_stats facility will add a header for the toplevel nlattr of type TCA_STATS2 that contains all stats added by qdisc callbacks. A reference to this header is stored in the gnet_dump struct, and when all the per-qdisc callbacks have finished adding their stats, the length of the containing header will be adjusted to the right value. However, on architectures that need padding (i.e., that don't set CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS), the padding nlattr is added before the stats, which means that the stored pointer will point to the padding, and so when the header is fixed up, the result is just a very big padding nlattr. Because most qdiscs also supply the legacy TCA_STATS struct, this problem has been mostly invisible, but we exposed it with the netlink attribute-based statistics in CAKE. Fix the issue by fixing up the stored pointer if it points to a padding nlattr. Tested-by: Pete Heist <pete@heistp.net> Tested-by: Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: don't bother calling list RX functions on empty listsEdward Cree
Generally the check should be very cheap, as the sk_buff_head is in cache. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: ipv4: listify ip_rcv_finishEdward Cree
ip_rcv_finish_core(), if it does not drop, sets skb->dst by either early demux or route lookup. The last step, calling dst_input(skb), is left to the caller; in the listified case, we split to form sublists with a common dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop. The next step in listification would thus be to add a list_input() method to struct dst_entry. Early demux is an indirect call based on iph->protocol; this is another opportunity for listification which is not taken here (it would require slicing up ip_rcv_finish_core() to allow splitting on protocol changes). Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: ipv4: listified version of ip_rcvEdward Cree
Also involved adding a way to run a netfilter hook over a list of packets. Rather than attempting to make netfilter know about lists (which would be a major project in itself) we just let it call the regular okfn (in this case ip_rcv_finish()) for any packets it steals, and have it give us back a list of packets it's synchronously accepted (which normally NF_HOOK would automatically call okfn() on, but we want to be able to potentially pass the list to a listified version of okfn().) The netfilter hooks themselves are indirect calls that still happen per- packet (see nf_hook_entry_hookfn()), but again, changing that can be left for future work. There is potential for out-of-order receives if the netfilter hook ends up synchronously stealing packets, as they will be processed before any accepts earlier in the list. However, it was already possible for an asynchronous accept to cause out-of-order receives, so presumably this is considered OK. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: core: propagate SKB lists through packet_type lookupEdward Cree
__netif_receive_skb_core() does a depressingly large amount of per-packet work that can't easily be listified, because the another_round looping makes it nontrivial to slice up into smaller functions. Fortunately, most of that work disappears in the fast path: * Hardware devices generally don't have an rx_handler * Unless you're tcpdumping or something, there is usually only one ptype * VLAN processing comes before the protocol ptype lookup, so doesn't force a pt_prev deliver so normally, __netif_receive_skb_core() will run straight through and pass back the one ptype found in ptype_base[hash of skb->protocol]. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: core: another layer of lists, around PF_MEMALLOC skb handlingEdward Cree
First example of a layer splitting the list (rather than merely taking individual packets off it). Involves new list.h function, list_cut_before(), like list_cut_position() but cuts on the other side of the given entry. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: core: Another step of skb receive list processingEdward Cree
netif_receive_skb_list_internal() now processes a list and hands it on to the next function. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: core: unwrap skb list receive slightly furtherEdward Cree
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>