summaryrefslogtreecommitdiff
path: root/net/ipv6/addrconf.c
AgeCommit message (Collapse)Author
2022-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c 9c5de246c1db ("net: sparx5: mdb add/del handle non-sparx5 devices") fbb89d02e33a ("net: sparx5: Allow mdb entries to both CPU and ports") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-29ipv6: fix lockdep splat in in6_dump_addrs()Eric Dumazet
As reported by syzbot, we should not use rcu_dereference() when rcu_read_lock() is not held. WARNING: suspicious RCU usage 5.19.0-rc2-syzkaller #0 Not tainted net/ipv6/addrconf.c:5175 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by syz-executor326/3617: #0: ffffffff8d5848e8 (rtnl_mutex){+.+.}-{3:3}, at: netlink_dump+0xae/0xc20 net/netlink/af_netlink.c:2223 stack backtrace: CPU: 0 PID: 3617 Comm: syz-executor326 Not tainted 5.19.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 in6_dump_addrs+0x12d1/0x1790 net/ipv6/addrconf.c:5175 inet6_dump_addr+0x9c1/0xb50 net/ipv6/addrconf.c:5300 netlink_dump+0x541/0xc20 net/netlink/af_netlink.c:2275 __netlink_dump_start+0x647/0x900 net/netlink/af_netlink.c:2380 netlink_dump_start include/linux/netlink.h:245 [inline] rtnetlink_rcv_msg+0x73e/0xc90 net/core/rtnetlink.c:6046 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2501 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x917/0xe10 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:734 ____sys_sendmsg+0x6eb/0x810 net/socket.c:2492 ___sys_sendmsg+0xf3/0x170 net/socket.c:2546 __sys_sendmsg net/socket.c:2575 [inline] __do_sys_sendmsg net/socket.c:2584 [inline] __se_sys_sendmsg net/socket.c:2582 [inline] __x64_sys_sendmsg+0x132/0x220 net/socket.c:2582 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: 88e2ca308094 ("mld: convert ifmcaddr6 to RCU") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Taehee Yoo <ap420073@gmail.com> Link: https://lore.kernel.org/r/20220628121248.858695-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-28ipv6/addrconf: fix timing bug in tempaddr regenSam Edwards
The addrconf_verify_rtnl() function uses a big if/elseif/elseif/... block to categorize each address by what type of attention it needs. An about-to-expire (RFC 4941) temporary address is one such category, but the previous elseif branch catches addresses that have already run out their prefered_lft. This means that if addrconf_verify_rtnl() fails to run in the necessary time window (i.e. REGEN_ADVANCE time units before the end of the prefered_lft), the temporary address will never be regenerated, and no temporary addresses will be available until each one's valid_lft runs out and manage_tempaddrs() begins anew. Fix this by moving the entire temporary address regeneration case out of that block. That block is supposed to implement the "destructive" part of an address's lifecycle, and regenerating a fresh temporary address is not, semantically speaking, actually tied to any particular lifecycle stage. The age test is also changed from `age >= prefered_lft - regen_advance` to `age + regen_advance >= prefered_lft` instead, to ensure no underflow occurs if the system administrator increases the regen_advance to a value greater than the already-set prefered_lft. Note that this does not fix the problem of addrconf_verify_rtnl() sometimes not running in time, resulting in the race condition described in RFC 4941 section 3.4 - it only ensures that the address is regenerated. Fixing THAT problem may require either using jiffies instead of seconds for all time arithmetic here, or always rounding up when regen_advance is converted to seconds. Signed-off-by: Sam Edwards <CFSworks@gmail.com> Link: https://lore.kernel.org/r/20220623181103.7033-1-CFSworks@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-06-27ipv6: take care of disable_policy when restoring routesNicolas Dichtel
When routes corresponding to addresses are restored by fixup_permanent_addr(), the dst_nopolicy parameter was not set. The typical use case is a user that configures an address on a down interface and then put this interface up. Let's take care of this flag in addrconf_f6i_alloc(), so that every callers benefit ont it. CC: stable@kernel.org CC: David Forster <dforster@brocade.com> Fixes: df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl") Reported-by: Siwar Zitouni <siwar.zitouni@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220623120015.32640-1-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09net: rename reference+tracking helpersJakub Kicinski
Netdev reference helpers have a dev_ prefix for historic reasons. Renaming the old helpers would be too much churn but we can rename the tracking ones which are relatively recent and should be the default for new code. Rename: dev_hold_track() -> netdev_hold() dev_put_track() -> netdev_put() dev_replace_track() -> netdev_ref_replace() Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-31net/ipv6: Expand and rename accept_unsolicited_na to accept_untracked_naArun Ajith S
RFC 9131 changes default behaviour of handling RX of NA messages when the corresponding entry is absent in the neighbour cache. The current implementation is limited to accept just unsolicited NAs. However, the RFC is more generic where it also accepts solicited NAs. Both types should result in adding a STALE entry for this case. Expand accept_untracked_na behaviour to also accept solicited NAs to be compliant with the RFC and rename the sysctl knob to accept_untracked_na. Fixes: f9a2fb73318e ("net/ipv6: Introduce accept_unsolicited_na knob to implement router-side changes for RFC9131") Signed-off-by: Arun Ajith S <aajith@arista.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220530101414.65439-1-aajith@arista.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-25Merge tag 'net-next-5.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core ---- - Support TCPv6 segmentation offload with super-segments larger than 64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP). - Generalize skb freeing deferral to per-cpu lists, instead of per-socket lists. - Add a netdev statistic for packets dropped due to L2 address mismatch (rx_otherhost_dropped). - Continue work annotating skb drop reasons. - Accept alternative netdev names (ALT_IFNAME) in more netlink requests. - Add VLAN support for AF_PACKET SOCK_RAW GSO. - Allow receiving skb mark from the socket as a cmsg. - Enable memcg accounting for veth queues, sysctl tables and IPv6. BPF --- - Add libbpf support for User Statically-Defined Tracing (USDTs). - Speed up symbol resolution for kprobes multi-link attachments. - Support storing typed pointers to referenced and unreferenced objects in BPF maps. - Add support for BPF link iterator. - Introduce access to remote CPU map elements in BPF per-cpu map. - Allow middle-of-the-road settings for the kernel.unprivileged_bpf_disabled sysctl. - Implement basic types of dynamic pointers e.g. to allow for dynamically sized ringbuf reservations without extra memory copies. Protocols --------- - Retire port only listening_hash table, add a second bind table hashed by port and address. Avoid linear list walk when binding to very popular ports (e.g. 443). - Add bridge FDB bulk flush filtering support allowing user space to remove all FDB entries matching a condition. - Introduce accept_unsolicited_na sysctl for IPv6 to implement router-side changes for RFC9131. - Support for MPTCP path manager in user space. - Add MPTCP support for fallback to regular TCP for connections that have never connected additional subflows or transmitted out-of-sequence data (partial support for RFC8684 fallback). - Avoid races in MPTCP-level window tracking, stabilize and improve throughput. - Support lockless operation of GRE tunnels with seq numbers enabled. - WiFi support for host based BSS color collision detection. - Add support for SO_TXTIME/SCM_TXTIME on CAN sockets. - Support transmission w/o flow control in CAN ISOTP (ISO 15765-2). - Support zero-copy Tx with TLS 1.2 crypto offload (sendfile). - Allow matching on the number of VLAN tags via tc-flower. - Add tracepoint for tcp_set_ca_state(). Driver API ---------- - Improve error reporting from classifier and action offload. - Add support for listing line cards in switches (devlink). - Add helpers for reporting page pool statistics with ethtool -S. - Add support for reading clock cycles when using PTP virtual clocks, instead of having the driver convert to time before reporting. This makes it possible to report time from different vclocks. - Support configuring low-latency Tx descriptor push via ethtool. - Separate Clause 22 and Clause 45 MDIO accesses more explicitly. New hardware / drivers ---------------------- - Ethernet: - Marvell's Octeon NIC PCI Endpoint support (octeon_ep) - Sunplus SP7021 SoC (sp7021_emac) - Add support for Renesas RZ/V2M (in ravb) - Add support for MediaTek mt7986 switches (in mtk_eth_soc) - Ethernet PHYs: - ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting) - TI DP83TD510 PHY - Microchip LAN8742/LAN88xx PHYs - WiFi: - Driver for pureLiFi X, XL, XC devices (plfxlc) - Driver for Silicon Labs devices (wfx) - Support for WCN6750 (in ath11k) - Support Realtek 8852ce devices (in rtw89) - Mobile: - MediaTek T700 modems (Intel 5G 5000 M.2 cards) - CAN: - ctucanfd: add support for CTU CAN FD open-source IP core from Czech Technical University in Prague Drivers ------- - Delete a number of old drivers still using virt_to_bus(). - Ethernet NICs: - intel: support TSO on tunnels MPLS - broadcom: support multi-buffer XDP - nfp: support VF rate limiting - sfc: use hardware tx timestamps for more than PTP - mlx5: multi-port eswitch support - hyper-v: add support for XDP_REDIRECT - atlantic: XDP support (including multi-buffer) - macb: improve real-time perf by deferring Tx processing to NAPI - High-speed Ethernet switches: - mlxsw: implement basic line card information querying - prestera: add support for traffic policing on ingress and egress - Embedded Ethernet switches: - lan966x: add support for packet DMA (FDMA) - lan966x: add support for PTP programmable pins - ti: cpsw_new: enable bc/mc storm prevention - Qualcomm 802.11ax WiFi (ath11k): - Wake-on-WLAN support for QCA6390 and WCN6855 - device recovery (firmware restart) support - support setting Specific Absorption Rate (SAR) for WCN6855 - read country code from SMBIOS for WCN6855/QCA6390 - enable keep-alive during WoWLAN suspend - implement remain-on-channel support - MediaTek WiFi (mt76): - support Wireless Ethernet Dispatch offloading packet movement between the Ethernet switch and WiFi interfaces - non-standard VHT MCS10-11 support - mt7921 AP mode support - mt7921 IPv6 NS offload support - Ethernet PHYs: - micrel: ksz9031/ksz9131: cabletest support - lan87xx: SQI support for T1 PHYs - lan937x: add interrupt support for link detection" * tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1809 commits) ptp: ocp: Add firmware header checks ptp: ocp: fix PPS source selector debugfs reporting ptp: ocp: add .init function for sma_op vector ptp: ocp: vectorize the sma accessor functions ptp: ocp: constify selectors ptp: ocp: parameterize input/output sma selectors ptp: ocp: revise firmware display ptp: ocp: add Celestica timecard PCI ids ptp: ocp: Remove #ifdefs around PCI IDs ptp: ocp: 32-bit fixups for pci start address Revert "net/smc: fix listen processing for SMC-Rv2" ath6kl: Use cc-disable-warning to disable -Wdangling-pointer selftests/bpf: Dynptr tests bpf: Add dynptr data slices bpf: Add bpf_dynptr_read and bpf_dynptr_write bpf: Dynptr support for ring buffers bpf: Add bpf_dynptr_from_mem for local dynptrs bpf: Add verifier support for dynptrs bpf: Suppress 'passing zero to PTR_ERR' warning bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack ...
2022-05-18random32: use real rng for non-deterministic randomnessJason A. Donenfeld
random32.c has two random number generators in it: one that is meant to be used deterministically, with some predefined seed, and one that does the same exact thing as random.c, except does it poorly. The first one has some use cases. The second one no longer does and can be replaced with calls to random.c's proper random number generator. The relatively recent siphash-based bad random32.c code was added in response to concerns that the prior random32.c was too deterministic. Out of fears that random.c was (at the time) too slow, this code was anonymously contributed. Then out of that emerged a kind of shadow entropy gathering system, with its own tentacles throughout various net code, added willy nilly. Stop👏making👏bespoke👏random👏number👏generators👏. Fortunately, recent advances in random.c mean that we can stop playing with this sketchiness, and just use get_random_u32(), which is now fast enough. In micro benchmarks using RDPMC, I'm seeing the same median cycle count between the two functions, with the mean being _slightly_ higher due to batches refilling (which we can optimize further need be). However, when doing *real* benchmarks of the net functions that actually use these random numbers, the mean cycles actually *decreased* slightly (with the median still staying the same), likely because the additional prandom code means icache misses and complexity, whereas random.c is generally already being used by something else nearby. The biggest benefit of this is that there are many users of prandom who probably should be using cryptographically secure random numbers. This makes all of those accidental cases become secure by just flipping a switch. Later on, we can do a tree-wide cleanup to remove the static inline wrapper functions that this commit adds. There are also some low-ish hanging fruits for making this even faster in the future: a get_random_u16() function for use in the networking stack will give a 2x performance boost there, using SIMD for ChaCha20 will let us compute 4 or 8 or 16 blocks of output in parallel, instead of just one, giving us large buffers for cheap, and introducing a get_random_*_bh() function that assumes irqs are already disabled will shave off a few cycles for ordinary calls. These are things we can chip away at down the road. Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-04memcg: accounting for objects allocated for new netdeviceVasily Averin
Creating a new netdevice allocates at least ~50Kb of memory for various kernel objects, but only ~5Kb of them are accounted to memcg. As a result, creating an unlimited number of netdevice inside a memcg-limited container does not fall within memcg restrictions, consumes a significant part of the host's memory, can cause global OOM and lead to random kills of host processes. The main consumers of non-accounted memory are: ~10Kb 80+ kernfs nodes ~6Kb ipv6_add_dev() allocations 6Kb __register_sysctl_table() allocations 4Kb neigh_sysctl_register() allocations 4Kb __devinet_sysctl_register() allocations 4Kb __addrconf_sysctl_register() allocations Accounting of these objects allows to increase the share of memcg-related memory up to 60-70% (~38Kb accounted vs ~54Kb total for dummy netdevice on typical VM with default Fedora 35 kernel) and this should be enough to somehow protect the host from misuse inside container. Other related objects are quite small and may not be taken into account to minimize the expected performance degradation. It should be separately mentonied ~300 bytes of percpu allocation of struct ipstats_mib in snmp6_alloc_dev(), on huge multi-cpu nodes it can become the main consumer of memory. This patch does not enables kernfs accounting as it affects other parts of the kernel and should be discussed separately. However, even without kernfs, this patch significantly improves the current situation and allows to take into account more than half of all netdevice allocations. Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/354a0a5f-9ec3-a25c-3215-304eab2157bc@openvz.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-02ipv6: Don't send rs packets to the interface of ARPHRD_TUNNELjianghaoran
ARPHRD_TUNNEL interface can't process rs packets and will generate TX errors ex: ip tunnel add ethn mode ipip local 192.168.1.1 remote 192.168.1.2 ifconfig ethn x.x.x.x ethn: flags=209<UP,POINTOPOINT,RUNNING,NOARP> mtu 1480 inet x.x.x.x netmask 255.255.255.255 destination x.x.x.x inet6 fe80::5efe:ac1e:3cdb prefixlen 64 scopeid 0x20<link> tunnel txqueuelen 1000 (IPIP Tunnel) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 3 dropped 0 overruns 0 carrier 0 collisions 0 Signed-off-by: jianghaoran <jianghaoran@kylinos.cn> Link: https://lore.kernel.org/r/20220429053802.246681-1-jianghaoran@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-22net/ipv6: Enforce limits for accept_unsolicited_na sysctlArun Ajith S
Fix mistake in the original patch where limits were specified but the handler didn't take care of the limits. Signed-off-by: Arun Ajith S <aajith@arista.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17net/ipv6: Introduce accept_unsolicited_na knob to implement router-side ↵Arun Ajith S
changes for RFC9131 Add a new neighbour cache entry in STALE state for routers on receiving an unsolicited (gratuitous) neighbour advertisement with target link-layer-address option specified. This is similar to the arp_accept configuration for IPv4. A new sysctl endpoint is created to turn on this behaviour: /proc/sys/net/ipv6/conf/interface/accept_unsolicited_na. Signed-off-by: Arun Ajith S <aajith@arista.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-06ipv6: fix locking issues with loops over idev->addr_listNiels Dossche
idev->addr_list needs to be protected by idev->lock. However, it is not always possible to do so while iterating and performing actions on inet6_ifaddr instances. For example, multiple functions (like addrconf_{join,leave}_anycast) eventually call down to other functions that acquire the idev->lock. The current code temporarily unlocked the idev->lock during the loops, which can cause race conditions. Moving the locks up is also not an appropriate solution as the ordering of lock acquisition will be inconsistent with for example mc_lock. This solution adds an additional field to inet6_ifaddr that is used to temporarily add the instances to a temporary list while holding idev->lock. The temporary list can then be traversed without holding idev->lock. This change was done in two places. In addrconf_ifdown, the list_for_each_entry_safe variant of the list loop is also no longer necessary as there is no deletion within that specific loop. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220403231523.45843-1-dossche.niels@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
net/batman-adv/hard-interface.c commit 690bb6fb64f5 ("batman-adv: Request iflink once in batadv-on-batadv check") commit 6ee3c393eeb7 ("batman-adv: Demote batadv-on-batadv skip error message") https://lore.kernel.org/all/20220302163049.101957-1-sw@simonwunderlich.de/ net/smc/af_smc.c commit 4d08b7b57ece ("net/smc: Fix cleanup when register ULP fails") commit 462791bbfa35 ("net/smc: add sysctl interface for SMC") https://lore.kernel.org/all/20220302112209.355def40@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-28net: ipv6: ensure we call ipv6_mc_down() at most oncej.nixdorf@avm.de
There are two reasons for addrconf_notify() to be called with NETDEV_DOWN: either the network device is actually going down, or IPv6 was disabled on the interface. If either of them stays down while the other is toggled, we repeatedly call the code for NETDEV_DOWN, including ipv6_mc_down(), while never calling the corresponding ipv6_mc_up() in between. This will cause a new entry in idev->mc_tomb to be allocated for each multicast group the interface is subscribed to, which in turn leaks one struct ifmcaddr6 per nontrivial multicast group the interface is subscribed to. The following reproducer will leak at least $n objects: ip addr add ff2e::4242/32 dev eth0 autojoin sysctl -w net.ipv6.conf.eth0.disable_ipv6=1 for i in $(seq 1 $n); do ip link set up eth0; ip link set down eth0 done Joining groups with IPV6_ADD_MEMBERSHIP (unprivileged) or setting the sysctl net.ipv6.conf.eth0.forwarding to 1 (=> subscribing to ff02::2) can also be used to create a nontrivial idev->mc_list, which will the leak objects with the right up-down-sequence. Based on both sources for NETDEV_DOWN events the interface IPv6 state should be considered: - not ready if the network interface is not ready OR IPv6 is disabled for it - ready if the network interface is ready AND IPv6 is enabled for it The functions ipv6_mc_up() and ipv6_down() should only be run when this state changes. Implement this by remembering when the IPv6 state is ready, and only run ipv6_mc_down() if it actually changed from ready to not ready. The other direction (not ready -> ready) already works correctly, as: - the interface notification triggered codepath for NETDEV_UP / NETDEV_CHANGE returns early if ipv6 is disabled, and - the disable_ipv6=0 triggered codepath skips fully initializing the interface as long as addrconf_link_ready(dev) returns false - calling ipv6_mc_up() repeatedly does not leak anything Fixes: 3ce62a84d53c ("ipv6: exit early in addrconf_notify() if IPv6 is disabled") Signed-off-by: Johannes Nixdorf <j.nixdorf@avm.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
tools/testing/selftests/net/mptcp/mptcp_join.sh 34aa6e3bccd8 ("selftests: mptcp: add ip mptcp wrappers") 857898eb4b28 ("selftests: mptcp: add missing join check") 6ef84b1517e0 ("selftests: mptcp: more robust signal race test") https://lore.kernel.org/all/20220221131842.468893-1-broonie@kernel.org/ drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/act.h drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c fb7e76ea3f3b6 ("net/mlx5e: TC, Skip redundant ct clear actions") c63741b426e11 ("net/mlx5e: Fix MPLSoUDP encap to use MPLS action information") 09bf97923224f ("net/mlx5e: TC, Move pedit_headers_action to parse_attr") 84ba8062e383 ("net/mlx5e: Test CT and SAMPLE on flow attr") efe6f961cd2e ("net/mlx5e: CT, Don't set flow flag CT for ct clear flow") 3b49a7edec1d ("net/mlx5e: TC, Reject rules with multiple CT actions") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-24ipv6: prevent a possible race condition with lifetimesNiels Dossche
valid_lft, prefered_lft and tstamp are always accessed under the lock "lock" in other places. Reading these without taking the lock may result in inconsistencies regarding the calculation of the valid and preferred variables since decisions are taken on these fields for those variables. Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Niels Dossche <niels.dossche@ugent.be> Link: https://lore.kernel.org/r/20220223131954.6570-1-niels.dossche@ugent.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-18net: Add new protocol attribute to IP addressesJacques de Laval
This patch adds a new protocol attribute to IPv4 and IPv6 addresses. Inspiration was taken from the protocol attribute of routes. User space applications like iproute2 can set/get the protocol with the Netlink API. The attribute is stored as an 8-bit unsigned integer. The protocol attribute is set by kernel for these categories: - IPv4 and IPv6 loopback addresses - IPv6 addresses generated from router announcements - IPv6 link local addresses User space may pass custom protocols, not defined by the kernel. Grouping addresses on their origin is useful in scenarios where you want to distinguish between addresses based on who added them, e.g. kernel vs. user space. Tagging addresses with a string label is an existing feature that could be used as a solution. Unfortunately the max length of a label is 15 characters, and for compatibility reasons the label must be prefixed with the name of the device followed by a colon. Since device names also have a max length of 15 characters, only -1 characters is guaranteed to be available for any origin tag, which is not that much. A reference implementation of user space setting and getting protocols is available for iproute2: https://github.com/westermo/iproute2/commit/9a6ea18bd79f47f293e5edc7780f315ea42ff540 Signed-off-by: Jacques de Laval <Jacques.De.Laval@westermo.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220217150202.80802-1-Jacques.De.Laval@westermo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-17ipv6/addrconf: ensure addrconf_verify_rtnl() has completedEric Dumazet
Before freeing the hash table in addrconf_exit_net(), we need to make sure the work queue has completed, or risk NULL dereference or UAF. Thus, use cancel_delayed_work_sync() to enforce this. We do not hold RTNL in addrconf_exit_net(), making this safe. Fixes: 8805d13ff1b2 ("ipv6/addrconf: use one delayed work per netns") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220216182037.3742-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-14ipv6: blackhole_netdev needs snmp6 countersIdo Schimmel
Whenever rt6_uncached_list_flush_dev() swaps rt->rt6_idev to the blackhole device, parts of IPv6 stack might still need to increment one SNMP counter. Root cause, patch from Ido, changelog from Eric :) This bug suggests that we need to audit rt->rt6_idev usages and make sure they are properly using RCU protection. Fixes: e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14ipv6: mcast: use rcu-safe version of ipv6_get_lladdr()Ignat Korchagin
Some time ago 8965779d2c0e ("ipv6,mcast: always hold idev->lock before mca_lock") switched ipv6_get_lladdr() to __ipv6_get_lladdr(), which is rcu-unsafe version. That was OK, because idev->lock was held for these codepaths. In 88e2ca308094 ("mld: convert ifmcaddr6 to RCU") these external locks were removed, so we probably need to restore the original rcu-safe call. Otherwise, we occasionally get a machine crashed/stalled with the following in dmesg: [ 3405.966610][T230589] general protection fault, probably for non-canonical address 0xdead00000000008c: 0000 [#1] SMP NOPTI [ 3405.982083][T230589] CPU: 44 PID: 230589 Comm: kworker/44:3 Tainted: G O 5.15.19-cloudflare-2022.2.1 #1 [ 3405.998061][T230589] Hardware name: SUPA-COOL-SERV [ 3406.009552][T230589] Workqueue: mld mld_ifc_work [ 3406.017224][T230589] RIP: 0010:__ipv6_get_lladdr+0x34/0x60 [ 3406.025780][T230589] Code: 57 10 48 83 c7 08 48 89 e5 48 39 d7 74 3e 48 8d 82 38 ff ff ff eb 13 48 8b 90 d0 00 00 00 48 8d 82 38 ff ff ff 48 39 d7 74 22 <66> 83 78 32 20 77 1b 75 e4 89 ca 23 50 2c 75 dd 48 8b 50 08 48 8b [ 3406.055748][T230589] RSP: 0018:ffff94e4b3fc3d10 EFLAGS: 00010202 [ 3406.065617][T230589] RAX: dead00000000005a RBX: ffff94e4b3fc3d30 RCX: 0000000000000040 [ 3406.077477][T230589] RDX: dead000000000122 RSI: ffff94e4b3fc3d30 RDI: ffff8c3a31431008 [ 3406.089389][T230589] RBP: ffff94e4b3fc3d10 R08: 0000000000000000 R09: 0000000000000000 [ 3406.101445][T230589] R10: ffff8c3a31430000 R11: 000000000000000b R12: ffff8c2c37887100 [ 3406.113553][T230589] R13: ffff8c3a39537000 R14: 00000000000005dc R15: ffff8c3a31431000 [ 3406.125730][T230589] FS: 0000000000000000(0000) GS:ffff8c3b9fc80000(0000) knlGS:0000000000000000 [ 3406.138992][T230589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3406.149895][T230589] CR2: 00007f0dfea1db60 CR3: 000000387b5f2000 CR4: 0000000000350ee0 [ 3406.162421][T230589] Call Trace: [ 3406.170235][T230589] <TASK> [ 3406.177736][T230589] mld_newpack+0xfe/0x1a0 [ 3406.186686][T230589] add_grhead+0x87/0xa0 [ 3406.195498][T230589] add_grec+0x485/0x4e0 [ 3406.204310][T230589] ? newidle_balance+0x126/0x3f0 [ 3406.214024][T230589] mld_ifc_work+0x15d/0x450 [ 3406.223279][T230589] process_one_work+0x1e6/0x380 [ 3406.232982][T230589] worker_thread+0x50/0x3a0 [ 3406.242371][T230589] ? rescuer_thread+0x360/0x360 [ 3406.252175][T230589] kthread+0x127/0x150 [ 3406.261197][T230589] ? set_kthread_struct+0x40/0x40 [ 3406.271287][T230589] ret_from_fork+0x22/0x30 [ 3406.280812][T230589] </TASK> [ 3406.288937][T230589] Modules linked in: ... [last unloaded: kheaders] [ 3406.476714][T230589] ---[ end trace 3525a7655f2f3b9e ]--- Fixes: 88e2ca308094 ("mld: convert ifmcaddr6 to RCU") Reported-by: David Pinilla Caparros <dpini@cloudflare.com> Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-11ipv6: give an IPv6 dev to blackhole_netdevEric Dumazet
IPv6 addrconf notifiers wants the loopback device to be the last device being dismantled at netns deletion. This caused many limitations and work arounds. Back in linux-5.3, Mahesh added a per host blackhole_netdev that can be used whenever we need to make sure objects no longer refer to a disappearing device. If we attach to blackhole_netdev an ip6_ptr (allocate an idev), then we can use this special device (which is never freed) in place of the loopback_dev (which can be freed). This will permit improvements in netdev_run_todo() and other parts of the stack where had steps to make sure loopback_dev was the last device to disappear. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-08ipv6/addrconf: switch to per netns inet6_addr_lst hash tableEric Dumazet
IPv6 does not scale very well with the number of IPv6 addresses. It uses a global (shared by all netns) hash table with 256 buckets. Some functions like addrconf_verify_rtnl() and addrconf_ifdown() have to iterate all addresses in the hash table. I have seen addrconf_verify_rtnl() holding the cpu for 10ms or more. Switch to the per netns hashtable (and spinlock) added in prior patches. This considerably speeds up netns dismantle times on hosts with thousands of netns. This also has an impact on regular (fast path) IPv6 processing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08ipv6/addrconf: use one delayed work per netnsEric Dumazet
Next step for using per netns inet6_addr_lst is to have per netns work item to ultimately call addrconf_verify_rtnl() and addrconf_verify() with a new 'struct net*' argument. Everything is still using the global inet6_addr_lst[] table. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08ipv6/addrconf: allocate a per netns hash tableEric Dumazet
Add a per netns hash table and a dedicated spinlock, first step to get rid of the global inet6_addr_lst[] one. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-07ip6mr: fix use-after-free in ip6mr_sk_done()Eric Dumazet
Apparently addrconf_exit_net() is called before igmp6_net_exit() and ndisc_net_exit() at netns dismantle time: net_namespace: call ip6table_mangle_net_exit() net_namespace: call ip6_tables_net_exit() net_namespace: call ipv6_sysctl_net_exit() net_namespace: call ioam6_net_exit() net_namespace: call seg6_net_exit() net_namespace: call ping_v6_proc_exit_net() net_namespace: call tcpv6_net_exit() ip6mr_sk_done sk=ffffa354c78a74c0 net_namespace: call ipv6_frags_exit_net() net_namespace: call addrconf_exit_net() net_namespace: call ip6addrlbl_net_exit() net_namespace: call ip6_flowlabel_net_exit() net_namespace: call ip6_route_net_exit_late() net_namespace: call fib6_rules_net_exit() net_namespace: call xfrm6_net_exit() net_namespace: call fib6_net_exit() net_namespace: call ip6_route_net_exit() net_namespace: call ipv6_inetpeer_exit() net_namespace: call if6_proc_net_exit() net_namespace: call ipv6_proc_exit_net() net_namespace: call udplite6_proc_exit_net() net_namespace: call raw6_exit_net() net_namespace: call igmp6_net_exit() ip6mr_sk_done sk=ffffa35472b2a180 ip6mr_sk_done sk=ffffa354c78a7980 net_namespace: call ndisc_net_exit() ip6mr_sk_done sk=ffffa35472b2ab00 net_namespace: call ip6mr_net_exit() net_namespace: call inet6_net_exit() This was fine because ip6mr_sk_done() would not reach the point decreasing net->ipv6.devconf_all->mc_forwarding until my patch in ip6mr_sk_done(). To fix this without changing struct pernet_operations ordering, we can clear net->ipv6.devconf_dflt and net->ipv6.devconf_all when they are freed from addrconf_exit_net() BUG: KASAN: use-after-free in instrument_atomic_read include/linux/instrumented.h:71 [inline] BUG: KASAN: use-after-free in atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline] BUG: KASAN: use-after-free in ip6mr_sk_done+0x11b/0x410 net/ipv6/ip6mr.c:1578 Read of size 4 at addr ffff88801ff08688 by task kworker/u4:4/963 CPU: 0 PID: 963 Comm: kworker/u4:4 Not tainted 5.17.0-rc2-syzkaller-00650-g5a8fb33e5305 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: netns cleanup_net Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255 __kasan_report mm/kasan/report.c:442 [inline] kasan_report.cold+0x83/0xdf mm/kasan/report.c:459 check_region_inline mm/kasan/generic.c:183 [inline] kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189 instrument_atomic_read include/linux/instrumented.h:71 [inline] atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline] ip6mr_sk_done+0x11b/0x410 net/ipv6/ip6mr.c:1578 rawv6_close+0x58/0x80 net/ipv6/raw.c:1201 inet_release+0x12e/0x280 net/ipv4/af_inet.c:428 inet6_release+0x4c/0x70 net/ipv6/af_inet6.c:478 __sock_release net/socket.c:650 [inline] sock_release+0x87/0x1b0 net/socket.c:678 inet_ctl_sock_destroy include/net/inet_common.h:65 [inline] igmp6_net_exit+0x6b/0x170 net/ipv6/mcast.c:3173 ops_exit_list+0xb0/0x170 net/core/net_namespace.c:168 cleanup_net+0x4ea/0xb00 net/core/net_namespace.c:600 process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307 worker_thread+0x657/0x1110 kernel/workqueue.c:2454 kthread+0x2e9/0x3a0 kernel/kthread.c:377 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> Fixes: f2f2325ec799 ("ip6mr: ip6mr_sk_done() can exit early in common cases") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-05ipv6: make mc_forwarding atomicEric Dumazet
This fixes minor data-races in ip6_mc_input() and batadv_mcast_mla_rtr_flags_softif_get_ipv6() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-27Revert "ipv6: Honor all IPv6 PIO Valid Lifetime values"Guillaume Nault
This reverts commit b75326c201242de9495ff98e5d5cff41d7fc0d9d. This commit breaks Linux compatibility with USGv6 tests. The RFC this commit was based on is actually an expired draft: no published RFC currently allows the new behaviour it introduced. Without full IETF endorsement, the flash renumbering scenario this patch was supposed to enable is never going to work, as other IPv6 equipements on the same LAN will keep the 2 hours limit. Fixes: b75326c20124 ("ipv6: Honor all IPv6 PIO Valid Lifetime values") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-06ipv6: add net device refcount tracker to struct inet6_devEric Dumazet
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-01net: ndisc: introduce ndisc_evict_nocarrier sysctl parameterJames Prestwood
In most situations the neighbor discovery cache should be cleared on a NOCARRIER event which is currently done unconditionally. But for wireless roams the neighbor discovery cache can and should remain intact since the underlying network has not changed. This patch introduces a sysctl option ndisc_evict_nocarrier which can be disabled by a wireless supplicant during a roam. This allows packets to be sent after a roam immediately without having to wait for neighbor discovery. A user reported roughly a 1 second delay after a roam before packets could be sent out (note, on IPv4). This delay was due to the ARP cache being cleared. During testing of this same scenario using IPv6 no delay was noticed, but regardless there is no reason to clear the ndisc cache for wireless roams. Signed-off-by: James Prestwood <prestwoj@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-22gre/sit: Don't generate link-local addr if addr_gen_mode is ↵Stephen Suryaputra
IN6_ADDR_GEN_MODE_NONE When addr_gen_mode is set to IN6_ADDR_GEN_MODE_NONE, the link-local addr should not be generated. But it isn't the case for GRE (as well as GRE6) and SIT tunnels. Make it so that tunnels consider the addr_gen_mode, especially for IN6_ADDR_GEN_MODE_NONE. Do this in add_v4_addrs() to cover both GRE and SIT only if the addr scope is link. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Acked-by: Antonio Quartulli <a@unstable.cc> Link: https://lore.kernel.org/r/20211020200618.467342-1-ssuryaextr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-13ipv6: constify dev_addr passingJakub Kicinski
In preparation for netdev->dev_addr being constant make all relevant arguments in ndisc constant. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-05ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL addressAntonio Quartulli
GRE interfaces are not Ether-like and therefore it is not possible to generate the v6LL address the same way as (for example) GRETAP devices. With default settings, a GRE interface will attempt generating its v6LL address using the EUI64 approach, but this will fail when the local endpoint of the GRE tunnel is set to "any". In this case the GRE interface will end up with no v6LL address, thus violating RFC4291. SIT interfaces already implement a different logic to ensure that a v6LL address is always computed. Change the GRE v6LL generation logic to follow the same approach as SIT. This way GRE interfaces will always have a v6LL address as well. Behaviour of GRETAP interfaces has not been changed as they behave like classic Ether-like interfaces. To avoid code duplication sit_add_v4_addrs() has been renamed to add_v4_addrs() and adapted to handle also the IP6GRE/GRE cases. Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-27ipv6: add IFLA_INET6_RA_MTU to expose mtu valueRocco Yue
The kernel provides a "/proc/sys/net/ipv6/conf/<iface>/mtu" file, which can temporarily record the mtu value of the last received RA message when the RA mtu value is lower than the interface mtu, but this proc has following limitations: (1) when the interface mtu (/sys/class/net/<iface>/mtu) is updeated, mtu6 (/proc/sys/net/ipv6/conf/<iface>/mtu) will be updated to the value of interface mtu; (2) mtu6 (/proc/sys/net/ipv6/conf/<iface>/mtu) only affect ipv6 connection, and not affect ipv4. Therefore, when the mtu option is carried in the RA message, there will be a problem that the user sometimes cannot obtain RA mtu value correctly by reading mtu6. After this patch set, if a RA message carries the mtu option, you can send a netlink msg which nlmsg_type is RTM_GETLINK, and then by parsing the attribute of IFLA_INET6_RA_MTU to get the mtu value carried in the RA message received on the inet6 device. In addition, you can also get a link notification when ra_mtu is updated so it doesn't have to poll. In this way, if the MTU values that the device receives from the network in the PCO IPv4 and the RA IPv6 procedures are different, the user can obtain the correct ipv6 ra_mtu value and compare the value of ra_mtu and ipv4 mtu, then the device can use the lower MTU value for both IPv4 and IPv6. Signed-off-by: Rocco Yue <rocco.yue@mediatek.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20210827150412.9267-1-rocco.yue@mediatek.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-05net: Remove redundant if statementsYajun Deng
The 'if (dev)' statement already move into dev_{put , hold}, so remove redundant if statements. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: add extack arg for link opsRocco Yue
Pass extack arg to validate_linkmsg and validate_link_af callbacks. If a netlink attribute has a reject_message, use the extended ack mechanism to carry the message back to user space. Signed-off-by: Rocco Yue <rocco.yue@mediatek.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22ipv6: fix "'ioam6_if_id_max' defined but not used" warnMatthieu Baerts
When compiling without CONFIG_SYSCTL, this warning appears: net/ipv6/addrconf.c:99:12: error: 'ioam6_if_id_max' defined but not used [-Werror=unused-variable] 99 | static u32 ioam6_if_id_max = U16_MAX; | ^~~~~~~~~~~~~~~ cc1: all warnings being treated as errors Simply moving the declaration of this variable under ... #ifdef CONFIG_SYSCTL ... with other similar variables fixes the issue. Fixes: 9ee11f0fff20 ("ipv6: ioam: Data plane support for Pre-allocated Trace") Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-21ipv6: ioam: Data plane support for Pre-allocated TraceJustin Iurman
Implement support for processing the IOAM Pre-allocated Trace with IPv6, see [1] and [2]. Introduce a new IPv6 Hop-by-Hop TLV option, see IANA [3]. A new per-interface sysctl is introduced. The value is a boolean to accept (=1) or ignore (=0, by default) IPv6 IOAM options on ingress for an interface: - net.ipv6.conf.XXX.ioam6_enabled Two other sysctls are introduced to define IOAM IDs, represented by an integer. They are respectively per-namespace and per-interface: - net.ipv6.ioam6_id - net.ipv6.conf.XXX.ioam6_id The value of the first one represents the IOAM ID of the node itself (u32; max and default value = U32_MAX>>8, due to hop limit concatenation) while the other represents the IOAM ID of an interface (u16; max and default value = U16_MAX). Each "ioam6_id" sysctl has a "_wide" equivalent: - net.ipv6.ioam6_id_wide - net.ipv6.conf.XXX.ioam6_id_wide The value of the first one represents the wide IOAM ID of the node itself (u64; max and default value = U64_MAX>>8, due to hop limit concatenation) while the other represents the wide IOAM ID of an interface (u32; max and default value = U32_MAX). The use of short and wide equivalents is not exclusive, a deployment could choose to leverage both. For example, net.ipv6.conf.XXX.ioam6_id (short format) could be an identifier for a physical interface, whereas net.ipv6.conf.XXX.ioam6_id_wide (wide format) could be an identifier for a logical sub-interface. Documentation about new sysctls is provided at the end of this patchset. Two relativistic hash tables are used: one for IOAM namespaces, the other for IOAM schemas. A namespace can only have a single active schema and a schema can only be attached to a single namespace (1:1 relationship). [1] https://tools.ietf.org/html/draft-ietf-ippm-ioam-ipv6-options [2] https://tools.ietf.org/html/draft-ietf-ippm-ioam-data [3] https://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xhtml#ipv6-parameters-2 Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20memcg: enable accounting for IP address and routing-related objectsVasily Averin
An netadmin inside container can use 'ip a a' and 'ip r a' to assign a large number of ipv4/ipv6 addresses and routing entries and force kernel to allocate megabytes of unaccounted memory for long-lived per-netdevice related kernel objects: 'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node', 'struct rt6_info', 'struct fib_rules' and ip_fib caches. These objects can be manually removed, though usually they lives in memory till destroy of its net namespace. It makes sense to account for them to restrict the host's memory consumption from inside the memcg-limited container. One of such objects is the 'struct fib6_node' mostly allocated in net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section: write_lock_bh(&table->tb6_lock); err = fib6_add(&table->tb6_root, rt, info, mxc); write_unlock_bh(&table->tb6_lock); In this case it is not enough to simply add SLAB_ACCOUNT to corresponding kmem cache. The proper memory cgroup still cannot be found due to the incorrect 'in_interrupt()' check used in memcg_kmem_bypass(). Obsoleted in_interrupt() does not describe real execution context properly. >From include/linux/preempt.h: The following macros are deprecated and should not be used in new code: in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled To verify the current execution context new macro should be used instead: in_task() - We're in task context Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15ipv6: remove unnecessary local variableRocco Yue
The local variable "struct net *net" in the two functions of inet6_rtm_getaddr() and inet6_dump_addr() are actually useless, so remove them. Signed-off-by: Rocco Yue <rocco.yue@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Trivial conflicts in net/can/isotp.c and tools/testing/selftests/net/mptcp/mptcp_connect.sh scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c to include/linux/ptp_clock_kernel.h in -next so re-apply the fix there. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-08net: ipv4: Remove unneed BUG() functionZheng Yongjun
When 'nla_parse_nested_deprecated' failed, it's no need to BUG() here, return -EINVAL is ok. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-31ipv6: align code with contextRocco Yue
The Tab key is used three times, causing the code block to be out of alignment with the context. Signed-off-by: Rocco Yue <rocco.yue@mediatek.com> Link: https://lore.kernel.org/r/20210530113811.8817-1-rocco.yue@mediatek.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-16mld: fix suspicious RCU usage in __ipv6_dev_mc_dec()Taehee Yoo
__ipv6_dev_mc_dec() internally uses sleepable functions so that caller must not acquire atomic locks. But caller, which is addrconf_verify_rtnl() acquires rcu_read_lock_bh(). So this warning occurs in the __ipv6_dev_mc_dec(). Test commands: ip netns add A ip link add veth0 type veth peer name veth1 ip link set veth1 netns A ip link set veth0 up ip netns exec A ip link set veth1 up ip a a 2001:db8::1/64 dev veth0 valid_lft 2 preferred_lft 1 Splat looks like: ============================ WARNING: suspicious RCU usage 5.12.0-rc6+ #515 Not tainted ----------------------------- kernel/sched/core.c:8294 Illegal context switch in RCU-bh read-side critical section! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 4 locks held by kworker/4:0/1997: #0: ffff88810bd72d48 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: process_one_work+0x761/0x1440 #1: ffff888105c8fe00 ((addr_chk_work).work){+.+.}-{0:0}, at: process_one_work+0x795/0x1440 #2: ffffffffb9279fb0 (rtnl_mutex){+.+.}-{3:3}, at: addrconf_verify_work+0xa/0x20 #3: ffffffffb8e30860 (rcu_read_lock_bh){....}-{1:2}, at: addrconf_verify_rtnl+0x23/0xc60 stack backtrace: CPU: 4 PID: 1997 Comm: kworker/4:0 Not tainted 5.12.0-rc6+ #515 Workqueue: ipv6_addrconf addrconf_verify_work Call Trace: dump_stack+0xa4/0xe5 ___might_sleep+0x27d/0x2b0 __mutex_lock+0xc8/0x13f0 ? lock_downgrade+0x690/0x690 ? __ipv6_dev_mc_dec+0x49/0x2a0 ? mark_held_locks+0xb7/0x120 ? mutex_lock_io_nested+0x1270/0x1270 ? lockdep_hardirqs_on_prepare+0x12c/0x3e0 ? _raw_spin_unlock_irqrestore+0x47/0x50 ? trace_hardirqs_on+0x41/0x120 ? __wake_up_common_lock+0xc9/0x100 ? __wake_up_common+0x620/0x620 ? memset+0x1f/0x40 ? netlink_broadcast_filtered+0x2c4/0xa70 ? __ipv6_dev_mc_dec+0x49/0x2a0 __ipv6_dev_mc_dec+0x49/0x2a0 ? netlink_broadcast_filtered+0x2f6/0xa70 addrconf_leave_solict.part.64+0xad/0xf0 ? addrconf_join_solict.part.63+0xf0/0xf0 ? nlmsg_notify+0x63/0x1b0 __ipv6_ifa_notify+0x22c/0x9c0 ? inet6_fill_ifaddr+0xbe0/0xbe0 ? lockdep_hardirqs_on_prepare+0x12c/0x3e0 ? __local_bh_enable_ip+0xa5/0xf0 ? ipv6_del_addr+0x347/0x870 ipv6_del_addr+0x3b1/0x870 ? addrconf_ifdown+0xfe0/0xfe0 ? rcu_read_lock_any_held.part.27+0x20/0x20 addrconf_verify_rtnl+0x8a9/0xc60 addrconf_verify_work+0xf/0x20 process_one_work+0x84c/0x1440 In order to avoid this problem, it uses rcu_read_unlock_bh() for a short time. RCU is used for avoiding freeing ifp(struct *inet6_ifaddr) while ifp is being used. But this will not be released even if rcu_read_unlock_bh() is used. Because before rcu_read_unlock_bh(), it uses in6_ifa_hold(ifp). So this is safe. Fixes: 63ed8de4be81 ("mld: add mc_lock for protecting per-interface mld data") Suggested-by: Eric Dumazet <edumazet@google.com> Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: MAINTAINERS - keep Chandrasekar drivers/net/ethernet/mellanox/mlx5/core/en_main.c - simple fix + trust the code re-added to param.c in -next is fine include/linux/bpf.h - trivial include/linux/ethtool.h - trivial, fix kdoc while at it include/linux/skmsg.h - move to relevant place in tcp.c, comment re-wrapped net/core/skmsg.c - add the sk = sk // sk = NULL around calls net/tipc/crypto.c - trivial Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-08ipv6: report errors for iftoken via netlink extackStephen Hemminger
Setting iftoken can fail for several different reasons but there and there was no report to user as to the cause. Add netlink extended errors to the processing of the request. This requires adding additional argument through rtnl_af_ops set_link_af callback. Reported-by: Hongren Zheng <li@zenithal.me> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-28ipv6: addrconf.c: Fix a typoBhaskar Chowdhury
s/Identifers/Identifiers/ Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26mld: convert ifmcaddr6 to RCUTaehee Yoo
The ifmcaddr6 has been protected by inet6_dev->lock(rwlock) so that the critical section is atomic context. In order to switch this context, changing locking is needed. The ifmcaddr6 actually already protected by RTNL So if it's converted to use RCU, its control path context can be switched to sleepable. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-26net: allow user to set metric on default route learned via Router AdvertisementPraveen Chaudhary
For IPv4, default route is learned via DHCPv4 and user is allowed to change metric using config etc/network/interfaces. But for IPv6, default route can be learned via RA, for which, currently a fixed metric value 1024 is used. Ideally, user should be able to configure metric on default route for IPv6 similar to IPv4. This patch adds sysctl for the same. Logs: For IPv4: Config in etc/network/interfaces: auto eth0 iface eth0 inet dhcp metric 4261413864 IPv4 Kernel Route Table: $ ip route list default via 172.21.47.1 dev eth0 metric 4261413864 FRR Table, if a static route is configured: [In real scenario, it is useful to prefer BGP learned default route over DHCPv4 default route.] Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP, T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP, > - selected route, * - FIB route S>* 0.0.0.0/0 [20/0] is directly connected, eth0, 00:00:03 K 0.0.0.0/0 [254/1000] via 172.21.47.1, eth0, 6d08h51m i.e. User can prefer Default Router learned via Routing Protocol in IPv4. Similar behavior is not possible for IPv6, without this fix. After fix [for IPv6]: sudo sysctl -w net.ipv6.conf.eth0.net.ipv6.conf.eth0.ra_defrtr_metric=1996489705 IP monitor: [When IPv6 RA is received] default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705 pref high Kernel IPv6 routing table $ ip -6 route list default via fe80::be16:65ff:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 21sec hoplimit 64 pref high FRR Table, if a static route is configured: [In real scenario, it is useful to prefer BGP learned default route over IPv6 RA default route.] Codes: K - kernel route, C - connected, S - static, R - RIPng, O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP, > - selected route, * - FIB route S>* ::/0 [20/0] is directly connected, eth0, 00:00:06 K ::/0 [119/1001] via fe80::xx16:xxxx:feb3:ce8e, eth0, 6d07h43m If the metric is changed later, the effect will be seen only when next IPv6 RA is received, because the default route must be fully controlled by RA msg. Below metric is changed from 1996489705 to 1996489704. $ sudo sysctl -w net.ipv6.conf.eth0.ra_defrtr_metric=1996489704 net.ipv6.conf.eth0.ra_defrtr_metric = 1996489704 IP monitor: [On next IPv6 RA msg, Kernel deletes prev route and installs new route with updated metric] Deleted default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 3sec hoplimit 64 pref high default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489704 pref high Signed-off-by: Praveen Chaudhary <pchaudhary@linkedin.com> Signed-off-by: Zhenggen Xu <zxu@linkedin.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20210125214430.24079-1-pchaudhary@linkedin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>