summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2021-08-18net: procfs: add seq_puts() statement for dev_mcastYajun Deng
Add seq_puts() statement for dev_mcast, make it more readable. As also, keep vertical alignment for {dev, ptype, dev_mcast} that under /proc/net. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-14devlink: Clear whole devlink_flash_notify structLeon Romanovsky
The { 0 } doesn't clear all fields in the struct, but tells to the compiler to set all fields to zero and doesn't touch any sub-fields if they exists. The {} is an empty initialiser that instructs to fully initialize whole struct including sub-fields, which is error-prone for future devlink_flash_notify extensions. Fixes: 6700acc5f1fe ("devlink: collect flash notify params into a struct") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-14devlink: Use xarray to store devlink instancesLeon Romanovsky
We can use xarray instead of linearly organized linked lists for the devlink instances. This will let us revise the locking scheme in favour of internal xarray locking that protects database. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-14devlink: Count struct devlink consumersLeon Romanovsky
The struct devlink itself is protected by internal lock and doesn't need global lock during operation. That global lock is used to protect addition/removal new devlink instances from the global list in use by all devlink consumers in the system. The future conversion of linked list to be xarray will allow us to actually delete that lock, but first we need to count all struct devlink users. The reference counting provides us a way to ensure that no new user space commands success to grab devlink instance which is going to be destroyed makes it is safe to access it without lock. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-14devlink: Remove check of always valid devlink pointerLeon Romanovsky
Devlink objects are accessible only after they were registered and have valid devlink_*->devlink pointers. Remove that check and simplify respective fill functions as an outcome of such change. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-14devlink: Simplify devlink_pernet_pre_exit callLeon Romanovsky
The devlink_pernet_pre_exit() will be called if net namespace exits. That routine is relevant for devlink instances that were assigned to that namespaces first. This assignment is possible only with the following command: "devlink reload DEV netns ...", which already checks reload support. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-13net: in_irq() cleanupChangbin Du
Replace the obsolete and ambiguos macro in_irq() with new macro in_hardirq(). Signed-off-by: Changbin Du <changbin.du@gmail.com> Link: https://lore.kernel.org/r/20210813145749.86512-1-changbin.du@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h 9e26680733d5 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp") 9e518f25802c ("bnxt_en: 1PPS functions to configure TSIO pins") 099fdeda659d ("bnxt_en: Event handler for PPS events") kernel/bpf/helpers.c include/linux/bpf-cgroup.h a2baf4e8bb0f ("bpf: Fix potentially incorrect results with bpf_get_local_storage()") c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current") drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c 5957cc557dc5 ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray") 2d0b41a37679 ("net/mlx5: Refcount mlx5_irq with integer") MAINTAINERS 7b637cd52f02 ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo") 7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-12pktgen: Add output for imix resultsNick Richardson
The bps for imix mode is calculated by: sum(imix_entry.size) / time_elapsed The actual counts of each imix_entry are displayed under the "Current:" section of the interface output in the following format: imix_size_counts: size_1,count_1 size_2,count_2 ... size_n,count_n Example (count = 200000): imix_weights: 256,1 859,3 205,2 imix_size_counts: 256,32082 859,99796 205,68122 Result: OK: 17992362(c17964678+d27684) usec, 200000 (859byte,0frags) 11115pps 47Mb/sec (47977140bps) errors: 0 Summary of changes: Calculate bps based on imix counters when in IMIX mode. Add output for IMIX counters. Signed-off-by: Nick Richardson <richardsonnick@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-12pktgen: Add imix distribution binsNick Richardson
In order to represent the distribution of imix packet sizes, a pre-computed data structure is used. It features 100 (IMIX_PRECISION) "bins". Contiguous ranges of these bins represent the respective packet size of each imix entry. This is done to avoid the overhead of selecting the correct imix packet size based on the corresponding weights. Example: imix_weights 40,7 576,4 1500,1 total_weight = 7 + 4 + 1 = 12 pkt_size 40 occurs 7/total_weight = 58% of the time pkt_size 576 occurs 4/total_weight = 33% of the time pkt_size 1500 occurs 1/total_weight = 9% of the time We generate a random number between 0-100 and select the corresponding packet size based on the specified weights. Eg. random number = 358723895 % 100 = 65 Selects the packet size corresponding to index:65 in the pre-computed imix_distribution array. An example of the pre-computed array is below: The imix_distribution will look like the following: 0 -> 0 (index of imix_entry.size == 40) 1 -> 0 (index of imix_entry.size == 40) 2 -> 0 (index of imix_entry.size == 40) [...] -> 0 (index of imix_entry.size == 40) 57 -> 0 (index of imix_entry.size == 40) 58 -> 1 (index of imix_entry.size == 576) [...] -> 1 (index of imix_entry.size == 576) 90 -> 1 (index of imix_entry.size == 576) 91 -> 2 (index of imix_entry.size == 1500) [...] -> 2 (index of imix_entry.size == 1500) 99 -> 2 (index of imix_entry.size == 1500) Create and use "bin" representation of the imix distribution. Signed-off-by: Nick Richardson <richardsonnick@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-12pktgen: Parse internet mix (imix) inputNick Richardson
Adds "imix_weights" command for specifying internet mix distribution. The command is in this format: "imix_weights size_1,weight_1 size_2,weight_2 ... size_n,weight_n" where the probability that packet size_i is picked is: weight_i / (weight_1 + weight_2 + .. + weight_n) The user may provide up to 100 imix entries (size_i,weight_i) in this command. The user specified imix entries will be displayed in the "Params" section of the interface output. Values for clone_skb > 0 is not supported in IMIX mode. Summary of changes: Add flag for enabling internet mix mode. Add command (imix_weights) for internet mix input. Return -ENOTSUPP when clone_skb > 0 in IMIX mode. Display imix_weights in Params. Create data structures to store imix entries and distribution. Signed-off-by: Nick Richardson <richardsonnick@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: linkwatch: fix failure to restore device state across suspend/resumeWilly Tarreau
After migrating my laptop from 4.19-LTS to 5.4-LTS a while ago I noticed that my Ethernet port to which a bond and a VLAN interface are attached appeared to remain up after resuming from suspend with the cable unplugged (and that problem still persists with 5.10-LTS). It happens that the following happens: - the network driver (e1000e here) prepares to suspend, calls e1000e_down() which calls netif_carrier_off() to signal that the link is going down. - netif_carrier_off() adds a link_watch event to the list of events for this device - the device is completely stopped. - the machine suspends - the cable is unplugged and the machine brought to another location - the machine is resumed - the queued linkwatch events are processed for the device - the device doesn't yet have the __LINK_STATE_PRESENT bit and its events are silently dropped - the device is resumed with its link down - the upper VLAN and bond interfaces are never notified that the link had been turned down and remain up - the only way to provoke a change is to physically connect the machine to a port and possibly unplug it. The state after resume looks like this: $ ip -br li | egrep 'bond|eth' bond0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> eth0 DOWN e8:6a:64:64:64:64 <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> eth0.2@eth0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> Placing an explicit call to netdev_state_change() either in the suspend or the resume code in the NIC driver worked around this but the solution is not satisfying. The issue in fact really is in link_watch that loses events while it ought not to. It happens that the test for the device being present was added by commit 124eee3f6955 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev") in 4.20 to avoid an access to devices that are not present. Instead of dropping events, this patch proceeds slightly differently by postponing their handling so that they happen after the device is fully resumed. Fixes: 124eee3f6955 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev") Link: https://lists.openwall.net/netdev/2018/03/15/62 Cc: Heiner Kallweit <hkallweit1@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20210809160628.22623-1-w@1wt.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-11devlink: Add APIs to publish, unpublish individual parameterParav Pandit
Enable drivers to publish/unpublish individual parameter. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11devlink: Add API to register and unregister single parameterParav Pandit
Currently device configuration parameters can be registered as an array. Due to this a constant array must be registered. A single driver supporting multiple devices each with different device capabilities end up registering all parameters even if it doesn't support it. One possible workaround a driver can do is, it registers multiple single entry arrays to overcome such limitation. Better is to provide a API that enables driver to register/unregister a single parameter. This also further helps in two ways. (1) to reduce the memory of devlink_param_entry by avoiding in registering parameters which are not supported by the device. (2) avoid generating multiple parameter add, delete, publish, unpublish, init value notifications for such unsupported parameters Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11devlink: Create a helper function for one parameter registrationParav Pandit
Create and use a helper function for one parameter registration. Subsequent patch also will reuse this for driver facing routine to register a single parameter. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11devlink: Add new "enable_vnet" generic device paramParav Pandit
Add new device generic parameter to enable/disable creation of VDPA net auxiliary device and associated device functionality in the devlink instance. User who prefers to disable such functionality can disable it using below example. $ devlink dev param set pci/0000:06:00.0 \ name enable_vnet value false cmode driverinit $ devlink dev reload pci/0000:06:00.0 At this point devlink instance do not create auxiliary device for the VDPA net functionality. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11devlink: Add new "enable_rdma" generic device paramParav Pandit
Add new device generic parameter to enable/disable creation of RDMA auxiliary device and associated device functionality in the devlink instance. User who prefers to disable such functionality can disable it using below example. $ devlink dev param set pci/0000:06:00.0 \ name enable_rdma value false cmode driverinit $ devlink dev reload pci/0000:06:00.0 At this point devlink instance do not create auxiliary device for the RDMA functionality. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11devlink: Add new "enable_eth" generic device paramParav Pandit
Add new device generic parameter to enable/disable creation of Ethernet auxiliary device and associated device functionality in the devlink instance. User who prefers to disable such functionality can disable it using below example. $ devlink dev param set pci/0000:06:00.0 \ name enable_eth value false cmode driverinit $ devlink dev reload pci/0000:06:00.0 At this point devlink instance do not create auxiliary device for the Ethernet functionality. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-10net: Support filtering interfaces on no masterLahav Schlesinger
Currently there's support for filtering neighbours/links for interfaces which have a specific master device (using the IFLA_MASTER/NDA_MASTER attributes). This patch adds support for filtering interfaces/neighbours dump for interfaces that *don't* have a master. Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20210810090658.2778960-1-lschlesinger@drivenets.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-10Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== bpf-next 2021-08-10 We've added 31 non-merge commits during the last 8 day(s) which contain a total of 28 files changed, 3644 insertions(+), 519 deletions(-). 1) Native XDP support for bonding driver & related BPF selftests, from Jussi Maki. 2) Large batch of new BPF JIT tests for test_bpf.ko that came out as a result from 32-bit MIPS JIT development, from Johan Almbladh. 3) Rewrite of netcnt BPF selftest and merge into test_progs, from Stanislav Fomichev. 4) Fix XDP bpf_prog_test_run infra after net to net-next merge, from Andrii Nakryiko. 5) Follow-up fix in unix_bpf_update_proto() to enforce socket type, from Cong Wang. 6) Fix bpf-iter-tcp4 selftest to print the correct dest IP, from Jose Blanquicet. 7) Various misc BPF XDP sample improvements, from Niklas Söderlund, Matthew Cover, and Muhammad Falak R Wani. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits) bpf, tests: Add tail call test suite bpf, tests: Add tests for BPF_CMPXCHG bpf, tests: Add tests for atomic operations bpf, tests: Add test for 32-bit context pointer argument passing bpf, tests: Add branch conversion JIT test bpf, tests: Add word-order tests for load/store of double words bpf, tests: Add tests for ALU operations implemented with function calls bpf, tests: Add more ALU64 BPF_MUL tests bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64 bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations bpf, tests: Fix typos in test case descriptions bpf, tests: Add BPF_MOV tests for zero and sign extension bpf, tests: Add BPF_JMP32 test cases samples, bpf: Add an explict comment to handle nested vlan tagging. selftests/bpf: Add tests for XDP bonding selftests/bpf: Fix xdp_tx.c prog section name net, core: Allow netdev_lower_get_next_private_rcu in bh context bpf, devmap: Exclude XDP broadcast to master device net, bonding: Add XDP support to the bonding driver ... ==================== Link: https://lore.kernel.org/r/20210810130038.16927-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09page_pool: add frag page recycling support in page poolYunsheng Lin
Currently page pool only support page recycling when there is only one user of the page, and the split page reusing implemented in the most driver can not use the page pool as bing-pong way of reusing requires the multi user support in page pool. Those reusing or recycling has below limitations: 1. page from page pool can only be used be one user in order for the page recycling to happen. 2. Bing-pong way of reusing in most driver does not support multi desc using different part of the same page in order to save memory. So add multi-users support and frag page recycling in page pool to overcome the above limitation. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09page_pool: add interface to manipulate frag count in page poolYunsheng Lin
For 32 bit systems with 64 bit dma, dma_addr[1] is used to store the upper 32 bit dma addr, those system should be rare those days. For normal system, the dma_addr[1] in 'struct page' is not used, so we can reuse dma_addr[1] for storing frag count, which means how many frags this page might be splited to. In order to simplify the page frag support in the page pool, the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate the 32 bit systems with 64 bit dma, and the page frag support in page pool is disabled for such system. The newly added page_pool_set_frag_count() is called to reserve the maximum frag count before any page frag is passed to the user. The page_pool_atomic_sub_frag_count_return() is called when user is done with the page frag. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09page_pool: keep pp info as long as page pool owns the pageYunsheng Lin
Currently, page->pp is cleared and set everytime the page is recycled, which is unnecessary. So only set the page->pp when the page is added to the page pool and only clear it when the page is released from the page pool. This is also a preparation to support allocating frag page in page pool. Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09net, core: Allow netdev_lower_get_next_private_rcu in bh contextJussi Maki
For the XDP bonding slave lookup to work in the NAPI poll context in which the redudant rcu_read_lock() has been removed we have to follow the same approach as in 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context") and modify the WARN_ON to also check rcu_read_lock_bh_held(). Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210731055738.16820-6-joamaki@gmail.com
2021-08-09net, core: Add support for XDP redirection to slave deviceJussi Maki
This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX into XDP_REDIRECT after BPF program run when the ingress device is a bond slave. The dev_xdp_prog_count is exposed so that slave devices can be checked for loaded XDP programs in order to avoid the situation where both bond master and slave have programs loaded according to xdp_state. Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com
2021-08-09devlink: Fix port_type_set function pointer checkLeon Romanovsky
Fix a typo when checking existence of port_type_set function pointer. Fixes: 82564f6c706a ("devlink: Simplify devlink port API calls") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-09devlink: Set device as early as possibleLeon Romanovsky
All kernel devlink implementations call to devlink_alloc() during initialization routine for specific device which is used later as a parent device for devlink_register(). Such late device assignment causes to the situation which requires us to call to device_register() before setting other parameters, but that call opens devlink to the world and makes accessible for the netlink users. Any attempt to move devlink_register() to be the last call generates the following error due to access to the devlink->dev pointer. [ 8.758862] devlink_nl_param_fill+0x2e8/0xe50 [ 8.760305] devlink_param_notify+0x6d/0x180 [ 8.760435] __devlink_params_register+0x2f1/0x670 [ 8.760558] devlink_params_register+0x1e/0x20 The simple change of API to set devlink device in the devlink_alloc() instead of devlink_register() fixes all this above and ensures that prior to call to devlink_register() everything already set. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-09page_pool: mask the page->signature before the checkingYunsheng Lin
As mentioned in commit c07aea3ef4d4 ("mm: add a signature in struct page"): "The page->signature field is aliased to page->lru.next and page->compound_head." And as the comment in page_is_pfmemalloc(): "lru.next has bit 1 set if the page is allocated from the pfmemalloc reserves. Callers may simply overwrite it if they do not need to preserve that information." The page->signature is OR’ed with PP_SIGNATURE when a page is allocated in page pool, see __page_pool_alloc_pages_slow(), and page->signature is checked directly with PP_SIGNATURE in page_pool_return_skb_page(), which might cause resoure leaking problem for a page from page pool if bit 1 of lru.next is set for a pfmemalloc page. What happens here is that the original pp->signature is OR'ed with PP_SIGNATURE after the allocation in order to preserve any existing bits(such as the bit 1, used to indicate a pfmemalloc page), so when those bits are present, those page is not considered to be from page pool and the DMA mapping of those pages will be left stale. As bit 0 is for page->compound_head, So mask both bit 0/1 before the checking in page_pool_return_skb_page(). And we will return those pfmemalloc pages back to the page allocator after cleaning up the DMA mapping. Fixes: 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling") Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-08devlink: Simplify devlink port API callsLeon Romanovsky
Devlink port already has pointer to the devlink instance and all API calls that forward these devlink ports to the drivers perform same "devlink_port->devlink" assignment before actual call. This patch removes useless parameter and allows us in the future to create specific devlink_port_ops to manage user space access with reliable ops assignment. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: Remove redundant if statementsYajun Deng
The 'if (dev)' statement already move into dev_{put , hold}, so remove redundant if statements. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: core: don't call SIOCBRADD/DELIF for non-bridge devicesNikolay Aleksandrov
Commit ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") changed SIOCBRADD/DELIF to use bridge's ioctl hook (br_ioctl_hook) without checking if the target netdevice is actually a bridge which can cause crashes and generally interpreting other devices' private pointers as net_bridge pointers. Crash example (lo - loopback): $ brctl addif lo ens16 BUG: kernel NULL pointer dereference, address: 000000000000059898 #PF: supervisor read access in kernel modede #PF: error_code(0x0000) - not-present pagege PGD 0 P4D 0 ^Ac Oops: 0000 [#1] SMP NOPTI CPU: 2 PID: 1376 Comm: brctl Kdump: loaded Tainted: G W 5.14.0-rc3+ #405 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014 RIP: 0010:add_del_if+0x1f/0x7c [bridge] Code: 80 bf 1b a0 41 5c e9 c0 3c 03 e1 0f 1f 44 00 00 41 55 41 54 41 89 f4 be 0c 00 00 00 55 48 89 fd 53 48 8b 87 88 00 00 00 89 d3 <4c> 8b a8 98 05 00 00 49 8b bd d0 00 00 00 e8 17 d7 f3 e0 84 c0 74 RSP: 0018:ffff888109d97cb0 EFLAGS: 00010202^Ac RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000000000000c RDI: ffff888101239bc0 RBP: ffff888101239bc0 R08: 0000000000000001 R09: 0000000000000000 R10: ffff888109d97cd8 R11: 00000000000000a3 R12: 0000000000000012 R13: 0000000000000000 R14: ffff888101239bc0 R15: ffff888109d97e10 FS: 00007fc1e365b540(0000) GS:ffff88822be80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000598 CR3: 0000000106506000 CR4: 00000000000006e0 Call Trace: br_ioctl_stub+0x7c/0x441 [bridge] br_ioctl_call+0x6d/0x8a dev_ifsioc+0x325/0x4e8 dev_ioctl+0x46b/0x4e1 sock_do_ioctl+0x7b/0xad sock_ioctl+0x2de/0x2f2 vfs_ioctl+0x1e/0x2b __do_sys_ioctl+0x63/0x86 do_syscall_64+0xcb/0xf2 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fc1e3589427 Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc8d501d38 EFLAGS: 00000202 ORIG_RAX: 000000000000001010 RAX: ffffffffffffffda RBX: 0000000000000012 RCX: 00007fc1e3589427 RDX: 00007ffc8d501d60 RSI: 00000000000089a3 RDI: 0000000000000003 RBP: 00007ffc8d501d60 R08: 0000000000000000 R09: fefefeff77686d74 R10: fffffffffffff8f9 R11: 0000000000000202 R12: 00007ffc8d502e06 R13: 00007ffc8d502e06 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: bridge stp llc bonding ipv6 virtio_net [last unloaded: llc]^Ac CR2: 0000000000000598 Reported-by: syzbot+79f4a8692e267bdb7227@syzkaller.appspotmail.com Fixes: ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: bridge: fix ioctl lockingNikolay Aleksandrov
Before commit ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") the bridge ioctl calls were divided in two parts: one was deviceless called by sock_ioctl and didn't expect rtnl to be held, the other was with a device called by dev_ifsioc() and expected rtnl to be held. After the commit above they were united in a single ioctl stub, but it didn't take care of the locking expectations. For sock_ioctl now we acquire (1) br_ioctl_mutex, (2) rtnl and for dev_ifsioc we acquire (1) rtnl, (2) br_ioctl_mutex The fix is to get a refcnt on the netdev for dev_ifsioc calls and drop rtnl then to reacquire it in the bridge ioctl stub after br_ioctl_mutex has been acquired. That will avoid playing locking games and make the rules straight-forward: we always take br_ioctl_mutex first, and then rtnl. Reported-by: syzbot+34fe5894623c4ab1b379@syzkaller.appspotmail.com Fixes: ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: fix GRO skb truesize updatePaolo Abeni
commit 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb carring sock reference") introduces a serious regression at the GRO layer setting the wrong truesize for stolen-head skbs. Restore the correct truesize: SKB_DATA_ALIGN(...) instead of SKB_TRUESIZE(...) Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Fixes: 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb carring sock reference") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: Replace deprecated CPU-hotplug functions.Sebastian Andrzej Siewior
The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock(). Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-04pktgen: Remove redundant clone_skb overrideNick Richardson
When the netif_receive xmit_mode is set, a line is supposed to set clone_skb to a default 0 value. This line is made redundant due to a preceding line that checks if clone_skb is more than zero and returns -ENOTSUPP. Overriding clone_skb to 0 does not make any difference to the behavior because if it was positive we return error. So it can be either 0 or negative, and in both cases the behavior is the same. Remove redundant line that sets clone_skb to zero. Signed-off-by: Nick Richardson <richardsonnick@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04sock: allow reading and changing sk_userlocks with setsockoptPavel Tikhomirov
SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags disable automatic socket buffers adjustment done by kernel (see tcp_fixup_rcvbuf() and tcp_sndbuf_expand()). If we've just created a new socket this adjustment is enabled on it, but if one changes the socket buffer size by setsockopt(SO_{SND,RCV}BUF*) it becomes disabled. CRIU needs to call setsockopt(SO_{SND,RCV}BUF*) on each socket on restore as it first needs to increase buffer sizes for packet queues restore and second it needs to restore back original buffer sizes. So after CRIU restore all sockets become non-auto-adjustable, which can decrease network performance of restored applications significantly. CRIU need to be able to restore sockets with enabled/disabled adjustment to the same state it was before dump, so let's add special setsockopt for it. Let's also export SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags to uAPI so that using these interface one can reenable automatic socket buffer adjustment on their sockets. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: add netif_set_real_num_queues() for device reconfigJakub Kicinski
netif_set_real_num_rx_queues() and netif_set_real_num_tx_queues() can fail which breaks drivers trying to implement reconfiguration in a way that can't leave the device half-broken. In other words those functions are incompatible with prepare/commit approach. Luckily setting real number of queues can fail only if the number is increased, meaning that if we order operations correctly we can guarantee ending up with either new config (success), or the old one (on error). Provide a helper implementing such logic so that drivers don't have to duplicate it. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: add extack arg for link opsRocco Yue
Pass extack arg to validate_linkmsg and validate_link_af callbacks. If a netlink attribute has a reject_message, use the extended ack mechanism to carry the message back to user space. Signed-off-by: Rocco Yue <rocco.yue@mediatek.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03move netdev_boot_setup into Space.cArnd Bergmann
This is now only used by a handful of old ISA drivers, and can be moved into the file they already all depend on. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03net: Keep vertical alignmentYajun Deng
Those files under /proc/net/stat/ don't have vertical alignment, it looks very difficult. Modify the seq_printf statement, keep vertical alignment. v2: - Use seq_puts() and seq_printf() correctly. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03bpf: use skb_expand_head in bpf_out_neigh_v4/6Vasily Averin
Unlike skb_realloc_headroom, new helper skb_expand_head does not allocate a new skb if possible. Additionally this patch replaces commonly used dereferencing with variables. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03skbuff: introduce skb_expand_head()Vasily Averin
Like skb_realloc_headroom(), new helper increases headroom of specified skb. Unlike skb_realloc_headroom(), it does not allocate a new skb if possible; copies skb->sk on new skb when as needed and frees original skb in case of failures. This helps to simplify ip[6]_finish_output2() and a few other similar cases. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-31Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Andrii Nakryiko says: ==================== bpf-next 2021-07-30 We've added 64 non-merge commits during the last 15 day(s) which contain a total of 83 files changed, 5027 insertions(+), 1808 deletions(-). The main changes are: 1) BTF-guided binary data dumping libbpf API, from Alan. 2) Internal factoring out of libbpf CO-RE relocation logic, from Alexei. 3) Ambient BPF run context and cgroup storage cleanup, from Andrii. 4) Few small API additions for libbpf 1.0 effort, from Evgeniy and Hengqi. 5) bpf_program__attach_kprobe_opts() fixes in libbpf, from Jiri. 6) bpf_{get,set}sockopt() support in BPF iterators, from Martin. 7) BPF map pinning improvements in libbpf, from Martynas. 8) Improved module BTF support in libbpf and bpftool, from Quentin. 9) Bpftool cleanups and documentation improvements, from Quentin. 10) Libbpf improvements for supporting CO-RE on old kernels, from Shuyi. 11) Increased maximum cgroup storage size, from Stanislav. 12) Small fixes and improvements to BPF tests and samples, from various folks. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (64 commits) tools: bpftool: Complete metrics list in "bpftool prog profile" doc tools: bpftool: Document and add bash completion for -L, -B options selftests/bpf: Update bpftool's consistency script for checking options tools: bpftool: Update and synchronise option list in doc and help msg tools: bpftool: Complete and synchronise attach or map types selftests/bpf: Check consistency between bpftool source, doc, completion tools: bpftool: Slightly ease bash completion updates unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg() libbpf: Add btf__load_vmlinux_btf/btf__load_module_btf tools: bpftool: Support dumping split BTF by id libbpf: Add split BTF support for btf__load_from_kernel_by_id() tools: Replace btf__get_from_id() with btf__load_from_kernel_by_id() tools: Free BTF objects at various locations libbpf: Rename btf__get_from_id() as btf__load_from_kernel_by_id() libbpf: Rename btf__load() as btf__load_into_kernel() libbpf: Return non-null error on failures in libbpf_find_prog_btf_id() bpf: Emit better log message if bpf_iter ctx arg btf_id == 0 tools/resolve_btfids: Emit warnings and patch zero id for missing symbols bpf: Increase supported cgroup storage value size libbpf: Fix race when pinning maps in parallel ... ==================== Link: https://lore.kernel.org/r/20210730225606.1897330-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicting commits, all resolutions pretty trivial: drivers/bus/mhi/pci_generic.c 5c2c85315948 ("bus: mhi: pci-generic: configurable network interface MRU") 56f6f4c4eb2a ("bus: mhi: pci_generic: Apply no-op for wake using sideband wake boolean") drivers/nfc/s3fwrn5/firmware.c a0302ff5906a ("nfc: s3fwrn5: remove unnecessary label") 46573e3ab08f ("nfc: s3fwrn5: fix undefined parameter values in dev_err()") 801e541c79bb ("nfc: s3fwrn5: fix undefined parameter values in dev_err()") MAINTAINERS 7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver") 8a7b46fa7902 ("MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30devlink: Allocate devlink directly in requested net namespaceLeon Romanovsky
There is no need in extra call indirection and check from impossible flow where someone tries to set namespace without prior call to devlink_alloc(). Instead of this extra logic and additional EXPORT_SYMBOL, use specialized devlink allocation function that receives net namespace as an argument. Such specialized API allows clear view when devlink initialized in wrong net namespace and/or kernel users don't try to change devlink namespace under the hood. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30devlink: Break parameter notification sequence to be before/after ↵Leon Romanovsky
unload/load driver The change of namespaces during devlink reload calls to driver unload before it accesses devlink parameters. The commands below causes to use-after-free bug when trying to get flow steering mode. * ip netns add n1 * devlink dev reload pci/0000:00:09.0 netns n1 ================================================================== BUG: KASAN: use-after-free in mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core] Read of size 4 at addr ffff888009d04308 by task devlink/275 CPU: 6 PID: 275 Comm: devlink Not tainted 5.12.0-rc2+ #2853 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x93/0xc2 print_address_description.constprop.0+0x18/0x140 ? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core] ? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core] kasan_report.cold+0x7c/0xd8 ? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core] mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core] devlink_nl_param_fill+0x1c8/0xe80 ? __free_pages_ok+0x37a/0x8a0 ? devlink_flash_update_timeout_notify+0xd0/0xd0 ? lock_acquire+0x1a9/0x6d0 ? fs_reclaim_acquire+0xb7/0x160 ? lock_is_held_type+0x98/0x110 ? 0xffffffff81000000 ? lock_release+0x1f9/0x6c0 ? fs_reclaim_release+0xa1/0xf0 ? lock_downgrade+0x6d0/0x6d0 ? lock_is_held_type+0x98/0x110 ? lock_is_held_type+0x98/0x110 ? memset+0x20/0x40 ? __build_skb_around+0x1f8/0x2b0 devlink_param_notify+0x6d/0x180 devlink_reload+0x1c3/0x520 ? devlink_remote_reload_actions_performed+0x30/0x30 ? mutex_trylock+0x24b/0x2d0 ? devlink_nl_cmd_reload+0x62b/0x1070 devlink_nl_cmd_reload+0x66d/0x1070 ? devlink_reload+0x520/0x520 ? devlink_get_from_attrs+0x1bc/0x260 ? devlink_nl_pre_doit+0x64/0x4d0 genl_family_rcv_msg_doit+0x1e9/0x2f0 ? mutex_lock_io_nested+0x1130/0x1130 ? genl_family_rcv_msg_attrs_parse.constprop.0+0x240/0x240 ? security_capable+0x51/0x90 genl_rcv_msg+0x27f/0x4a0 ? genl_get_cmd+0x3c0/0x3c0 ? lock_acquire+0x1a9/0x6d0 ? devlink_reload+0x520/0x520 ? lock_release+0x6c0/0x6c0 netlink_rcv_skb+0x11d/0x340 ? genl_get_cmd+0x3c0/0x3c0 ? netlink_ack+0x9f0/0x9f0 ? lock_release+0x1f9/0x6c0 genl_rcv+0x24/0x40 netlink_unicast+0x433/0x700 ? netlink_attachskb+0x730/0x730 ? _copy_from_iter_full+0x178/0x650 ? __alloc_skb+0x113/0x2b0 netlink_sendmsg+0x6f1/0xbd0 ? netlink_unicast+0x700/0x700 ? lock_is_held_type+0x98/0x110 ? netlink_unicast+0x700/0x700 sock_sendmsg+0xb0/0xe0 __sys_sendto+0x193/0x240 ? __x64_sys_getpeername+0xb0/0xb0 ? do_sys_openat2+0x10b/0x370 ? __up_read+0x1a1/0x7b0 ? do_user_addr_fault+0x219/0xdc0 ? __x64_sys_openat+0x120/0x1d0 ? __x64_sys_open+0x1a0/0x1a0 __x64_sys_sendto+0xdd/0x1b0 ? syscall_enter_from_user_mode+0x1d/0x50 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fc69d0af14a Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c RSP: 002b:00007ffc1d8292f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fc69d0af14a RDX: 0000000000000038 RSI: 0000555f57c56440 RDI: 0000000000000003 RBP: 0000555f57c56410 R08: 00007fc69d17b200 R09: 000000000000000c R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Allocated by task 146: kasan_save_stack+0x1b/0x40 __kasan_kmalloc+0x99/0xc0 mlx5_init_fs+0xf0/0x1c50 [mlx5_core] mlx5_load+0xd2/0x180 [mlx5_core] mlx5_init_one+0x2f6/0x450 [mlx5_core] probe_one+0x47d/0x6e0 [mlx5_core] pci_device_probe+0x2a0/0x4a0 really_probe+0x20a/0xc90 driver_probe_device+0xd8/0x380 device_driver_attach+0x1df/0x250 __driver_attach+0xff/0x240 bus_for_each_dev+0x11e/0x1a0 bus_add_driver+0x309/0x570 driver_register+0x1ee/0x380 0xffffffffa06b8062 do_one_initcall+0xd5/0x410 do_init_module+0x1c8/0x760 load_module+0x6d8b/0x9650 __do_sys_finit_module+0x118/0x1b0 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 275: kasan_save_stack+0x1b/0x40 kasan_set_track+0x1c/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0x102/0x140 slab_free_freelist_hook+0x74/0x1b0 kfree+0xd7/0x2a0 mlx5_unload+0x16/0xb0 [mlx5_core] mlx5_unload_one+0xae/0x120 [mlx5_core] mlx5_devlink_reload_down+0x1bc/0x380 [mlx5_core] devlink_reload+0x141/0x520 devlink_nl_cmd_reload+0x66d/0x1070 genl_family_rcv_msg_doit+0x1e9/0x2f0 genl_rcv_msg+0x27f/0x4a0 netlink_rcv_skb+0x11d/0x340 genl_rcv+0x24/0x40 netlink_unicast+0x433/0x700 netlink_sendmsg+0x6f1/0xbd0 sock_sendmsg+0xb0/0xe0 __sys_sendto+0x193/0x240 __x64_sys_sendto+0xdd/0x1b0 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff888009d04300 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 8 bytes inside of 128-byte region [ffff888009d04300, ffff888009d04380) The buggy address belongs to the page: page:0000000086a64ecc refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888009d04000 pfn:0x9d04 head:0000000086a64ecc order:1 compound_mapcount:0 flags: 0x4000000000010200(slab|head) raw: 4000000000010200 ffffea0000203980 0000000200000002 ffff8880050428c0 raw: ffff888009d04000 000000008020001d 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888009d04200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888009d04280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff888009d04300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888009d04380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888009d04400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== The right solution to devlink reload is to notify about deletion of parameters, unload driver, change net namespaces, load driver and notify about addition of parameters. Fixes: 070c63f20f6c ("net: devlink: allow to change namespaces during reload") Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30sk_buff: avoid potentially clearing 'slow_gro' fieldPaolo Abeni
If skb_dst_set_noref() is invoked with a NULL dst, the 'slow_gro' field is cleared, too. That could lead to wrong behavior if the skb later enters the GRO stage. Fix the potential issue replacing preserving a non-zero value of the 'slow_gro' field. Additionally, fix a comment typo. Reported-by: Sabrina Dubroca <sd@queasysnail.net> Reported-by: Jakub Kicinski <kuba@kernel.org> Fixes: 8a886b142bd0 ("sk_buff: track dst status in slow_gro") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/aa42529252dc8bb02bd42e8629427040d1058537.1627662501.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-29net/sched: store the last executed chain also for clsact egressDavide Caratti
currently, only 'ingress' and 'clsact ingress' qdiscs store the tc 'chain id' in the skb extension. However, userspace programs (like ovs) are able to setup egress rules, and datapath gets confused in case it doesn't find the 'chain id' for a packet that's "recirculated" by tc. Change tcf_classify() to have the same semantic as tcf_classify_ingress() so that a single function can be called in ingress / egress, using the tc ingress / egress block respectively. Suggested-by: Alaa Hleilel <alaa@nvidia.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29mctp: Add MCTP baseJeremy Kerr
Add basic Kconfig, an initial (empty) af_mctp source object, and {AF,PF}_MCTP definitions, and the required definitions for a new protocol type. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29skbuff: allow 'slow_gro' for skb carring sock referencePaolo Abeni
This change leverages the infrastructure introduced by the previous patches to allow soft devices passing to the GRO engine owned skbs without impacting the fast-path. It's up to the GRO caller ensuring the slow_gro bit validity before invoking the GRO engine. The new helper skb_prepare_for_gro() is introduced for that goal. On slow_gro, skbs are aggregated only with equal sk. Additionally, skb truesize on GRO recycle and free is correctly updated so that sk wmem is not changed by the GRO processing. rfc-> v1: - fixed bad truesize on dev_gro_receive NAPI_FREE - use the existing state bit Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>