linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2018-07-13	xdp: don't make drivers report attachment mode	Jakub Kicinski
	prog_attached of struct netdev_bpf should have been superseded by simply setting prog_id long time ago, but we kept it around to allow offloading drivers to communicate attachment mode (drv vs hw). Subsequently drivers were also allowed to report back attachment flags (prog_flags), and since nowadays only programs attached will XDP_FLAGS_HW_MODE can get offloaded, we can tell the attachment mode from the flags driver reports. Remove prog_attached member. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13	xdp: add per mode attributes for attached programs	Jakub Kicinski
	In preparation for support of simultaneous driver and hardware XDP support add per-mode attributes. The catch-all IFLA_XDP_PROG_ID will still be reported, but user space can now also access the program ID in a new IFLA_XDP_<mode>_PROG_ID attribute. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13	ieee802154: add rx LQI from userspace	Romuald CARI
	The Link Quality Indication data exposed by drivers could not be accessed from userspace. Since this data is per-datagram received, it makes sense to make it available to userspace application through the ancillary data mechanism in recvmsg rather than through ioctls. This can be activated using the socket option WPAN_WANTLQI under SOL_IEEE802154 protocol. This LQI data is available in the ancillary data buffer under the SOL_IEEE802154 level as the type WPAN_LQI. The value is an unsigned byte indicating the link quality with values ranging 0-255. Signed-off-by: Romuald Cari <romuald.cari@devialet.com> Signed-off-by: Clément Peron <clement.peron@devialet.com> Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2018-07-13	xsk: do not return EMSGSIZE in copy mode for packets larger than MTU	Magnus Karlsson
	This patch stops returning EMSGSIZE from sendmsg in copy mode when the size of the packet is larger than the MTU. Just send it to the device so that it will drop it as in zero-copy mode. This makes the error reporting consistent between copy mode and zero-copy mode. Fixes: 35fcde7f8deb ("xsk: support for Tx") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13	xsk: always return ENOBUFS from sendmsg if there is no TX queue	Magnus Karlsson
	This patch makes sure ENOBUFS is always returned from sendmsg if there is no TX queue configured. This was not the case for zero-copy mode. With this patch this error reporting is consistent between copy mode and zero-copy mode. Fixes: ac98d8aab61b ("xsk: wire upp Tx zero-copy functions") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13	xsk: do not return EAGAIN from sendmsg when completion queue is full	Magnus Karlsson
	This patch stops returning EAGAIN in TX copy mode when the completion queue is full as zero-copy does not do this. Instead this situation can be detected by comparing the head and tail pointers of the completion queue in both modes. In any case, EAGAIN was not the correct error code here since no amount of calling sendmsg will solve the problem. Only consuming one or more messages on the completion queue will fix this. With this patch, the error reporting becomes consistent between copy mode and zero-copy mode. Fixes: 35fcde7f8deb ("xsk: support for Tx") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-13	xsk: do not return ENXIO from TX copy mode	Magnus Karlsson
	This patch removes the ENXIO return code from TX copy-mode when someone has forcefully changed the number of queues on the device so that the queue bound to the socket is no longer available. Just silently stop sending anything as in zero-copy mode so the error reporting gets consistent between the two modes. Fixes: 35fcde7f8deb ("xsk: support for Tx") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-12	devlink: Add generic parameters region_snapshot	Alex Vesker
	region_snapshot - When set enables capturing region snapshots Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	devlink: Add support for region snapshot read command	Alex Vesker
	Add support for DEVLINK_CMD_REGION_READ_GET used for both reading and dumping region data. Read allows reading from a region specific address for given length. Dump allows reading the full region. If only snapshot ID is provided a snapshot dump will be done. If snapshot ID, Address and Length are provided a snapshot read will done. This is used for both snapshot access and will be used in the same way to access current data on the region. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	devlink: Add support for region snapshot delete command	Alex Vesker
	Add support for DEVLINK_CMD_REGION_DEL used for deleting a snapshot from a region. The snapshot ID is required. Also added notification support for NEW and DEL of snapshots. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	devlink: Extend the support querying for region snapshot IDs	Alex Vesker
	Extend the support for DEVLINK_CMD_REGION_GET command to also return the IDs of the snapshot currently present on the region. Each reply will include a nested snapshots attribute that can contain multiple snapshot attributes each with an ID. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	devlink: Add support for region get command	Alex Vesker
	Add support for DEVLINK_CMD_REGION_GET command which is used for querying for the supported DEV/REGION values of devlink devices. The support is both for doit and dumpit. Reply includes: BUS_NAME, DEVICE_NAME, REGION_NAME, REGION_SIZE Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	devlink: Add support for creating region snapshots	Alex Vesker
	Each device address region can store multiple snapshots, each snapshot is identified using a different numerical ID. This ID is used when deleting a snapshot or showing an address region specific snapshot. This patch exposes a callback to add a new snapshot to an address region. The snapshot will be deleted using the destructor function when destroying a region or when a snapshot delete command from devlink user tool. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	devlink: Add callback to query for snapshot id before snapshot create	Alex Vesker
	To restrict the driver with the snapshot ID selection a new callback is introduced for the driver to get the snapshot ID before creating a new snapshot. This will also allow giving the same ID for multiple snapshots taken of different regions on the same time. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	devlink: Add support for creating and destroying regions	Alex Vesker
	This allows a device to register its supported address regions. Each address region can be accessed directly for example reading the snapshots taken of this address space. Drivers are not limited in the name selection for different regions. An example of a region-name can be: pci cr-space, register-space. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	net: gro: properly remove skb from list	Prashant Bhole
	Following crash occurs in validate_xmit_skb_list() when same skb is iterated multiple times in the loop and consume_skb() is called. The root cause is calling list_del_init(&skb->list) and not clearing skb->next in d4546c2509b1. list_del_init(&skb->list) sets skb->next to point to skb itself. skb->next needs to be cleared because other parts of network stack uses another kind of SKB lists. validate_xmit_skb_list() uses such list. A similar type of bugfix was reported by Jesper Dangaard Brouer. https://patchwork.ozlabs.org/patch/942541/ This patch clears skb->next and changes list_del_init() to list_del() so that list->prev will maintain the list poison. [ 148.185511] ================================================================== [ 148.187865] BUG: KASAN: use-after-free in validate_xmit_skb_list+0x4b/0xa0 [ 148.190158] Read of size 8 at addr ffff8801e52eefc0 by task swapper/1/0 [ 148.192940] [ 148.193642] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.18.0-rc3+ #25 [ 148.195423] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014 [ 148.199129] Call Trace: [ 148.200565] <IRQ> [ 148.201911] dump_stack+0xc6/0x14c [ 148.203572] ? dump_stack_print_info.cold.1+0x2f/0x2f [ 148.205083] ? kmsg_dump_rewind_nolock+0x59/0x59 [ 148.206307] ? validate_xmit_skb+0x2c6/0x560 [ 148.207432] ? debug_show_held_locks+0x30/0x30 [ 148.208571] ? validate_xmit_skb_list+0x4b/0xa0 [ 148.211144] print_address_description+0x6c/0x23c [ 148.212601] ? validate_xmit_skb_list+0x4b/0xa0 [ 148.213782] kasan_report.cold.6+0x241/0x2fd [ 148.214958] validate_xmit_skb_list+0x4b/0xa0 [ 148.216494] sch_direct_xmit+0x1b0/0x680 [ 148.217601] ? dev_watchdog+0x4e0/0x4e0 [ 148.218675] ? do_raw_spin_trylock+0x10/0x120 [ 148.219818] ? do_raw_spin_lock+0xe0/0xe0 [ 148.221032] __dev_queue_xmit+0x1167/0x1810 [ 148.222155] ? sched_clock+0x5/0x10 [...] [ 148.474257] Allocated by task 0: [ 148.475363] kasan_kmalloc+0xbf/0xe0 [ 148.476503] kmem_cache_alloc+0xb4/0x1b0 [ 148.477654] __build_skb+0x91/0x250 [ 148.478677] build_skb+0x67/0x180 [ 148.479657] e1000_clean_rx_irq+0x542/0x8a0 [ 148.480757] e1000_clean+0x652/0xd10 [ 148.481772] net_rx_action+0x4ea/0xc20 [ 148.482808] __do_softirq+0x1f9/0x574 [ 148.483831] [ 148.484575] Freed by task 0: [ 148.485504] __kasan_slab_free+0x12e/0x180 [ 148.486589] kmem_cache_free+0xb4/0x240 [ 148.487634] kfree_skbmem+0xed/0x150 [ 148.488648] consume_skb+0x146/0x250 [ 148.489665] validate_xmit_skb+0x2b7/0x560 [ 148.490754] validate_xmit_skb_list+0x70/0xa0 [ 148.491897] sch_direct_xmit+0x1b0/0x680 [ 148.493949] __dev_queue_xmit+0x1167/0x1810 [ 148.495103] br_dev_queue_push_xmit+0xce/0x250 [ 148.496196] br_forward_finish+0x276/0x280 [ 148.497234] __br_forward+0x44f/0x520 [ 148.498260] br_forward+0x19f/0x1b0 [ 148.499264] br_handle_frame_finish+0x65e/0x980 [ 148.500398] NF_HOOK.constprop.10+0x290/0x2a0 [ 148.501522] br_handle_frame+0x417/0x640 [ 148.502582] __netif_receive_skb_core+0xaac/0x18f0 [ 148.503753] __netif_receive_skb_one_core+0x98/0x120 [ 148.504958] netif_receive_skb_internal+0xe3/0x330 [ 148.506154] napi_gro_complete+0x190/0x2a0 [ 148.507243] dev_gro_receive+0x9f7/0x1100 [ 148.508316] napi_gro_receive+0xcb/0x260 [ 148.509387] e1000_clean_rx_irq+0x2fc/0x8a0 [ 148.510501] e1000_clean+0x652/0xd10 [ 148.511523] net_rx_action+0x4ea/0xc20 [ 148.512566] __do_softirq+0x1f9/0x574 [ 148.513598] [ 148.514346] The buggy address belongs to the object at ffff8801e52eefc0 [ 148.514346] which belongs to the cache skbuff_head_cache of size 232 [ 148.517047] The buggy address is located 0 bytes inside of [ 148.517047] 232-byte region [ffff8801e52eefc0, ffff8801e52ef0a8) [ 148.519549] The buggy address belongs to the page: [ 148.520726] page:ffffea000794bb00 count:1 mapcount:0 mapping:ffff880106f4dfc0 index:0xffff8801e52ee840 compound_mapcount: 0 [ 148.524325] flags: 0x17ffffc0008100(slab\|head) [ 148.525481] raw: 0017ffffc0008100 ffff880106b938d0 ffff880106b938d0 ffff880106f4dfc0 [ 148.527503] raw: ffff8801e52ee840 0000000000190011 00000001ffffffff 0000000000000000 [ 148.529547] page dumped because: kasan: bad access detected Fixes: d4546c2509b1 ("net: Convert GRO SKB handling to list_head.") Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Reported-by: Tyler Hicks <tyhicks@canonical.com> Tested-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	packet: reset network header if packet shorter than ll reserved space	Willem de Bruijn
	If variable length link layer headers result in a packet shorter than dev->hard_header_len, reset the network header offset. Else skb->mac_len may exceed skb->len after skb_mac_reset_len. packet_sendmsg_spkt already has similar logic. Fixes: b84bbaf7a6c8 ("packet: in packet_snd start writing at link layer allocation") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	nsh: set mac len based on inner packet	Willem de Bruijn
	When pulling the NSH header in nsh_gso_segment, set the mac length based on the encapsulated packet type. skb_reset_mac_len computes an offset to the network header, which here still points to the outer packet: > skb_reset_network_header(skb); > [...] > __skb_pull(skb, nsh_len); > skb_reset_mac_header(skb); // now mac hdr starts nsh_len == 8B after net hdr > skb_reset_mac_len(skb); // mac len = net hdr - mac hdr == (u16) -8 == 65528 > [..] > skb_mac_gso_segment(skb, ..) Link: http://lkml.kernel.org/r/CAF=yD-KeAcTSOn4AxirAxL8m7QAS8GBBe1w09eziYwvPbbUeYA@mail.gmail.com Reported-by: syzbot+7b9ed9872dab8c32305d@syzkaller.appspotmail.com Fixes: c411ed854584 ("nsh: add GSO support") Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	net: ipv4: fix listify ip_rcv_finish in case of forwarding	Jesper Dangaard Brouer
	In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling dst_input(skb) was split-out. The ip_sublist_rcv_finish() just calls dst_input(skb) in a loop. The problem is that ip_sublist_rcv_finish() forgot to remove the SKB from the list before invoking dst_input(). Further more we need to clear skb->next as other parts of the network stack use another kind of SKB lists for xmit_more (see dev_hard_start_xmit). A crash occurs if e.g. dst_input() invoke ip_forward(), which calls dst_output()/ip_output() that eventually calls __dev_queue_xmit() + sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list(). This patch only fixes the crash, but there is a huge potential for a performance boost if we can pass an SKB-list through to ip_forward. Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	net: Don't copy pfmemalloc flag in __copy_skb_header()	Stefano Brivio
	The pfmemalloc flag indicates that the skb was allocated from the PFMEMALLOC reserves, and the flag is currently copied on skb copy and clone. However, an skb copied from an skb flagged with pfmemalloc wasn't necessarily allocated from PFMEMALLOC reserves, and on the other hand an skb allocated that way might be copied from an skb that wasn't. So we should not copy the flag on skb copy, and rather decide whether to allow an skb to be associated with sockets unrelated to page reclaim depending only on how it was allocated. Move the pfmemalloc flag before headers_start[0] using an existing 1-bit hole, so that __copy_skb_header() doesn't copy it. When cloning, we'll now take care of this flag explicitly, contravening to the warning comment of __skb_clone(). While at it, restore the newline usage introduced by commit b19372273164 ("net: reorganize sk_buff for faster __copy_skb_header()") to visually separate bytes used in bitfields after headers_start[0], that was gone after commit a9e419dc7be6 ("netfilter: merge ctinfo into nfct pointer storage area"), and describe the pfmemalloc flag in the kernel-doc structure comment. This doesn't change the size of sk_buff or cacheline boundaries, but consolidates the 15 bits hole before tc_index into a 2 bytes hole before csum, that could now be filled more easily. Reported-by: Patrick Talbert <ptalbert@redhat.com> Fixes: c93bdd0e03e8 ("netvm: allow skb allocation to use PFMEMALLOC reserves") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	net/sched: act_skbedit: don't use spinlock in the data path	Davide Caratti
	use RCU instead of spin_{,un}lock_bh, to protect concurrent read/write on act_skbedit configuration. This reduces the effects of contention in the data path, in case multiple readers are present. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	net/sched: skbedit: use per-cpu counters	Davide Caratti
	use per-CPU counters, instead of sharing a single set of stats with all cores: this removes the need of spinlocks when stats are read/updated. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	tcp: use monotonic timestamps for PAWS	Arnd Bergmann
	Using get_seconds() for timestamps is deprecated since it can lead to overflows on 32-bit systems. While the interface generally doesn't overflow until year 2106, the specific implementation of the TCP PAWS algorithm breaks in 2038 when the intermediate signed 32-bit timestamps overflow. A related problem is that the local timestamps in CLOCK_REALTIME form lead to unexpected behavior when settimeofday is called to set the system clock backwards or forwards by more than 24 days. While the first problem could be solved by using an overflow-safe method of comparing the timestamps, a nicer solution is to use a monotonic clocksource with ktime_get_seconds() that simply doesn't overflow (at least not until 136 years after boot) and that doesn't change during settimeofday(). To make 32-bit and 64-bit architectures behave the same way here, and also save a few bytes in the tcp_options_received structure, I'm changing the type to a 32-bit integer, which is now safe on all architectures. Finally, the ts_recent_stamp field also (confusingly) gets used to store a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow(). This is currently safe, but changing the type to 32-bit requires some small changes there to keep it working. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	net/tls: Use aead_request_alloc/free for request alloc/free	Vakul Garg
	Instead of kzalloc/free for aead_request allocation and free, use functions aead_request_alloc(), aead_request_free(). It ensures that any sensitive crypto material held in crypto transforms is securely erased from memory. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Acked-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	tcp: allow user to create repair socket without window probes	Stefan Baranoff
	Under rare conditions where repair code may be used it is possible that window probes are either unnecessary or undesired. If the user knows that window probes are not wanted or needed this change allows them to skip sending them when a socket comes out of repair. Signed-off-by: Stefan Baranoff <sbaranoff@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	tcp: fix sequence numbers for repaired sockets re-using TIME-WAIT sockets	Stefan Baranoff
	This patch fixes a bug where the sequence numbers of a socket created using TCP repair functionality are lower than set after connect is called. This occurs when the repair socket overlaps with a TIME-WAIT socket and triggers the re-use code. The amount lower is equal to the number of times that a particular IP/port set is re-used and then put back into TIME-WAIT. Re-using the first time the sequence number is 1 lower, closing that socket and then re-opening (with repair) a new socket with the same addresses/ports puts the sequence number 2 lower than set via setsockopt. The third time is 3 lower, etc. I have not tested what the limit of this acrewal is, if any. The fix is, if a socket is in repair mode, to respect the already set sequence number and timestamp when it would have already re-used the TIME-WAIT socket. Signed-off-by: Stefan Baranoff <sbaranoff@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	sch_fq_codel: zero q->flows_cnt when fq_codel_init fails	Jacob Keller
	When fq_codel_init fails, qdisc_create_dflt will cleanup by using qdisc_destroy. This function calls the ->reset() op prior to calling the ->destroy() op. Unfortunately, during the failure flow for sch_fq_codel, the ->flows parameter is not initialized, so the fq_codel_reset function will null pointer dereference. kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 kernel: IP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] kernel: PGD 0 P4D 0 kernel: Oops: 0000 [#1] SMP PTI kernel: Modules linked in: i40iw i40e(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod sunrpc ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt iTCO_vendor_support intel_uncore ib_core intel_rapl_perf mei_me mei joydev i2c_i801 lpc_ich ioatdma shpchp wmi sch_fq_codel xfs libcrc32c mgag200 ixgbe drm_kms_helper isci ttm firewire_ohci kernel: mdio drm igb libsas crc32c_intel firewire_core ptp pps_core scsi_transport_sas crc_itu_t dca i2c_algo_bit ipmi_si ipmi_devintf ipmi_msghandler [last unloaded: i40e] kernel: CPU: 10 PID: 4219 Comm: ip Tainted: G OE 4.16.13custom-fq-codel-test+ #3 kernel: Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.05.0004.051120151007 05/11/2015 kernel: RIP: 0010:fq_codel_reset+0x58/0xd0 [sch_fq_codel] kernel: RSP: 0018:ffffbfbf4c1fb620 EFLAGS: 00010246 kernel: RAX: 0000000000000400 RBX: 0000000000000000 RCX: 00000000000005b9 kernel: RDX: 0000000000000000 RSI: ffff9d03264a60c0 RDI: ffff9cfd17b31c00 kernel: RBP: 0000000000000001 R08: 00000000000260c0 R09: ffffffffb679c3e9 kernel: R10: fffff1dab06a0e80 R11: ffff9cfd163af800 R12: ffff9cfd17b31c00 kernel: R13: 0000000000000001 R14: ffff9cfd153de600 R15: 0000000000000001 kernel: FS: 00007fdec2f92800(0000) GS:ffff9d0326480000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000008 CR3: 0000000c1956a006 CR4: 00000000000606e0 kernel: Call Trace: kernel: qdisc_destroy+0x56/0x140 kernel: qdisc_create_dflt+0x8b/0xb0 kernel: mq_init+0xc1/0xf0 kernel: qdisc_create_dflt+0x5a/0xb0 kernel: dev_activate+0x205/0x230 kernel: __dev_open+0xf5/0x160 kernel: __dev_change_flags+0x1a3/0x210 kernel: dev_change_flags+0x21/0x60 kernel: do_setlink+0x660/0xdf0 kernel: ? down_trylock+0x25/0x30 kernel: ? xfs_buf_trylock+0x1a/0xd0 [xfs] kernel: ? rtnl_newlink+0x816/0x990 kernel: ? _xfs_buf_find+0x327/0x580 [xfs] kernel: ? _cond_resched+0x15/0x30 kernel: ? kmem_cache_alloc+0x20/0x1b0 kernel: ? rtnetlink_rcv_msg+0x200/0x2f0 kernel: ? rtnl_calcit.isra.30+0x100/0x100 kernel: ? netlink_rcv_skb+0x4c/0x120 kernel: ? netlink_unicast+0x19e/0x260 kernel: ? netlink_sendmsg+0x1ff/0x3c0 kernel: ? sock_sendmsg+0x36/0x40 kernel: ? ___sys_sendmsg+0x295/0x2f0 kernel: ? ebitmap_cmp+0x6d/0x90 kernel: ? dev_get_by_name_rcu+0x73/0x90 kernel: ? skb_dequeue+0x52/0x60 kernel: ? __inode_wait_for_writeback+0x7f/0xf0 kernel: ? bit_waitqueue+0x30/0x30 kernel: ? fsnotify_grab_connector+0x3c/0x60 kernel: ? __sys_sendmsg+0x51/0x90 kernel: ? do_syscall_64+0x74/0x180 kernel: ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2 kernel: Code: 00 00 48 89 87 00 02 00 00 8b 87 a0 01 00 00 85 c0 0f 84 84 00 00 00 31 ed 48 63 dd 83 c5 01 48 c1 e3 06 49 03 9c 24 90 01 00 00 <48> 8b 73 08 48 8b 3b e8 6c 9a 4f f6 48 8d 43 10 48 c7 03 00 00 kernel: RIP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] RSP: ffffbfbf4c1fb620 kernel: CR2: 0000000000000008 kernel: ---[ end trace e81a62bede66274e ]--- This is caused because flows_cnt is non-zero, but flows hasn't been initialized. fq_codel_init has left the private data in a partially initialized state. To fix this, reset flows_cnt to 0 when we fail to initialize. Additionally, to make the state more consistent, also cleanup the flows pointer when the allocation of backlogs fails. This fixes the NULL pointer dereference, since both the for-loop and memset in fq_codel_reset will be no-ops when flow_cnt is zero. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12	new wrapper: alloc_file_pseudo()	Al Viro
	takes inode, vfsmount, name, O_... flags and file_operations and either returns a new struct file (in which case inode reference we held is consumed) or returns ERR_PTR(), in which case no refcounts are altered. converted aio_private_file() and sock_alloc_file() to it Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12	alloc_file(): switch to passing O_... flags instead of FMODE_... mode	Al Viro
	... so that it could set both ->f_flags and ->f_mode, without callers having to set ->f_flags manually. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12	Merge branch 'ieee802154-for-davem-2018-07-11' of ↵	David S. Miller
	git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan Stefan Schmidt says: ==================== pull-request: ieee802154 for net 2018-07-11 An update from ieee802154 for your net tree. Build system fix for a missing include from Arnd Bergmann. Setting the IFLA_LINK for the lowpan parent from Lubomir Rintel. Fixes for some RX corner cases in adf7242 driver by Michael Hennerich. And some small patches to cleanup our BUG_ON vs WARN_ON usage. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11	tipc: check session number before accepting link protocol messages	Jon Maloy
	In some virtual environments we observe a significant higher number of packet reordering and delays than we have been used to traditionally. This makes it necessary with stricter checks on incoming link protocol messages' session number, which until now only has been validated for RESET messages. Since the other two message types, ACTIVATE and STATE messages also carry this number, it is easy to extend the validation check to those messages. We also introduce a flag indicating if a link has a valid peer session number or not. This eliminates the mixing of 32- and 16-bit arithmethics we are currently using to achieve this. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11	tipc: add sequence number check for link STATE messages	Jon Maloy
	Some switch infrastructures produce huge amounts of packet duplicates. This becomes a problem if those messages are STATE/NACK protocol messages, causing unnecessary retransmissions of already accepted packets. We now introduce a unique sequence number per STATE protocol message so that duplicates can be identified and ignored. This will also be useful when tracing such cases, and to avert replay attacks when TIPC is encrypted. For compatibility reasons we have to introduce a new capability flag TIPC_LINK_PROTO_SEQNO to handle this new feature. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11	Merge branch '10GbE' of ↵	David S. Miller
	git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== L2 Fwd Offload & 10GbE Intel Driver Updates 2018-07-09 This patch series is meant to allow support for the L2 forward offload, aka MACVLAN offload without the need for using ndo_select_queue. The existing solution currently requires that we use ndo_select_queue in the transmit path if we want to associate specific Tx queues with a given MACVLAN interface. In order to get away from this we need to repurpose the tc_to_txq array and XPS pointer for the MACVLAN interface and use those as a means of accessing the queues on the lower device. As a result we cannot offload a device that is configured as multiqueue, however it doesn't really make sense to configure a macvlan interfaced as being multiqueue anyway since it doesn't really have a qdisc of its own in the first place. The big changes in this set are: Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN Disable XPS for single queue devices Replace accel_priv with sb_dev in ndo_select_queue Add sb_dev parameter to fallback function for ndo_select_queue Consolidated ndo_select_queue functions that appeared to be duplicates ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11	tcp: expose both send and receive intervals for rate sample	Deepti Raghavan
	Congestion control algorithms, which access the rate sample through the tcp_cong_control function, only have access to the maximum of the send and receive interval, for cases where the acknowledgment rate may be inaccurate due to ACK compression or decimation. Algorithms may want to use send rates and receive rates as separate signals. Signed-off-by: Deepti Raghavan <deeptir@mit.edu> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11	net: sched: fix unprotected access to rcu cookie pointer	Vlad Buslov
	Fix action attribute size calculation function to take rcu read lock and access act_cookie pointer with rcu dereference. Fixes: eec94fdb0480 ("net: sched: use rcu for action cookie update") Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11	net: sched: act_ife: fix memory leak in ife init	Vlad Buslov
	Free params if tcf_idr_check_alloc() returned error. Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11	net/ipv6: propagate net.ipv6.conf.all.addr_gen_mode to devices	Sabrina Dubroca
	This aligns the addr_gen_mode sysctl with the expected behavior of the "all" variant. Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11	net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODE	Sabrina Dubroca
	inet6_ifla6_size() is called to check how much space is needed by inet6_fill_link_af() and inet6_fill_ifinfo(), both of which include the IFLA_INET6_ADDR_GEN_MODE attribute. Reserve some room for it. Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11	net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_dev	Sabrina Dubroca
	The value has already been copied from this netns's devconf_dflt, it shouldn't be reset to the global kernel default. Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11	net/ipv6: fix addrconf_sysctl_addr_gen_mode	Sabrina Dubroca
	addrconf_sysctl_addr_gen_mode() has multiple problems. First, it ignores the errors returned by proc_dointvec(). addrconf_sysctl_addr_gen_mode() calls proc_dointvec() directly, which writes the value to memory, and then checks if it's valid and may return EINVAL. If a bad value is given, the value displayed when reading net.ipv6.conf.foo.addr_gen_mode next time will be invalid. In case the value provided by the user was valid, addrconf_dev_config() won't be called since idev->cnf.addr_gen_mode has already been updated. Fix this in the usual way we deal with values that need to be checked after the proc_do*() helper has returned: define a local ctl_table and storage, call proc_dointvec() on that temporary area, then check and store. addrconf_sysctl_addr_gen_mode() also writes the new value to the global ipv6_devconf_dflt, when we're writing to some netns's default, so that new netns will inherit the value that was set by the change occuring in any netns. That doesn't make any sense, so let's drop this assignment. Finally, since addr_gen_mode is a __u32, switch to proc_douintvec(). Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11	net/sched: flower: Fix null pointer dereference when run tc vlan command	Jianbo Liu
	Zahari issued tc vlan command without setting vlan_ethtype, which will crash kernel. To avoid this, we must check tb[TCA_FLOWER_KEY_VLAN_ETH_TYPE] is not null before use it. Also we don't need to dump vlan_ethtype or cvlan_ethtype in this case. Fixes: d64efd0926ba ('net/sched: flower: Add supprt for matching on QinQ vlan headers') Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Reported-by: Zahari Doychev <zahari.doychev@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11	bpf: fix panic due to oob in bpf_prog_test_run_skb	Daniel Borkmann
	sykzaller triggered several panics similar to the below: [...] [ 248.851531] BUG: KASAN: use-after-free in _copy_to_user+0x5c/0x90 [ 248.857656] Read of size 985 at addr ffff8808017ffff2 by task a.out/1425 [...] [ 248.865902] CPU: 1 PID: 1425 Comm: a.out Not tainted 4.18.0-rc4+ #13 [ 248.865903] Hardware name: Supermicro SYS-5039MS-H12TRF/X11SSE-F, BIOS 2.1a 03/08/2018 [ 248.865905] Call Trace: [ 248.865910] dump_stack+0xd6/0x185 [ 248.865911] ? show_regs_print_info+0xb/0xb [ 248.865913] ? printk+0x9c/0xc3 [ 248.865915] ? kmsg_dump_rewind_nolock+0xe4/0xe4 [ 248.865919] print_address_description+0x6f/0x270 [ 248.865920] kasan_report+0x25b/0x380 [ 248.865922] ? _copy_to_user+0x5c/0x90 [ 248.865924] check_memory_region+0x137/0x190 [ 248.865925] kasan_check_read+0x11/0x20 [ 248.865927] _copy_to_user+0x5c/0x90 [ 248.865930] bpf_test_finish.isra.8+0x4f/0xc0 [ 248.865932] bpf_prog_test_run_skb+0x6a0/0xba0 [...] After scrubbing the BPF prog a bit from the noise, turns out it called bpf_skb_change_head() for the lwt_xmit prog with headroom of 2. Nothing wrong in that, however, this was run with repeat >> 0 in bpf_prog_test_run_skb() and the same skb thus keeps changing until the pskb_expand_head() called from skb_cow() keeps bailing out in atomic alloc context with -ENOMEM. So upon return we'll basically have 0 headroom left yet blindly do the __skb_push() of 14 bytes and keep copying data from there in bpf_test_finish() out of bounds. Fix to check if we have enough headroom and if pskb_expand_head() fails, bail out with error. Another bug independent of this fix (but related in triggering above) is that BPF_PROG_TEST_RUN should be reworked to reset the skb/xdp buffer to it's original state from input as otherwise repeating the same test in a loop won't work for benchmarking when underlying input buffer is getting changed by the prog each time and reused for the next run leading to unexpected results. Fixes: 1cf1cae963c2 ("bpf: introduce BPF_PROG_TEST_RUN command") Reported-by: syzbot+709412e651e55ed96498@syzkaller.appspotmail.com Reported-by: syzbot+54f39d6ab58f39720a55@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-07-11	bpf: fix availability probing for seg6 helpers	Mathieu Xhonneux
	bpf_lwt_seg6_* helpers require CONFIG_IPV6_SEG6_BPF, and currently return -EOPNOTSUPP to indicate unavailability. This patch forces the BPF verifier to reject programs using these helpers when !CONFIG_IPV6_SEG6_BPF, allowing users to more easily probe if they are available or not. Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-11	infiniband: i40iw, nes: don't use wall time for TCP sequence numbers	Arnd Bergmann
	The nes infiniband driver uses current_kernel_time() to get a nanosecond granunarity timestamp to initialize its tcp sequence counters. This is one of only a few remaining users of that deprecated function, so we should try to get rid of it. Aside from using a deprecated API, there are several problems I see here: - Using a CLOCK_REALTIME based time source makes it predictable in case the time base is synchronized. - Using a coarse timestamp means it only gets updated once per jiffie, making it even more predictable in order to avoid having to access the hardware clock source - The upper 2 bits are always zero because the nanoseconds are at most 999999999. For the Linux TCP implementation, we use secure_tcp_seq(), which appears to be appropriate here as well, and solves all the above problems. i40iw uses a variant of the same code, so I do that same thing there for ipv4. Unlike nes, i40e also supports ipv6, which needs to call secure_tcpv6_seq instead. Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-11	ipv6: xfrm: use 64-bit timestamps	Arnd Bergmann
	get_seconds() is deprecated because it can overflow on 32-bit architectures. For the xfrm_state->lastused member, we treat the data as a 64-bit number already, so we just need to use the right accessor that works on both 32-bit and 64-bit machines. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-07-11	xfrm: use time64_t for in-kernel timestamps	Arnd Bergmann
	The lifetime managment uses '__u64' timestamps on the user space interface, but 'unsigned long' for reading the current time in the kernel with get_seconds(). While this is probably safe beyond y2038, it will still overflow in 2106, and the get_seconds() call is deprecated because fo that. This changes the xfrm time handling to use time64_t consistently, along with reading the time using the safer ktime_get_real_seconds(). It still suffers from problems that can happen from a concurrent settimeofday() call or (to a lesser degree) a leap second update, but since the time stamps are part of the user API, there is nothing we can do to prevent that. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-07-10	sch_cake: Conditionally split GSO segments	Toke Høiland-Jørgensen
	At lower bandwidths, the transmission time of a single GSO segment can add an unacceptable amount of latency due to HOL blocking. Furthermore, with a software shaper, any tuning mechanism employed by the kernel to control the maximum size of GSO segments is thrown off by the artificial limit on bandwidth. For this reason, we split GSO segments into their individual packets iff the shaper is active and configured to a bandwidth <= 1 Gbps. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10	sch_cake: Add overhead compensation support to the rate shaper	Toke Høiland-Jørgensen
	This commit adds configurable overhead compensation support to the rate shaper. With this feature, userspace can configure the actual bottleneck link overhead and encapsulation mode used, which will be used by the shaper to calculate the precise duration of each packet on the wire. This feature is needed because CAKE is often deployed one or two hops upstream of the actual bottleneck (which can be, e.g., inside a DSL or cable modem). In this case, the link layer characteristics and overhead reported by the kernel does not match the actual bottleneck. Being able to set the actual values in use makes it possible to configure the shaper rate much closer to the actual bottleneck rate (our experience shows it is possible to get with 0.1% of the actual physical bottleneck rate), thus keeping latency low without sacrificing bandwidth. The overhead compensation has three tunables: A fixed per-packet overhead size (which, if set, will be accounted from the IP packet header), a minimum packet size (MPU) and a framing mode supporting either ATM or PTM framing. We include a set of common keywords in TC to help users configure the right parameters. If no overhead value is set, the value reported by the kernel is used. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10	sch_cake: Add DiffServ handling	Toke Høiland-Jørgensen
	This adds support for DiffServ-based priority queueing to CAKE. If the shaper is in use, each priority tier gets its own virtual clock, which limits that tier's rate to a fraction of the overall shaped rate, to discourage trying to game the priority mechanism. CAKE defaults to a simple, three-tier mode that interprets most code points as "best effort", but places CS1 traffic into a low-priority "bulk" tier which is assigned 1/16 of the total rate, and a few code points indicating latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7) into a "latency sensitive" high-priority tier, which is assigned 1/4 rate. The other supported DiffServ modes are a 4-tier mode matching the 802.11e precedence rules, as well as two 8-tier modes, one of which implements strict precedence of the eight priority levels. This commit also adds an optional DiffServ 'wash' mode, which will zero out the DSCP fields of any packet passing through CAKE. While this can technically be done with other mechanisms in the kernel, having the feature available in CAKE significantly decreases configuration complexity; and the implementation cost is low on top of the other DiffServ-handling code. Filters and applications can set the skb->priority field to override the DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches CAKE's qdisc handle, the minor number will be interpreted as a priority tier if it is less than or equal to the number of configured priority tiers. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-10	sch_cake: Add NAT awareness to packet classifier	Toke Høiland-Jørgensen
	When CAKE is deployed on a gateway that also performs NAT (which is a common deployment mode), the host fairness mechanism cannot distinguish internal hosts from each other, and so fails to work correctly. To fix this, we add an optional NAT awareness mode, which will query the kernel conntrack mechanism to obtain the pre-NAT addresses for each packet and use that in the flow and host hashing. When the shaper is enabled and the host is already performing NAT, the cost of this lookup is negligible. However, in unlimited mode with no NAT being performed, there is a significant CPU cost at higher bandwidths. For this reason, the feature is turned off by default. Cc: netfilter-devel@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>