Age | Commit message (Collapse) | Author |
|
Currently incorrect default hugepage pool size is reported by proc
nr_hugepages when number of pages for the default huge page size is
specified twice.
When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
indicates the current number of pre-allocated huge pages of the default
size. Basically /proc/sys/vm/nr_hugepages displays default_hstate->
max_huge_pages and after boot time pre-allocation, max_huge_pages should
equal the number of pre-allocated pages (nr_hugepages).
Test case:
Note that this is specific to x86 architecture.
Boot the kernel with command line option 'default_hugepagesz=1G
hugepages=X hugepagesz=2M hugepages=Y hugepagesz=1G hugepages=Z'. After
boot, 'cat /proc/sys/vm/nr_hugepages' and 'sysctl -a | grep hugepages'
returns the value X. However, dmesg output shows that Z huge pages were
pre-allocated.
So, the root cause of the problem here is that the global variable
default_hstate_max_huge_pages is set if a default huge page size is
specified (directly or indirectly) on the command line. After the command
line processing in hugetlb_init, if default_hstate_max_huge_pages is set,
the value is assigned to default_hstae.max_huge_pages. However,
default_hstate.max_huge_pages may have already been set based on the
number of pre-allocated huge pages of default_hstate size.
The solution to this problem is if hstate->max_huge_pages is already set
then it should not set as a result of global max_huge_pages value.
Basically if the value of the variable hugepages is set multiple times on
a command line for a specific supported hugepagesize then proc layer
should consider the last specified value.
Signed-off-by: Vaishali Thakkar <vaishali.thakkar@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") has
moved up the pte_page(pte) in x86's fast gup_pte_range(), for no
discernible reason: put it back where it belongs, after the pte_flags
check and the pfn_valid cross-check.
That may be the cause of the NULL pointer dereference in
gup_pte_range(), seen when vfio called vaddr_get_pfn() when starting a
qemu-kvm based VM.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Michael Long <Harn-Solo@gmx.de>
Tested-by: Michael Long <Harn-Solo@gmx.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We don't require a dedicated thread for fsnotify cleanup. Switch it
over to a workqueue job instead that runs on the system_unbound_wq.
In the interest of not thrashing the queued job too often when there are
a lot of marks being removed, we delay the reaper job slightly when
queueing it, to allow several to gather on the list.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit c510eff6beba ("fsnotify: destroy marks with
call_srcu instead of dedicated thread").
Eryu reported that he was seeing some OOM kills kick in when running a
testcase that adds and removes inotify marks on a file in a tight loop.
The above commit changed the code to use call_srcu to clean up the
marks. While that does (in principle) work, the srcu callback job is
limited to cleaning up entries in small batches and only once per jiffy.
It's easily possible to overwhelm that machinery with too many call_srcu
callbacks, and Eryu's reproduer did just that.
There's also another potential problem with using call_srcu here. While
you can obviously sleep while holding the srcu_read_lock, the callbacks
run under local_bh_disable, so you can't sleep there.
It's possible when putting the last reference to the fsnotify_mark that
we'll end up putting a chain of references including the fsnotify_group,
uid, and associated keys. While I don't see any obvious ways that that
could occurs, it's probably still best to avoid using call_srcu here
after all.
This patch reverts the above patch. A later patch will take a different
approach to eliminated the dedicated thread here.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Cc: Jan Kara <jack@suse.com>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Grazvydas Ignotas has reported a regression in remap_file_pages()
emulation.
Testcase:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define SIZE (4096 * 3)
int main(int argc, char **argv)
{
unsigned long *p;
long i;
p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
if (remap_file_pages(p, 4096, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
perror("remap_file_pages");
return -1;
}
assert(p[0] == 1);
munmap(p, SIZE);
return 0;
}
The second remap_file_pages() fails with -EINVAL.
The reason is that remap_file_pages() emulation assumes that the target
vma covers whole area we want to over map. That assumption is broken by
first remap_file_pages() call: it split the area into two vma.
The solution is to check next adjacent vmas, if they map the same file
with the same flags.
Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Grazvydas Ignotas <notasas@gmail.com>
Tested-by: Grazvydas Ignotas <notasas@gmail.com>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
DAX doesn't deposit pgtables when it maps huge pages: nothing to
withdraw. It can lead to crash.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch updates the code for determining the L4 protocol and L3 header
length so that when IPv6 extension headers are being used we can determine
the offset and type of the L4 protocol.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Recent changes should have enabled support for IPv6 based tunnels and
support for TSO with outer UDP checksums. As such we can update the
feature flags to reflect that.
In addition we can clean-up the flags that aren't needed such as SCTP and
RXCSUM since having the bits there doesn't add any value.
I also found one spot where we were setting the same flag twice. It looks
like it was probably a git merge error that resulted in the line being
duplicated. As such I have dropped it in this patch.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Recent changes should have enabled support for IPv6 based tunnels and
support for TSO with outer UDP checksums. As such we can update the
feature flags to reflect that.
In addition we can clean-up the flags that aren't needed such as SCTP and
RXCSUM since having the bits there doesn't add any value.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
All of the documentation in the datasheets for the XL710 do not call out
any reason to exclude support for IPv6 based tunnels. As such I am
dropping the code that was excluding these tunnel types from having their
port numbers recognized. This way we can take advantage of things such as
checksum offload for inner headers over IPv6 based VXLAN or GENEVE
tunnels.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
This patch contains a number of fixes to make certain that we are using
the correct protocols when parsing both the inner and outer headers of a
frame that is mixed between IPv4 and IPv6 for inner and outer.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Kiran Patil <kiran.patil@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
The XL722 has support for providing the outer UDP tunnel checksum on
transmits. Make use of this feature to support segmenting UDP tunnels with
outer checksums enabled.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
This is mostly a minor clean-up for the Rx checksum path in order to avoid
some of the unnecessary conditional checks that were being applied.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Yuval Mintz says:
====================
qed{,e}: Add vlan filtering offload
This series adds vlan filtering offload to qede.
First patch introduces small additional infrastructure needed in
qed to support it, while second contains the main bulk of driver changes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Device would start receiving only vlan-tagged traffic with tags matching
that of one of the configured vlan IDs, unless:
- Device is expliicly placed in PROMISC mode.
- Device exhausts its vlan filter credits.
Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Today, interfaces are working in vlan-promisc mode; But once
vlan filtering offloaded would be supported, we'll need a method to
control it directly [e.g., when setting device to PROMISC, or when
running out of vlan credits].
This adds the necessary API for L2 client to manually choose whether to
accept all vlans or only those for which filters were configured.
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This avoids a harmless randconfig warning I get when USB_NET_CDC_SUBSET
is enabled, but all of the more specific drivers are not:
drivers/net/usb/cdc_subset.c:241:2: #warning You need to configure some hardware for this driver
The current behavior is clearly intentional, giving a warning when
a user picks a configuration that won't do anything good. The only
reason for even addressing this is that I'm getting close to
eliminating all 'randconfig' warnings on ARM, and this came up
a couple of times.
My workaround is to not even build the module when none of the
configurations are enable.
Alternatively we could simply remove the #warning (nothing wrong
for compile-testing), turn it into a runtime warning, or
change the Kconfig options into a menu to hide CONFIG_USB_NET_CDC_SUBSET.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Files in sysfs are created using the name from the phy_driver struct,
when two names are the same we may get a duplicate filename warning,
fix this.
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Signed-off-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch takes advantage of several assumptions we can make about the
headers of the frame in order to reduce overall processing overhead for
computing the outer header checksum.
First we can assume the entire header is in the region pointed to by
skb->head as this is what csum_start is based on.
Second, as a result of our first assumption, we can just call csum_partial
instead of making a call to skb_checksum which would end up having to
configure things so that we could walk through the frags list.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
follows up commit 45f6fad84cc3 ("ipv6: add complete rcu protection around
np->opt") which added mixed rcu/refcount protection to np->opt.
Given the current implementation of rcu_pointer_handoff(), this has no
effect at runtime.
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In metadata mode, the vxlan interface is not supposed to use the fdb control
plane but an external one (openvswitch or static routes). With the current
code, packets may leak into the fdb handling code which usually causes them
to be dropped anyway but may have strange side effects.
Just drop the packets directly when in metadata mode if the destination data
are not correctly provided on egress.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A return value of the bchannel_get_rxbuf() function is compared with the
positive ENOMEM value instead of the negative -ENOMEM value to detect a
memory allocation problem. Thus, after a possible memory allocation
failure the bc->bch.rx_skb will be NULL which will lead to a NULL
pointer dereference.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The cfrfml_receive() function might return positive value EPROTO
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The atalk_sendmsg() function might return wrong value ENETUNREACH
instead of -ENETUNREACH.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Failure of kzalloc should cause the enclosing function
to return -ENOMEM, not -ENODEV.
Additionally, removed the following checkpatch warnings:
ERROR: spaces required around that '==' (ctx:VxV)
ERROR: space required before the open parenthesis '('
CHECK: Comparison to NULL could be written "!lp"
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
My implementation around IFF_NO_QUEUE driver flag assumed that leaving
tx_queue_len untouched (specifically: not setting it to zero) by drivers
would make it possible to assign a regular qdisc to them without having
to worry about setting tx_queue_len to a useful value. This was only
partially true: I overlooked that some drivers don't call ether_setup()
and therefore not initialize tx_queue_len to the default value of 1000.
Consequently, removing the workarounds in place for that case in qdisc
implementations which cared about it (namely, pfifo, bfifo, gred, htb,
plug and sfb) leads to problems with these specific interface types and
qdiscs.
Luckily, there's already a sanitization point for drivers setting
tx_queue_len to zero, which can be reused to assign the fallback value
most qdisc implementations used, which is 1.
Fixes: 348e3435cbefa ("net: sched: drop all special handling of tx_queue_len == 0")
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ether_setup sets IFF_TX_SKB_SHARING but this is not supported by gre
as it modifies the skb on xmit.
Also, clean up whitespace in ipgre_tap_setup when we're already touching it.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ether_setup sets IFF_TX_SKB_SHARING but this is not supported by
geneve as it modifies the skb on xmit.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ether_setup sets IFF_TX_SKB_SHARING but this is not supported by vxlan
as it modifies the skb on xmit.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Jiri Benc says:
====================
iptunnel: scrub packet in iptunnel_pull_header
As every IP tunnel has to scrub skb on decapsulation, iptunnel_pull_header
tried to do that and open coded part of skb_scrub_packet. Various tunneling
protocols (VXLAN, Geneve) then called full skb_scrub_packet on their own,
duplicating part of the scrubbing already done.
Consolidate the code, calling skb_scrub_packet from iptunnel_pull_header.
This will allow additional cleanups in VXLAN code, as the packet is scrubbed
early during rx processing after this patchset and VXLAN can start filling
out skb fields earlier.
The full picture of vxlan cleanup patches can be seen at:
https://github.com/jbenc/linux-vxlan/commits/master
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Part of skb_scrub_packet was open coded in iptunnel_pull_header. Let it call
skb_scrub_packet directly instead.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is in preparation for iptunnel_pull_header calling skb_scrub_packet.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is in preparation for iptunnel_pull_header calling skb_scrub_packet.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Similarly to the existing vxlan_get_sk_family.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Remove the shared br_log_state function and print the info directly in
br_set_state, where the net_bridge_port state is actually changed.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Hariprasad Shenai says:
====================
cxgb4: Use __dev_[um]c_[un]sync for MAC address syncing
This patch series adds support to use __dev_uc_sync/__dev_mc_sync to add
MAC address and __dev_uc_unsync/__dev_mc_unsync to delete MAC address.
This patch series has been created against net-next tree and includes
patches on cxgb4 and cxgb4vf driver.
We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add exception handling to the Tx checksum path so that we can handle cases
of TSO where the frame is bad, or Tx checksum where we didn't recognize a
protocol
Drop I40E_TX_FLAGS_CSUM as it is unused, move the CHECKSUM_PARTIAL check
into the function itself so that we can decrease indent.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
This patch defers writing to the Tx descriptor bits until we know we have
successfully completed a given operation. So for example we defer updating
the tunnelling portion of the context descriptor until we have fully
identified the type.
The advantage to this approach is that we can assemble values as we go
instead of having to try and kludge everything together all at once. As a
result we can significantly clean up the tunneling configuration for
instance as we can just do a pointer walk and do the math for the distance
between each set of points.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Calculate the maximum MTU taking into account the size of headers
involved in GENEVE encapsulation, as for other tunnel types.
Changes in v3:
- Correct comment style
Changes in v2:
- Conform more closely to ip_tunnel_change_mtu
- Exclude GENEVE options from max MTU calculation
Signed-off-by: David Wragg <david@weave.works>
Acked-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The tun_id field in struct ip_tunnel_key is __be64, not __be32. We need to
convert the vni to tun_id correctly.
Fixes: 54bfd872bf16 ("vxlan: keep flags and vni in network byte order")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds support for IPv6 extension headers in setting up the Tx
checksum. Without this patch extension headers would cause IPv6 traffic to
fail as the transport protocol could not be identified.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
This patch fixes two issues. First was the fact that iphdr(skb)->protocl
was being used to test for the outer transport protocol. This completely
breaks IPv6 support. Second was the fact that we cleared the flag for v4
going to v6, but we didn't take care of txflags going the other way. As
such we would have the v6 flag still set even if the inner header was v4.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
The Tx checksum path was maintaining a set of 3 pointers and two lengths in
order to prepare the packet for being checksummed. The thing is we only
really needed 2 pointers, and the lengths that were being maintained can
easily be computed.
As such we can replace the IPv4 and IPv6 header pointers with one single
union that represents both, or a generic pointer to the start of the
network header. For the L4 headers we can do the same with TCP and a
generic pointer to the start of the transport header. The length of the
TCP header is obtained by simply multiplying doff by 4, and the network
header length can be obtained by subtracting the network header pointer
from the transport header pointer.
While I was at it I renamed l4_hdr to l4_proto to make it a bit more clear
and less likely to be confused with l4.hdr which is the transport header
pointer.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
This patch goes through and pulls all of the spots where we were updating
either the TCP or IP checksums in the TSO and checksum path into the TSO
function. The general idea here is that we should only be updating the
header after we verify we have completed a skb_cow_head check to verify the
head is writable.
One other advantage to doing this is that it makes things much more
obvious. For example, in the case of IPv6 there was one spot where the
offset of the IPv4 header checksum was being updated which is obviously
incorrect.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
This patch makes it so that the L4 header offsets and such can be ignored
when dealing with the L3 checksum and length update. This is done making
use of two things.
First we can just use the offset from the L4 header to the start of the
packet to determine the L4 offset, and from that we can then make use of
the data offset to determine the full length of the headers.
As far as adjusting the checksum to remove the length we can simply add the
inverse of the length instead of having to recompute the entire
pseudo-header without the length. In the case of an IPv6 header this
should be significantly cheaper since we can make use of a value we already
needed instead of having to read the source and destination address out of
the packet.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Instead of casing u32 values to u64 it makes more sense to just start out
with u64 values in the first place. This way we don't need to create a
mess with all of the casts needed to populate a 64b value.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
The i40e and i40evf drivers contained code for inserting an outer checksum
on UDP tunnels. The issue however is that the upper levels of the stack
never requested such an offload and it results in possible errors.
In addition the same logic was being applied to the Rx side where it was
attempting to validate the outer checksum, but the logic there was
incorrect in that it was testing for the resultant sum to be equal to the
header checksum instead of being equal to 0.
Since this code is so massively flawed, and doing things that we didn't ask
for it to do I am just dropping it, and will bring it back later to use as
an offload for SKB_GSO_UDP_TUNNEL_CSUM which can make use of such a
feature.
As far as the Rx feature I am dropping it completely since it would need to
be massively expanded and applied to IPv4 and IPv6 checksums for all parts,
not just the one that supports Tx checksum offload for the outer.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Florian Westphal says:
====================
netlink: remove mmapped netlink support
As discussed during netconf 2016 in Seville, this series removes
CONFIG_NETLINK_MMAP.
Close to three years after it was merged it has retained several problems
that do not appear to be fixable.
No official netfilter libmnl release contains support for mmap backed netlink
sockets. No openvswitch release makes use of it either.
To use the mmap interface, userspace not only has to probe for mmap netlink
support, it also has to implement a recv/socket receive path in order to
handle messages that exceed the size of an rx ring element (NL_MMAP_STATUS_COPY).
So if there are odd programs out there that attempt to use MMAP netlink
they should continue to work as they already need a socket based code path
to work properly.
The actual revert (first patch) has a list of problems.
The followup patches remove a couple of helpers that are no longer needed
after the revert.
I did a few tests with mmap vs. socket based interface on a 4.4 based
kernel on an i7-4790 box and there are no performance advantages:
loopback, single nfqueue, queueing in -t filter INPUT:
traffic generated by 8 * ping -q -f localhost:
socket backend:
real 0m27.325s
user 0m3.993s
sys 0m23.292s
with mmap ring backend:
real 0m29.054s
user 0m4.924s
sys 0m24.127s
with single tcp stream, unidirectional, loopback mtu set at 1500
(nc localhost discard < /dev/zero > /dev/null):
socket interface:
time nfqdump -b $((8 * 1024 * 1024 * 1024)) -w /dev/null
real 0m15.960s
user 0m1.756s
sys 0m11.143s
mmap ring:
real 0m16.441s
user 0m3.040s
sys 0m13.687s
socket interface nfqdump[1] with --gso option (i.e. MTU is exceeded,
no kernel-side segmentation and checksum fixups) completes in about 5s.
I also tested dumping a conntrack table with 1m entries.
On my box this takes about 2.4 seconds for both mmap and socket backend:
time LD_PRELOAD=../../src/.libs/libmnl.so ./nfct-dump-sk > /dev/null
mnl_cb_run: Success
messages: 1000000
real 0m2.485s
user 0m1.085s
sys 0m1.400s
time LD_PRELOAD=../../src/.libs/libmnl.so ./nfct-dump-mmap > /dev/null
messages: 1000000
real 0m2.451s
user 0m1.124s
sys 0m1.328s
[1] https://git.breakpoint.cc/cgit/fw/nfqdump.git/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|