Age | Commit message (Collapse) | Author |
|
NFS: NFSoRDMA Client Side Changes
These patches include several bugfixes and cleanups for the NFSoRDMA client.
This includes bugfixes for NFS v4.1, proper RDMA_ERROR handling, and fixes
from the recent workqueue swicchover. These patches also switch xprtrdma to
use the new CQ API
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* tag 'nfs-rdma-4.6-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (787 commits)
xprtrdma: Use new CQ API for RPC-over-RDMA client send CQs
xprtrdma: Use an anonymous union in struct rpcrdma_mw
xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQs
xprtrdma: Serialize credit accounting again
xprtrdma: Properly handle RDMA_ERROR replies
rpcrdma: Add RPCRDMA_HDRLEN_ERR
xprtrdma: Do not wait if ib_post_send() fails
xprtrdma: Segment head and tail XDR buffers on page boundaries
xprtrdma: Clean up dprintk format string containing a newline
xprtrdma: Clean up physical_op_map()
xprtrdma: Clean up unused RPCRDMA_INLINE_PAD_THRESH macro
|
|
Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b2498e
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.
By converting to this new API, xprtrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.
Because each ib_cqe carries a pointer to a completion method, the
core can now post its own operations on a consumer's QP, and handle
the completions itself, without changes to the consumer.
Send completions were previously handled entirely in the completion
upcall handler (ie, deferring to a process context is unneeded).
Thus IB_POLL_SOFTIRQ is a direct replacement for the current
xprtrdma send code path.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Clean up: Make code more readable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b2498e
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.
By converting to this new API, xprtrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.
Because each ib_cqe carries a pointer to a completion method, the
core can now post its own operations on a consumer's QP, and handle
the completions itself, without changes to the consumer.
xprtrdma's reply processing is already handled in a work queue, but
there is some initial order-dependent processing that is done in the
soft IRQ context before a work item is scheduled.
IB_POLL_SOFTIRQ is a direct replacement for the current xprtrdma
receive code path.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Commit fe97b47cd623 ("xprtrdma: Use workqueue to process RPC/RDMA
replies") replaced the reply tasklet with a workqueue that allows
RPC replies to be processed in parallel. Thus the credit values in
RPC-over-RDMA replies can be applied in a different order than in
which the server sent them.
To fix this, revert commit eba8ff660b2d ("xprtrdma: Move credit
update to RPC reply handler"). Reverting is done by hand to
accommodate code changes that have occurred since then.
Fixes: fe97b47cd623 ("xprtrdma: Use workqueue to process . . .")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
These are shorter than RPCRDMA_HDRLEN_MIN, and they need to
complete the waiting RPC.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
If ib_post_send() in ro_unmap_sync() fails, the WRs have not been
posted, no completions will fire, and wait_for_completion() will
wait forever. Skip the wait in that case.
To ensure the MRs are invalid, disconnect.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
A single memory allocation is used for the pair of buffers wherein
the RPC client builds an RPC call message and decodes its matching
reply. These buffers are sized based on the maximum possible size
of the RPC call and reply messages for the operation in progress.
This means that as the call buffer increases in size, the start of
the reply buffer is pushed farther into the memory allocation.
RPC requests are growing in size. It used to be that both the call
and reply buffers fit inside a single page.
But these days, thanks to NFSv4 (and especially security labels in
NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN,
for example, now requires a 6KB allocation for a pair of call and
reply buffers, and NFSv4 LOOKUP is not far behind.
As the maximum size of a call increases, the reply buffer is pushed
far enough into the buffer's memory allocation that a page boundary
can appear in the middle of it.
When the maximum possible reply size is larger than the client's
RDMA receive buffers (currently 1KB), the client has to register a
Reply chunk for the server to RDMA Write the reply into.
The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and
tail buffers would always be contained on a single page. It supplies
just one segment for the head and one for the tail.
FMR, for example, registers up to a page boundary (only a portion of
the reply buffer in the OPEN case above). But without additional
segments, it doesn't register the rest of the buffer.
When the server tries to write the OPEN reply, the RDMA Write fails
with a remote access error since the client registered only part of
the Reply chunk.
rpcrdma_convert_iovs() must split the XDR buffer into multiple
segments, each of which are guaranteed not to contain a page
boundary. That way fmr_op_map is given the proper number of segments
to register the whole reply buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
physical_op_unmap{_sync} don't use mr_nsegs, so don't bother to set
it in physical_op_map.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
"There are two small messenger bug fixes and a log spam regression fix"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: don't spam dmesg with stray reply warnings
libceph: use the right footer size when skipping a message
libceph: don't bail early from try_read() when skipping a message
|
|
Pull nfsd bugfix from Bruce Fields:
"One fix for a bug that could cause a NULL write past the end of a
buffer in case of unusually long writes to some system interfaces used
by mountd and other nfs support utilities"
* tag 'nfsd-4.5-1' of git://linux-nfs.org/~bfields/linux:
sunrpc/cache: fix off-by-one in qword_get()
|
|
Commit d15f9d694b77 ("libceph: check data_len in ->alloc_msg()")
mistakenly bumped the log level on the "tid %llu unknown, skipping"
message. Turn it back into a dout() - stray replies are perfectly
normal when OSDs flap, crash, get killed for testing purposes, etc.
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
|
|
ceph_msg_footer is 21 bytes long, while ceph_msg_footer_old is only 13.
Don't skip too much when CEPH_FEATURE_MSG_AUTH isn't negotiated.
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
|
|
The contract between try_read() and try_write() is that when called
each processes as much data as possible. When instructed by osd_client
to skip a message, try_read() is violating this contract by returning
after receiving and discarding a single message instead of checking for
more. try_write() then gets a chance to write out more requests,
generating more replies/skips for try_read() to handle, forcing the
messenger into a starvation loop.
Cc: stable@vger.kernel.org # 3.10+
Reported-by: Varada Kari <Varada.Kari@sandisk.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Varada Kari <Varada.Kari@sandisk.com>
Reviewed-by: Alex Elder <elder@linaro.org>
|
|
Pull NFS client bugfixes from Trond Myklebust:
"Stable bugfixes:
- Fix nfs_size_to_loff_t
- NFSv4: Fix a dentry leak on alias use
Other bugfixes:
- Don't schedule a layoutreturn if the layout segment can be freed
immediately.
- Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode
- rpcrdma_bc_receive_call() should init rq_private_buf.len
- fix stateid handling for the NFS v4.2 operations
- pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page
- fix panic in gss_pipe_downcall() in fips mode
- Fix a race between layoutget and pnfs_destroy_layout
- Fix a race between layoutget and bulk recalls"
* tag 'nfs-for-4.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.x/pnfs: Fix a race between layoutget and bulk recalls
NFSv4.x/pnfs: Fix a race between layoutget and pnfs_destroy_layout
auth_gss: fix panic in gss_pipe_downcall() in fips mode
pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page
nfs4: fix stateid handling for the NFS v4.2 operations
NFSv4: Fix a dentry leak on alias use
xprtrdma: rpcrdma_bc_receive_call() should init rq_private_buf.len
pNFS: Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode
pNFS: Fix pnfs_mark_matching_lsegs_return()
nfs: fix nfs_size_to_loff_t
|
|
The qword_get() function NUL-terminates its output buffer. If the input
string is in hex format \xXXXX... and the same length as the output
buffer, there is an off-by-one:
int qword_get(char **bpp, char *dest, int bufsize)
{
...
while (len < bufsize) {
...
*dest++ = (h << 4) | l;
len++;
}
...
*dest = '\0';
return len;
}
This patch ensures the NUL terminator doesn't fall outside the output
buffer.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
* multipath:
NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback
pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3
SUNRPC: Allow addition of new transports to a struct rpc_clnt
NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connections
SUNRPC: Make NFS swap work with multipath
SUNRPC: Add a helper to apply a function to all the rpc_clnt's transports
SUNRPC: Allow caller to specify the transport to use
SUNRPC: Use the multipath iterator to assign a transport to each task
SUNRPC: Make rpc_clnt store the multipath iterators
SUNRPC: Add a structure to track multiple transports
SUNRPC: Make freeing of struct xprt rcu-safe
SUNRPC: Uninline xprt_get(); It isn't performance critical.
SUNRPC: Reorder rpc_task to put waitqueue related info in same cachelines
SUNRPC: Remove unused function rpc_task_reset_client
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:
====================
pull request: bluetooth 2016-02-20
Here's an important patch for 4.5 which fixes potential invalid pointer
access when processing completed Bluetooth HCI commands.
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Dmitry Vyukov noted recently that the sctp_port_hashtable had an error in
its size computation, observing that the current method never guaranteed
that the hashsize (measured in number of entries) would be a power of two,
which the input hash function for that table requires. The root cause of
the problem is that two values need to be computed (one, the allocation
order of the storage requries, as passed to __get_free_pages, and two the
number of entries for the hash table). Both need to be ^2, but for
different reasons, and the existing code is simply computing one order
value, and using it as the basis for both, which is wrong (i.e. it assumes
that ((1<<order)*PAGE_SIZE)/sizeof(bucket) is still ^2 when its not).
To fix this, we change the logic slightly. We start by computing a goal
allocation order (which is limited by the maximum size hash table we want
to support. Then we attempt to allocate that size table, decreasing the
order until a successful allocation is made. Then, with the resultant
successful order we compute the number of buckets that hash table supports,
which we then round down to the nearest power of two, giving us the number
of entries the table actually supports.
I've tested this locally here, using non-debug and spinlock-debug kernels,
and the number of entries in the hashtable consistently work out to be
powers of two in all cases.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
CC: Dmitry Vyukov <dvyukov@google.com>
CC: Vladislav Yasevich <vyasevich@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In commit 44d271377479 ("Bluetooth: Compress the size of struct
hci_ctrl") we squashed down the size of the structure by using a union
with the assumption that all users would use the flag to determine
whether we had a req_complete or a req_complete_skb.
Unfortunately we had a case in hci_req_cmd_complete() where we weren't
looking at the flag. This can result in a situation where we might be
storing a hci_req_complete_skb_t in a hci_req_complete_t variable, or
vice versa.
During some testing I found at least one case where the function
hci_req_sync_complete() was called improperly because the kernel thought
that it didn't require an SKB. Looking through the stack in kgdb I
found that it was called by hci_event_packet() and that
hci_event_packet() had both of its locals "req_complete" and
"req_complete_skb" pointing to the same place: both to
hci_req_sync_complete().
Let's make sure we always check the flag.
For more details on debugging done, see <http://crbug.com/588288>.
Fixes: 44d271377479 ("Bluetooth: Compress the size of struct hci_ctrl")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
The unix_stream_read_generic function tries to use a continue statement
to restart the receive loop after waiting for a message. This may not
work as intended as the caller might use a recvmsg call to peek at
control messages without specifying a message buffer. If this was the
case, the continue will cause the function to return without an error
and without the credential information if the function had to wait for a
message while it had returned with the credentials otherwise. Change to
using goto to restart the loop without checking the condition first in
this case so that credentials are returned either way.
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The value passed by unix_diag_get_exact to unix_lookup_by_ino has type
__u32, but unix_lookup_by_ino's argument ino has type int, which is not
a problem yet.
However, when ino is compared with sock_i_ino return value of type
unsigned long, ino is sign extended to signed long, and this results
to incorrect comparison on 64-bit architectures for inode numbers
greater than INT_MAX.
This bug was found by strace test suite.
Fixes: 5d3cae8bc39d ("unix_diag: Dumping exact socket core")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
the commit 35e2d1152b22 ("tunnels: Allow IPv6 UDP checksums to be
correctly controlled.") changed the default xmit checksum setting
for lwt vxlan/geneve ipv6 tunnels, so that now the checksum is not
set into external UDP header.
This commit changes the rx checksum setting for both lwt vxlan/geneve
devices created by openvswitch accordingly, so that lwt over ipv6
tunnel pairs are again able to communicate with default values.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tipc_bcast_unlock need to be unlocked in error path.
Signed-off-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Antonio Quartulli says:
====================
Two of the fixes included in this patchset prevent wrong memory
access - it was triggered when removing an object from a list
after it was already free'd due to bad reference counting.
This misbehaviour existed for both the gw_node and the
orig_node_vlan object and has been fixed by Sven Eckelmann.
The last patch fixes our interface feasibility check and prevents
it from looping indefinitely when two net_device objects
reference each other via iflink index (i.e. veth pair), by
Andrew Lunn
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
An error response from a RTM_GETNETCONF request can return the positive
error value EINVAL in the struct nlmsgerr that can mislead userspace.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When I used netdev_for_each_lower_dev in commit bad531623253 ("vrf:
remove slave queue and private slave struct") I thought that it acts
like netdev_for_each_lower_private and can be used to remove the current
device from the list while walking, but unfortunately it acts more like
netdev_for_each_lower_private_rcu and doesn't allow it. The difference
is where the "iter" points to, right now it points to the current element
and that makes it impossible to remove it. Change the logic to be
similar to netdev_for_each_lower_private and make it point to the "next"
element so we can safely delete the current one. VRF is the only such
user right now, there's no change for the read-only users.
Here's what can happen now:
[98423.249858] general protection fault: 0000 [#1] SMP
[98423.250175] Modules linked in: vrf bridge(O) stp llc nfsd auth_rpcgss
oid_registry nfs_acl nfs lockd grace sunrpc crct10dif_pclmul
crc32_pclmul crc32c_intel ghash_clmulni_intel jitterentropy_rng
sha256_generic hmac drbg ppdev aesni_intel aes_x86_64 glue_helper lrw
gf128mul ablk_helper cryptd evdev serio_raw pcspkr virtio_balloon
parport_pc parport i2c_piix4 i2c_core virtio_console acpi_cpufreq button
9pnet_virtio 9p 9pnet fscache ipv6 autofs4 ext4 crc16 mbcache jbd2 sg
virtio_blk virtio_net sr_mod cdrom e1000 ata_generic ehci_pci uhci_hcd
ehci_hcd usbcore usb_common virtio_pci ata_piix libata floppy
virtio_ring virtio scsi_mod [last unloaded: bridge]
[98423.255040] CPU: 1 PID: 14173 Comm: ip Tainted: G O
4.5.0-rc2+ #81
[98423.255386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.8.1-20150318_183358- 04/01/2014
[98423.255777] task: ffff8800547f5540 ti: ffff88003428c000 task.ti:
ffff88003428c000
[98423.256123] RIP: 0010:[<ffffffff81514f3e>] [<ffffffff81514f3e>]
netdev_lower_get_next+0x1e/0x30
[98423.256534] RSP: 0018:ffff88003428f940 EFLAGS: 00010207
[98423.256766] RAX: 0002000100000004 RBX: ffff880054ff9000 RCX:
0000000000000000
[98423.257039] RDX: ffff88003428f8b8 RSI: ffff88003428f950 RDI:
ffff880054ff90c0
[98423.257287] RBP: ffff88003428f940 R08: 0000000000000000 R09:
0000000000000000
[98423.257537] R10: 0000000000000001 R11: 0000000000000000 R12:
ffff88003428f9e0
[98423.257802] R13: ffff880054a5fd00 R14: ffff88003428f970 R15:
0000000000000001
[98423.258055] FS: 00007f3d76881700(0000) GS:ffff88005d000000(0000)
knlGS:0000000000000000
[98423.258418] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[98423.258650] CR2: 00007ffe5951ffa8 CR3: 0000000052077000 CR4:
00000000000406e0
[98423.258902] Stack:
[98423.259075] ffff88003428f960 ffffffffa0442636 0002000100000004
ffff880054ff9000
[98423.259647] ffff88003428f9b0 ffffffff81518205 ffff880054ff9000
ffff88003428f978
[98423.260208] ffff88003428f978 ffff88003428f9e0 ffff88003428f9e0
ffff880035b35f00
[98423.260739] Call Trace:
[98423.260920] [<ffffffffa0442636>] vrf_dev_uninit+0x76/0xa0 [vrf]
[98423.261156] [<ffffffff81518205>]
rollback_registered_many+0x205/0x390
[98423.261401] [<ffffffff815183ec>] unregister_netdevice_many+0x1c/0x70
[98423.261641] [<ffffffff8153223c>] rtnl_delete_link+0x3c/0x50
[98423.271557] [<ffffffff815335bb>] rtnl_dellink+0xcb/0x1d0
[98423.271800] [<ffffffff811cd7da>] ? __inc_zone_state+0x4a/0x90
[98423.272049] [<ffffffff815337b4>] rtnetlink_rcv_msg+0x84/0x200
[98423.272279] [<ffffffff810cfe7d>] ? trace_hardirqs_on+0xd/0x10
[98423.272513] [<ffffffff8153370b>] ? rtnetlink_rcv+0x1b/0x40
[98423.272755] [<ffffffff81533730>] ? rtnetlink_rcv+0x40/0x40
[98423.272983] [<ffffffff8155d6e7>] netlink_rcv_skb+0x97/0xb0
[98423.273209] [<ffffffff8153371a>] rtnetlink_rcv+0x2a/0x40
[98423.273476] [<ffffffff8155ce8b>] netlink_unicast+0x11b/0x1a0
[98423.273710] [<ffffffff8155d2f1>] netlink_sendmsg+0x3e1/0x610
[98423.273947] [<ffffffff814fbc98>] sock_sendmsg+0x38/0x70
[98423.274175] [<ffffffff814fc253>] ___sys_sendmsg+0x2e3/0x2f0
[98423.274416] [<ffffffff810d841e>] ? do_raw_spin_unlock+0xbe/0x140
[98423.274658] [<ffffffff811e1bec>] ? handle_mm_fault+0x26c/0x2210
[98423.274894] [<ffffffff811e19cd>] ? handle_mm_fault+0x4d/0x2210
[98423.275130] [<ffffffff81269611>] ? __fget_light+0x91/0xb0
[98423.275365] [<ffffffff814fcd42>] __sys_sendmsg+0x42/0x80
[98423.275595] [<ffffffff814fcd92>] SyS_sendmsg+0x12/0x20
[98423.275827] [<ffffffff81611bb6>] entry_SYSCALL_64_fastpath+0x16/0x7a
[98423.276073] Code: c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66
90 48 8b 06 55 48 81 c7 c0 00 00 00 48 89 e5 48 8b 00 48 39 f8 74 09 48
89 06 <48> 8b 40 e8 5d c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66
[98423.279639] RIP [<ffffffff81514f3e>] netdev_lower_get_next+0x1e/0x30
[98423.279920] RSP <ffff88003428f940>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: David S. Miller <davem@davemloft.net>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Vlad Yasevich <vyasevic@redhat.com>
Fixes: bad531623253 ("vrf: remove slave queue and private slave struct")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
NFS: NFSoRDMA Client Bugfix
This patch fixes a bug where NFS v4.1 callbacks were returning
RPC_GARBAGE_ARGS to the server.
Signed-off-by: Anna Schumaker <Anna@OcarinaProject.net>
|
|
The cfrfml_receive() function might return positive value EPROTO
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The atalk_sendmsg() function might return wrong value ENETUNREACH
instead of -ENETUNREACH.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
My implementation around IFF_NO_QUEUE driver flag assumed that leaving
tx_queue_len untouched (specifically: not setting it to zero) by drivers
would make it possible to assign a regular qdisc to them without having
to worry about setting tx_queue_len to a useful value. This was only
partially true: I overlooked that some drivers don't call ether_setup()
and therefore not initialize tx_queue_len to the default value of 1000.
Consequently, removing the workarounds in place for that case in qdisc
implementations which cared about it (namely, pfifo, bfifo, gred, htb,
plug and sfb) leads to problems with these specific interface types and
qdiscs.
Luckily, there's already a sanitization point for drivers setting
tx_queue_len to zero, which can be reused to assign the fallback value
most qdisc implementations used, which is 1.
Fixes: 348e3435cbefa ("net: sched: drop all special handling of tx_queue_len == 0")
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ether_setup sets IFF_TX_SKB_SHARING but this is not supported by gre
as it modifies the skb on xmit.
Also, clean up whitespace in ipgre_tap_setup when we're already touching it.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Ilya reported following lockdep splat:
kernel: =========================
kernel: [ BUG: held lock freed! ]
kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
kernel: -------------------------
kernel: swapper/5/0 is freeing memory
ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
kernel: 4 locks held by swapper/5/0:
kernel: #0: (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
netif_receive_skb_internal+0x4b/0x1f0
kernel: #1: (rcu_read_lock){......}, at: [<ffffffff816e977f>]
ip_local_deliver_finish+0x3f/0x380
kernel: #2: (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
sk_clone_lock+0x19b/0x440
kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
To properly fix this issue, inet_csk_reqsk_queue_add() needs
to return to its callers if the child as been queued
into accept queue.
We also need to make sure listener is still there before
calling sk->sk_data_ready(), by holding a reference on it,
since the reference carried by the child can disappear as
soon as the child is put on accept queue.
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Fixes: ebb516af60e1 ("tcp/dccp: fix race at listener dismantle phase")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since the gc of ipv4 route was removed, the route cached would has
no chance to be removed, and even it has been timeout, it still could
be used, cause no code to check it's expires.
Fix this issue by checking and removing route cache when we get route.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
actions could change the etherproto in particular with ethernet
tunnelled data. Typically such actions, after peeling the outer header,
will ask for the packet to be reclassified. We then need to restart
the classification with the new proto header.
Example setup used to catch this:
sudo tc qdisc add dev $ETH ingress
sudo $TC filter add dev $ETH parent ffff: pref 1 protocol 802.1Q \
u32 match u32 0 0 flowid 1:1 \
action vlan pop reclassify
Fixes: 3b3ae880266d ("net: sched: consolidate tc_classify{,_compat}")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
crypto_alloc_hash never returns NULL
Signed-off-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
With commit 0071f56e46da ("dsa: Register netdev before phy"), we are now trying
to free a network device that has been previously registered, and in case of
errors, this will make us hit the BUG_ON(dev->reg_state != NETREG_UNREGISTERED)
condition.
Fix this by adding a missing unregister_netdev() before free_netdev().
Fixes: 0071f56e46da ("dsa: Register netdev before phy")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A previous commit (33f72e6) added notification via netlink for tunnels
when created/modified/deleted. If the notification returned an error,
this error was returned from the tunnel function. If there were no
listeners, the error code ESRCH was returned, even though having no
listeners is not an error. Other calls to this and other similar
notification functions either ignore the error code, or filter ESRCH.
This patch checks for ESRCH and does not flag this as an error.
Reviewed-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
On Mon, 15 Feb 2016, Trond Myklebust wrote:
> Hi Scott,
>
> On Mon, Feb 15, 2016 at 2:28 PM, Scott Mayhew <smayhew@redhat.com> wrote:
> > md5 is disabled in fips mode, and attempting to import a gss context
> > using md5 while in fips mode will result in crypto_alg_mod_lookup()
> > returning -ENOENT, which will make its way back up to
> > gss_pipe_downcall(), where the BUG() is triggered. Handling the -ENOENT
> > allows for a more graceful failure.
> >
> > Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> > ---
> > net/sunrpc/auth_gss/auth_gss.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
> > index 799e65b..c30fc3b 100644
> > --- a/net/sunrpc/auth_gss/auth_gss.c
> > +++ b/net/sunrpc/auth_gss/auth_gss.c
> > @@ -737,6 +737,9 @@ gss_pipe_downcall(struct file *filp, const char __user *src, size_t mlen)
> > case -ENOSYS:
> > gss_msg->msg.errno = -EAGAIN;
> > break;
> > + case -ENOENT:
> > + gss_msg->msg.errno = -EPROTONOSUPPORT;
> > + break;
> > default:
> > printk(KERN_CRIT "%s: bad return from "
> > "gss_fill_context: %zd\n", __func__, err);
> > --
> > 2.4.3
> >
>
> Well debugged, but I unfortunately do have to ask if this patch is
> sufficient? In addition to -ENOENT, and -ENOMEM, it looks to me as if
> crypto_alg_mod_lookup() can also fail with -EINTR, -ETIMEDOUT, and
> -EAGAIN. Don't we also want to handle those?
You're right, I was focusing on the panic that I could easily reproduce.
I'm still not sure how I could trigger those other conditions.
>
> In fact, peering into the rats nest that is
> gss_import_sec_context_kerberos(), it looks as if that is just a tiny
> subset of all the errors that we might run into. Perhaps the right
> thing to do here is to get rid of the BUG() (but keep the above
> printk) and just return a generic error?
That sounds fine to me -- updated patch attached.
-Scott
>From d54c6b64a107a90a38cab97577de05f9a4625052 Mon Sep 17 00:00:00 2001
From: Scott Mayhew <smayhew@redhat.com>
Date: Mon, 15 Feb 2016 15:12:19 -0500
Subject: [PATCH] auth_gss: remove the BUG() from gss_pipe_downcall()
Instead return a generic error via gss_msg->msg.errno. None of the
errors returned by gss_fill_context() should necessarily trigger a
kernel panic.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Some NFSv4.1 OPEN requests were hanging waiting for the NFS server
to finish recalling delegations. Turns out that each NFSv4.1 CB
request on RDMA gets a GARBAGE_ARGS reply from the Linux client.
Commit 756b9b37cfb2e3dc added a line in bc_svc_process that
overwrites the incoming rq_rcv_buf's length with the value in
rq_private_buf.len. But rpcrdma_bc_receive_call() does not invoke
xprt_complete_bc_request(), thus rq_private_buf.len is not
initialized. svc_process_common() is invoked with a zero-length
RPC message, and fails.
Fixes: 756b9b37cfb2e3dc ('SUNRPC: Fix callback channel')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
If tcp_v4_inbound_md5_hash() returns an error, we must release
the refcount on the request socket, not on the listener.
The bug was added for IPv4 only.
Fixes: 079096f103fac ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There are some cases where rtt_us derives from deltas of jiffies,
instead of using usec timestamps.
Since we want to track minimal rtt, better to assume a delta of 0 jiffie
might be in fact be very close to 1 jiffie.
It is kind of sad jiffies_to_usecs(1) calls a function instead of simply
using a constant.
Fixes: f672258391b42 ("tcp: track min RTT using windowed min-filter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The phy has not been initialized, disconnecting it in the error
path results in a NULL pointer exception. Drop the phy_disconnect
from the error path.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In commit 5266698661401a ("tipc: let broadcast packet reception
use new link receive function") we introduced a new per-node
broadcast reception link instance. This link is created at the
moment the node itself is created. Unfortunately, the allocation
is done after the node instance has already been added to the node
lookup hash table. This creates a potential race condition, where
arriving broadcast packets are able to find and access the node
before it has been fully initialized, and before the above mentioned
link has been created. The result is occasional crashes in the function
tipc_bcast_rcv(), which is trying to access the not-yet existing link.
We fix this by deferring the addition of the node instance until after
it has been fully initialized in the function tipc_node_create().
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A recent change to the mdb code confused the compiler to the point
where it did not realize that the port-group returned from
br_mdb_add_group() is always valid when the function returns a nonzero
return value, so we get a spurious warning:
net/bridge/br_mdb.c: In function 'br_mdb_add':
net/bridge/br_mdb.c:542:4: error: 'pg' may be used uninitialized in this function [-Werror=maybe-uninitialized]
__br_mdb_notify(dev, entry, RTM_NEWMDB, pg);
Slightly rearranging the code in br_mdb_add_group() makes the problem
go away, as gcc is clever enough to see that both functions check
for 'ret != 0'.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 9e8430f8d60d ("bridge: mdb: Passing the port-group pointer to br_mdb module")
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch corrects the unaligned accesses seen on GRE TEB tunnels when
generating hash keys. Specifically what this patch does is make it so that
we force the use of skb_copy_bits when the GRE inner headers will be
unaligned due to NET_IP_ALIGNED being a non-zero value.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contain a rather large batch for your net that
includes accumulated bugfixes, they are:
1) Run conntrack cleanup from workqueue process context to avoid hitting
soft lockup via watchdog for large tables. This is required by the
IPv6 masquerading extension. From Florian Westphal.
2) Use original skbuff from nfnetlink batch when calling netlink_ack()
on error since this needs to access the skb->sk pointer.
3) Incremental fix on top of recent Sasha Levin's lock fix for conntrack
resizing.
4) Fix several problems in nfnetlink batch message header sanitization
and error handling, from Phil Turnbull.
5) Select NF_DUP_IPV6 based on CONFIG_IPV6, from Arnd Bergmann.
6) Fix wrong signess in return values on nf_tables counter expression,
from Anton Protopopov.
Due to the NetDev 1.1 organization burden, I had no chance to pass up
this to you any sooner in this release cycle, sorry about that.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The unix_dgram_sendmsg routine use the following test
if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
to determine if sk and other are in an n:1 association (either
established via connect or by using sendto to send messages to an
unrelated socket identified by address). This isn't correct as the
specified address could have been bound to the sending socket itself or
because this socket could have been connected to itself by the time of
the unix_peer_get but disconnected before the unix_state_lock(other). In
both cases, the if-block would be entered despite other == sk which
might either block the sender unintentionally or lead to trying to unlock
the same spin lock twice for a non-blocking send. Add a other != sk
check to guard against this.
Fixes: 7d267278a9ec ("unix: avoid use-after-free in ep_remove_wait_queue")
Reported-By: Philipp Hahn <pmhahn@pmhahn.de>
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Tested-by: Philipp Hahn <pmhahn@pmhahn.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The present unix_stream_read_generic contains various code sequences of
the form
err = -EDISASTER;
if (<test>)
goto out;
This has the unfortunate side effect of possibly causing the error code
to bleed through to the final
out:
return copied ? : err;
and then to be wrongly returned if no data was copied because the caller
didn't supply a data buffer, as demonstrated by the program available at
http://pad.lv/1540731
Change it such that err is only set if an error condition was detected.
Fixes: 3822b5c2fc62 ("af_unix: Revert 'lock_interruptible' in stream receive code")
Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|