summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2016-07-11ipv4: reject RTNH_F_DEAD and RTNH_F_LINKDOWN from user spaceJulian Anastasov
Vegard Nossum is reporting for a crash in fib_dump_info when nh_dev = NULL and fib_nhs == 1: Pid: 50, comm: netlink.exe Not tainted 4.7.0-rc5+ RIP: 0033:[<00000000602b3d18>] RSP: 0000000062623890 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 000000006261b800 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000024 RDI: 000000006245ba00 RBP: 00000000626238f0 R08: 000000000000029c R09: 0000000000000000 R10: 0000000062468038 R11: 000000006245ba00 R12: 000000006245ba00 R13: 00000000625f96c0 R14: 00000000601e16f0 R15: 0000000000000000 Kernel panic - not syncing: Kernel mode fault at addr 0x2e0, ip 0x602b3d18 CPU: 0 PID: 50 Comm: netlink.exe Not tainted 4.7.0-rc5+ #581 Stack: 626238f0 960226a02 00000400 000000fe 62623910 600afca7 62623970 62623a48 62468038 00000018 00000000 00000000 Call Trace: [<602b3e93>] rtmsg_fib+0xd3/0x190 [<602b6680>] fib_table_insert+0x260/0x500 [<602b0e5d>] inet_rtm_newroute+0x4d/0x60 [<60250def>] rtnetlink_rcv_msg+0x8f/0x270 [<60267079>] netlink_rcv_skb+0xc9/0xe0 [<60250d4b>] rtnetlink_rcv+0x3b/0x50 [<60265400>] netlink_unicast+0x1a0/0x2c0 [<60265e47>] netlink_sendmsg+0x3f7/0x470 [<6021dc9a>] sock_sendmsg+0x3a/0x90 [<6021e0d0>] ___sys_sendmsg+0x300/0x360 [<6021fa64>] __sys_sendmsg+0x54/0xa0 [<6021fac0>] SyS_sendmsg+0x10/0x20 [<6001ea68>] handle_syscall+0x88/0x90 [<600295fd>] userspace+0x3fd/0x500 [<6001ac55>] fork_handler+0x85/0x90 $ addr2line -e vmlinux -i 0x602b3d18 include/linux/inetdevice.h:222 net/ipv4/fib_semantics.c:1264 Problem happens when RTNH_F_LINKDOWN is provided from user space when creating routes that do not use the flag, catched with netlink fuzzer. Currently, the kernel allows user space to set both flags to nh_flags and fib_flags but this is not intentional, the assumption was that they are not set. Fix this by rejecting both flags with EINVAL. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Fixes: 0eeb075fad73 ("net: ipv4 sysctl option to ignore routes when nexthop link is down") Signed-off-by: Julian Anastasov <ja@ssi.bg> Cc: Andy Gospodarek <gospo@cumulusnetworks.com> Cc: Dinesh Dutt <ddutt@cumulusnetworks.com> Cc: Scott Feldman <sfeldma@gmail.com> Reviewed-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11tcp: make challenge acks less predictableEric Dumazet
Yue Cao claims that current host rate limiting of challenge ACKS (RFC 5961) could leak enough information to allow a patient attacker to hijack TCP sessions. He will soon provide details in an academic paper. This patch increases the default limit from 100 to 1000, and adds some randomization so that the attacker can no longer hijack sessions without spending a considerable amount of probes. Based on initial analysis and patch from Linus. Note that we also have per socket rate limiting, so it is tempting to remove the host limit in the future. v2: randomize the count of challenge acks per second, not the period. Fixes: 282f23c6ee34 ("tcp: implement RFC 5961 3.2") Reported-by: Yue Cao <ycao009@ucr.edu> Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11tunnels: correct conditional build of MPLS and IPv6Simon Horman
Using a combination if #if conditionals and goto labels to unwind tunnel4_init seems unwieldy. This patch takes a simpler approach of directly unregistering previously registered protocols when an error occurs. This fixes a number of problems with the current implementation including the potential presence of labels when they are unused and the potential absence of unregister code when it is needed. Fixes: 8afe97e5d416 ("tunnels: support MPLS over IPv4 tunnels") Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11sctp: implement prsctp PRIO policyXin Long
prsctp PRIO policy is a policy to abandon lower priority chunks when asoc doesn't have enough snd buffer, so that the current chunk with higher priority can be queued successfully. Similar to TTL/RTX policy, we will set the priority of the chunk to prsctp_param with sinfo->sinfo_timetolive in sctp_set_prsctp_policy(). So if PRIO policy is enabled, msg->expire_at won't work. asoc->sent_cnt_removable will record how many chunks can be checked to remove. If priority policy is enabled, when the chunk is queued into the out_queue, we will increase sent_cnt_removable. When the chunk is moved to abandon_queue or dequeue and free, we will decrease sent_cnt_removable. In sctp_sendmsg, we will check if there is enough snd buffer for current msg and if sent_cnt_removable is not 0. Then try to abandon chunks in sctp_prune_prsctp when sendmsg from the retransmit/transmited queue, and free chunks from out_queue in right order until the abandon+free size > msg_len - sctp_wfree. For the abandon size, we have to wait until it sends FORWARD TSN, receives the sack and the chunks are really freed. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11sctp: implement prsctp RTX policyXin Long
prsctp RTX policy is a policy to abandon chunks when they are retransmitted beyond the max count. This patch uses sent_count to count how many times one chunk has been sent, and prsctp_param is the max rtx count, which is from sinfo->sinfo_timetolive in sctp_set_prsctp_policy(). So similar to TTL policy, if RTX policy is enabled, msg->expire_at won't work. Then in sctp_chunk_abandoned, this patch checks if chunk->sent_count is bigger than chunk->prsctp_param to abandon this chunk. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11sctp: implement prsctp TTL policyXin Long
prsctp TTL policy is a policy to abandon chunks when they expire at the specific time in local stack. It's similar with expires_at in struct sctp_datamsg. This patch uses sinfo->sinfo_timetolive to set the specific time for TTL policy. sinfo->sinfo_timetolive is also used for msg->expires_at. So if prsctp_enable or TTL policy is not enabled, msg->expires_at still works as before. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11sctp: add SCTP_PR_ASSOC_STATUS on sctp sockoptXin Long
This patch adds SCTP_PR_ASSOC_STATUS to sctp sockopt, which is used to dump the prsctp statistics info from the asoc. The prsctp statistics includes abandoned_sent/unsent from the asoc. abandoned_sent is the count of the packets we drop packets from retransmit/transmited queue, and abandoned_unsent is the count of the packets we drop from out_queue according to the policy. Note: another option for prsctp statistics dump described in rfc is SCTP_PR_STREAM_STATUS, which is used to dump the prsctp statistics info from each stream. But by now, linux doesn't yet have per stream statistics info, it needs rfc6525 to be implemented. As the prsctp statistics for each stream has to be based on per stream statistics, we will delay it until rfc6525 is done in linux. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11sctp: add SCTP_DEFAULT_PRINFO into sctp sockoptXin Long
This patch adds SCTP_DEFAULT_PRINFO to sctp sockopt. It is used to set/get sctp Partially Reliable Policies' default params, which includes 3 policies (ttl, rtx, prio) and their values. Still, if we set policy params in sndinfo, we will use the params of sndinfo against chunks, instead of the default params. In this patch, we will use 5-8bit of sp/asoc->default_flags to store prsctp policies, and reuse asoc->default_timetolive to store their values. It means if we enable and set prsctp policy, prior ttl timeout in sctp will not work any more. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11sctp: add SCTP_PR_SUPPORTED on sctp sockoptXin Long
According to section 4.5 of rfc7496, prsctp_enable should be per asoc. We will add prsctp_enable to both asoc and ep, and replace the places where it used net.sctp->prsctp_enable with asoc->prsctp_enable. ep->prsctp_enable will be initialized with net.sctp->prsctp_enable, and asoc->prsctp_enable will be initialized with ep->prsctp_enable. We can also modify it's value through sockopt SCTP_PR_SUPPORTED. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11NFS: Don't drop CB requests with invalid principalsChuck Lever
Before commit 778be232a207 ("NFS do not find client in NFSv4 pg_authenticate"), the Linux callback server replied with RPC_AUTH_ERROR / RPC_AUTH_BADCRED, instead of dropping the CB request. Let's restore that behavior so the server has a chance to do something useful about it, and provide a warning that helps admins correct the problem. Fixes: 778be232a207 ("NFS do not find client in NFSv4 ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11svc: Avoid garbage replies when pc_func() returns rpc_drop_replyChuck Lever
If an RPC program does not set vs_dispatch and pc_func() returns rpc_drop_reply, the server sends a reply anyway containing a single word containing the value RPC_DROP_REPLY (in network byte-order, of course). This is a nonsense RPC message. Fixes: 9e701c610923 ("svcrpc: simpler request dropping") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: No direct data placement with krb5i and krb5pChuck Lever
Direct data placement is not allowed when using flavors that guarantee integrity or privacy. When such security flavors are in effect, don't allow the use of Read and Write chunks for moving individual data items. All messages larger than the inline threshold are sent via Long Call or Long Reply. On my systems (CX-3 Pro on FDR), for small I/O operations, the use of Long messages adds only around 5 usecs of latency in each direction. Note that when integrity or encryption is used, the host CPU touches every byte in these messages. Even if it could be used, data movement offload doesn't buy much in this case. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Clean up fixup_copy_count accountingChuck Lever
fixup_copy_count should count only the number of bytes copied to the page list. The head and tail are now always handled without a data copy. And the debugging at the end of rpcrdma_inline_fixup() is also no longer necessary, since copy_len will be non-zero when there is reply data in the tail (a normal and valid case). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Update only specific fields in private receive bufferChuck Lever
Now that rpcrdma_inline_fixup() updates only two fields in rq_rcv_buf, a full memcpy of that structure to rq_private_buf is unwarranted. Updating rq_private_buf fields only where needed also better documents what is going on. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()Chuck Lever
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ operations failed. After the client unwrapped the NFS READ reply message, the NFS READ XDR decoder was not able to decode the reply. The message was "Server cheating in reply", with the reported number of received payload bytes being zero. Applications reported a read(2) that returned -1/EIO. The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero when the incoming reply fits entirely in the head iovec. The zero tail.iov_len confused xdr_buf_trim(), which then mangled the actual reply data instead of simply removing the trailing GSS checksum. As near as I can tell, RPC transports are not supposed to update the head.iov_len, page_len, or tail.iov_len fields in the receive XDR buffer when handling an incoming RPC reply message. These fields contain the length of each component of the XDR buffer, and hence the maximum number of bytes of reply data that can be stored in each XDR buffer component. I've concluded this because: - This is how xdr_partial_copy_from_skb() appears to behave - rpcrdma_inline_fixup() already does not alter page_len - call_decode() compares rq_private_buf and rq_rcv_buf and WARNs if they are not exactly the same Unfortunately, as soon as I tried the simple fix to just remove the line that sets tail.iov_len to zero, I saw that the logic that appends the implicit Write chunk pad inline depends on inline_fixup setting tail.iov_len to zero. To address this, re-organize the tail iovec handling logic to use the same approach as with the head iovec: simply point tail.iov_base to the correct bytes in the receive buffer. While I remember all this, write down the conclusion in documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: rpcrdma_inline_fixup() overruns the receive page listChuck Lever
When the remaining length of an incoming reply is longer than the XDR buf's page_len, switch over to the tail iovec instead of copying more than page_len bytes into the page list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Chunk list encoders no longer share one rl_segments arrayChuck Lever
Currently, all three chunk list encoders each use a portion of the one rl_segments array in rpcrdma_req. This is because the MWs for each chunk list were preserved in rl_segments so that ro_unmap could find and invalidate them after the RPC was complete. However, now that MWs are placed on a per-req linked list as they are registered, there is no longer any information in rpcrdma_mr_seg that is shared between ro_map and ro_unmap_{sync,safe}, and thus nothing in rl_segments needs to be preserved after rpcrdma_marshal_req is complete. Thus the rl_segments array can be used now just for the needs of each rpcrdma_convert_iovs call. Once each chunk list is encoded, the next chunk list encoder is free to re-use all of rl_segments. This means all three chunk lists in one RPC request can now each encode a full size data payload with no increase in the size of rl_segments. This is a key requirement for Kerberos support, since both the Call and Reply for a single RPC transaction are conveyed via Long messages (RDMA Read/Write). Both can be large. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Place registered MWs on a per-req listChuck Lever
Instead of placing registered MWs sparsely into the rl_segments array, place these MWs on a per-req list. ro_unmap_{sync,safe} can then simply pull those MWs off the list instead of walking through the array. This change significantly reduces the size of struct rpcrdma_req by removing nsegs and rl_mw from every array element. As an additional clean-up, chunk co-ordinates are returned in the "*mw" output argument so they are no longer needed in every array element. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Release orphaned MRs immediatelyChuck Lever
Instead of leaving orphaned MRs to be released when the transport is destroyed, release them immediately. The MR free list can now be replenished if it becomes exhausted. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Allocate MRs on demandChuck Lever
Frequent MR list exhaustion can impact I/O throughput, so enough MRs are always created during transport set-up to prevent running out. This means more MRs are created than most workloads need. Commit 94f58c58c0b4 ("xprtrdma: Allow Read list and Reply chunk simultaneously") introduced support for sending two chunk lists per RPC, which consumes more MRs per RPC. Instead of trying to provision more MRs, introduce a mechanism for allocating MRs on demand. A few MRs are allocated during transport set-up to kick things off. This significantly reduces the average number of MRs per transport while allowing the MR count to grow for workloads or devices that need more MRs. FRWR with mlx4 allocated almost 400 MRs per transport before this patch. Now it starts with 32. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Chunk list encoders must not return zeroChuck Lever
Clean up, based on code audit: Remove the possibility that the chunk list XDR encoders can return zero, which would be interpreted as a NULL. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Honor ->send_request API contractChuck Lever
Commit c93c62231cf5 ("xprtrdma: Disconnect on registration failure") added a disconnect for some RPC marshaling failures. This is needed only in a handful of cases, but it was triggering for simple stuff like temporary resource shortages. Try to straighten this out. Fix up the lower layers so they don't return -ENOMEM or other error codes that the RPC client's FSM doesn't explicitly recognize. Also fix up the places in the send_request path that do want a disconnect. For example, when ib_post_send or ib_post_recv fail, this is a sign that there is a send or receive queue resource miscalculation. That should be rare, and is a sign of a software bug. But xprtrdma can recover: disconnect to reset the transport and start over. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Reply buffer exhaustion can be catastrophicChuck Lever
Not having an rpcrdma_rep at call_allocate time can be a problem. It means that send_request can't post a receive buffer to catch the RPC's reply. Possible consequences are RPC timeouts or even transport deadlock. Instead of allowing an RPC to proceed if an rpcrdma_rep is not available, return NULL to force call_allocate to wait and try again. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Clean up device capability detectionChuck Lever
Clean up: Move device capability detection into memreg-specific source files. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Remove rpcrdma_map_one() and friendsChuck Lever
Clean up: ALLPHYSICAL is gone and FMR has been converted to use scatterlists. There are no more users of these functions. This patch shrinks the size of struct rpcrdma_req by about 3500 bytes on x86_64. There is one of these structs for each RPC credit (128 credits per transport connection). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Remove ALLPHYSICAL memory registration modeChuck Lever
No HCA or RNIC in the kernel tree requires the use of ALLPHYSICAL. ALLPHYSICAL advertises in the clear on the network fabric an R_key that is good for all of the client's memory. No known exploit exists, but theoretically any user on the server can use that R_key on the client's QP to read or update any part of the client's memory. ALLPHYSICAL exposes the client to server bugs, including: o base/bounds errors causing data outside the i/o buffer to be accessed o RDMA access after reply causing data corruption and/or integrity fail ALLPHYSICAL can't protect application memory regions from server update after a local signal or soft timeout has terminated an RPC. ALLPHYSICAL chunks are no larger than a page. Special cases to handle small chunks and long chunk lists have been a source of implementation complexity and bugs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Do not leak an MW during a DMA map failureChuck Lever
Based on code audit. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Refactor MR recovery work queuesChuck Lever
I found that commit ead3f26e359e ("xprtrdma: Add ro_unmap_safe memreg method"), which introduces ro_unmap_safe, never wired up the FMR recovery worker. The FMR and FRWR recovery work queues both do the same thing. Instead of setting up separate individual work queues for this, schedule a delayed worker to deal with them, since recovering MRs is not performance-critical. Fixes: ead3f26e359e ("xprtrdma: Add ro_unmap_safe memreg method") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Use scatterlist for DMA mapping and unmapping under FMRChuck Lever
The use of a scatterlist for handling DMA mapping and unmapping was recently introduced in frwr_ops.c in commit 4143f34e01e9 ("xprtrdma: Port to new memory registration API"). That commit did not make a similar update to xprtrdma's FMR support because the core ib_map_phys_fmr() and ib_unmap_fmr() APIs have not been changed to take a scatterlist argument. However, FMR still needs to do DMA mapping and unmapping. It appears that RDS, for example, uses a scatterlist for this, then builds the DMA addr array for the ib_map_phys_fmr call separately. I see that SRP also utilizes a scatterlist for DMA mapping. xprtrdma can do something similar. This modernization is used immediately to properly defer DMA unmapping during fmr_unmap_safe (a FIXME). It separates the DMA unmapping coordinates from the rl_segments array. This array, being part of an rpcrdma_req, is always re-used immediately when an RPC exits. A scatterlist is allocated in memory independent of the rl_segments array, so it can be preserved indefinitely (ie, until the MR invalidation and DMA unmapping can actually be done by a worker thread). The FRWR and FMR DMA mapping code are slightly different from each other now, and will diverge further when the "Check for holes" logic can be removed from FRWR (support for SG_GAP MRs). So I chose not to create helpers for the common-looking code. Fixes: ead3f26e359e ("xprtrdma: Add ro_unmap_safe memreg method") Suggested-by: Sagi Grimberg <sagi@lightbits.io> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Rename fields in rpcrdma_fmrChuck Lever
Clean up: Use the same naming convention used in other RPC/RDMA-related data structures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Move init and release helpersChuck Lever
Clean up: Moving these helpers in a separate patch makes later patches more readable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Create common scatterlist fields in rpcrdma_mwChuck Lever
Clean up: FMR is about to replace the rpcrdma_map_one code with scatterlists. Move the scatterlist fields out of the FRWR-specific union and into the generic part of rpcrdma_mw. One minor change: -EIO is now returned if FRWR registration fails. The RPC is terminated immediately, since the problem is likely due to a software bug, thus retrying likely won't help. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11xprtrdma: Remove FMRs from the unmap list after unmappingChuck Lever
ib_unmap_fmr() takes a list of FMRs to unmap. However, it does not remove the FMRs from this list as it processes them. Other ib_unmap_fmr() call sites are careful to remove FMRs from the list after ib_unmap_fmr() returns. Since commit 7c7a5390dc6c8 ("xprtrdma: Add ro_unmap_sync method for FMR") fmr_op_unmap_sync passes more than one FMR to ib_unmap_fmr(), but it didn't bother to remove the FMRs from that list once the call was complete. I've noticed some instability that could be related to list tangling by the new fmr_op_unmap_sync() logic. In an abundance of caution, add some defensive logic to clean up properly after ib_unmap_fmr(). Fixes: 7c7a5390dc6c8 ("xprtrdma: Add ro_unmap_sync method for FMR") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11udp: prevent bugcheck if filter truncates packet too muchMichal Kubeček
If socket filter truncates an udp packet below the length of UDP header in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if kernel is configured that way) can be easily enforced by an unprivileged user which was reported as CVE-2016-6162. For a reproducer, see http://seclists.org/oss-sec/2016/q3/8 Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Reported-by: Marco Grassi <marco.gra@gmail.com> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11Merge tag 'batadv-net-for-davem-20160708' of git://git.open-mesh.org/linux-mergeDavid S. Miller
Simon Wunderlich says: ==================== Here are a couple batman-adv bugfix patches, all by Sven Eckelmann: - Fix possible NULL pointer dereference for vlan_insert_tag (two patches) - Fix reference handling in some features, which may lead to reference leaks or invalid memory access (four patches) - Fix speedy join: DHCP packets handled by the gateway feature should be sent with 4-address unicast instead of 3-address unicast to make speedy join work. This fixes/speeds up DHCP assignment for clients which join a mesh for the first time. (one patch) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11netfilter: nf_conntrack_h323: fix off-by-one in DecodeQ931Toby DiPasquale
This patch corrects an off-by-one error in the DecodeQ931 function in the nf_conntrack_h323 module. This error could result in reading off the end of a Q.931 frame. Signed-off-by: Toby DiPasquale <toby@cbcg.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11Merge tag 'ipvs-for-v4.8' of ↵Pablo Neira Ayuso
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== IPVS Updates for v4.8 please consider these enhancements to the IPVS. This alters the behaviour of the "least connection" schedulers such that pre-established connections are included in the active connection count. This avoids overloading servers when a large number of new connections arrive in a short space of time - e.g. when clients reconnect after a node or network failure. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: nf_tables: get rid of possible_net_t from set and basechainPablo Neira Ayuso
We can pass the netns pointer as parameter to the functions that need to gain access to it. From basechains, I didn't find any client for this field anymore so let's remove this too. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: nft_ct: make byte/packet expr more friendlyLiping Zhang
If we want to use ct packets expr, and add a rule like follows: # nft add rule filter input ct packets gt 1 counter We will find that no packets will hit it, because nf_conntrack_acct is disabled by default. So It will not work until we enable it manually via "echo 1 > /proc/sys/net/netfilter/nf_conntrack_acct". This is not friendly, so like xt_connbytes do, if the user want to use ct byte/packet expr, enable nf_conntrack_acct automatically. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: physdev: physdev-is-out should not work with OUTPUT chainHangbin Liu
physdev_mt() will check skb->nf_bridge first, which was alloced in br_nf_pre_routing. So if we want to use --physdev-out and physdev-is-out, we need to match it in FORWARD or POSTROUTING chain. physdev_mt_check() only checked physdev-out and missed physdev-is-out. Fix it and update the debug message to make it clearer. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Marcelo R Leitner <marcelo.leitner@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: nat: convert nat bysrc hash to rhashtableFlorian Westphal
It did use a fixed-size bucket list plus single lock to protect add/del. Unlike the main conntrack table we only need to add and remove keys. Convert it to rhashtable to get table autosizing and per-bucket locking. The maximum number of entries is -- as before -- tied to the number of conntracks so we do not need another upperlimit. The change does not handle rhashtable_remove_fast error, only possible "error" is -ENOENT, and that is something that can happen legitimetely, e.g. because nat module was inserted at a later time and no src manip took place yet. Tested with http-client-benchmark + httpterm with DNAT and SNAT rules in place. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11Merge tag 'ipvs-fixes2-for-v4.7' of ↵Pablo Neira Ayuso
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs Simon Horman says: ==================== Second Round of IPVS Fixes for v4.7 The fix from Quentin Armitage allows the backup sync daemon to be bound to a link-local mcast IPv6 address as is already the case for IPv4. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: move nat hlist_head to nf_connFlorian Westphal
The nat extension structure is 32bytes in size on x86_64: struct nf_conn_nat { struct hlist_node bysource; /* 0 16 */ struct nf_conn * ct; /* 16 8 */ union nf_conntrack_nat_help help; /* 24 4 */ int masq_index; /* 28 4 */ /* size: 32, cachelines: 1, members: 4 */ /* last cacheline: 32 bytes */ }; The hlist is needed to quickly check for possible tuple collisions when installing a new nat binding. Storing this in the extension area has two drawbacks: 1. We need ct backpointer to get the conntrack struct from the extension. 2. When reallocation of extension area occurs we need to fixup the bysource hash head via hlist_replace_rcu. We can avoid both by placing the hlist_head in nf_conn and place nf_conn in the bysource hash rather than the extenstion. We can also remove the ->move support; no other extension needs it. Moving the entire nat extension into nf_conn would be possible as well but then we have to add yet another callback for deletion from the bysource hash table rather than just using nat extension ->destroy hook for this. nf_conn size doesn't increase due to aligment, followup patch replaces hlist_node with single pointer. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: conntrack: simplify early_dropFlorian Westphal
We don't need to acquire the bucket lock during early drop, we can use lockless traveral just like ____nf_conntrack_find. The timer deletion serves as synchronization point, if another cpu attempts to evict same entry, only one will succeed with timer deletion. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: nf_ct_helper: unlink helper again when hash resize happenLiping Zhang
From: Liping Zhang <liping.zhang@spreadtrum.com> Similar to ctnl_untimeout, when hash resize happened, we should try to do unhelp from the 0# bucket again. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: cttimeout: unlink timeout obj again when hash resize happenLiping Zhang
Imagine such situation, nf_conntrack_htable_size now is 4096, we are doing ctnl_untimeout, and iterate on 3000# bucket. Meanwhile, another user try to reduce hash size to 2048, then all nf_conn are removed to the new hashtable. When this hash resize operation finished, we still try to itreate ct begin from 3000# bucket, find nothing to do and just return. We may miss unlinking some timeout objects. And later we will end up with invalid references to timeout object that are already gone. So when we find that hash resize happened, try to unlink timeout objects from the 0# bucket again. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: conntrack: fix race between nf_conntrack proc read and hash resizeLiping Zhang
When we do "cat /proc/net/nf_conntrack", and meanwhile resize the conntrack hash table via /sys/module/nf_conntrack/parameters/hashsize, race will happen, because reader can observe a newly allocated hash but the old size (or vice versa). So oops will happen like follows: BUG: unable to handle kernel NULL pointer dereference at 0000000000000017 IP: [<ffffffffa0418e21>] seq_print_acct+0x11/0x50 [nf_conntrack] Call Trace: [<ffffffffa0412f4e>] ? ct_seq_show+0x14e/0x340 [nf_conntrack] [<ffffffff81261a1c>] seq_read+0x2cc/0x390 [<ffffffff812a8d62>] proc_reg_read+0x42/0x70 [<ffffffff8123bee7>] __vfs_read+0x37/0x130 [<ffffffff81347980>] ? security_file_permission+0xa0/0xc0 [<ffffffff8123cf75>] vfs_read+0x95/0x140 [<ffffffff8123e475>] SyS_read+0x55/0xc0 [<ffffffff817c2572>] entry_SYSCALL_64_fastpath+0x1a/0xa4 It is very easy to reproduce this kernel crash. 1. open one shell and input the following cmds: while : ; do echo $RANDOM > /sys/module/nf_conntrack/parameters/hashsize done 2. open more shells and input the following cmds: while : ; do cat /proc/net/nf_conntrack done 3. just wait a monent, oops will happen soon. The solution in this patch is based on Florian's Commit 5e3c61f98175 ("netfilter: conntrack: fix lookup race during hash resize"). And add a wrapper function nf_conntrack_get_ht to get hash and hsize suggested by Florian Westphal. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11NFC: digital: Fix RTOX supervisor PDU handlingThierry Escande
When the target needs more time to process the received PDU, it sends Response Timeout Extension (RTOX) PDU. When the initiator receives a RTOX PDU, it must reply with a RTOX PDU and extends the current rwt value with the formula: rwt_int = rwt * rtox This patch takes care of the rtox value passed by the target in the RTOX PDU and extends the timeout for the next response accordingly. Signed-off-by: Thierry Escande <thierry.escande@collabora.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11NFC: digital: Add support for NFC DEP Response Waiting TimeThierry Escande
When sending an ATR_REQ, the initiator must wait for the ATR_RES at least 'RWT(nfcdep,activation) + dRWT(nfcdep)' and no more than 'RWT(nfcdep,activation) + dRWT(nfcdep) + dT(nfcdep,initiator)'. This gives a timeout value between 1237 ms and 1337 ms. This patch defines DIGITAL_ATR_RES_RWT to 1337 used for the timeout value of ATR_REQ command. For other DEP PDUs, the initiator must wait between 'RWT + dRWT(nfcdep)' and 'RWT + dRWT(nfcdep) + dT(nfcdep,initiator)' where RWT is given by the following formula: '(256 * 16 / f(c)) * 2^wt' where wt is the value of the TO field in the ATR_RES response and is in the range between 0 and 14. This patch declares a mapping table for wt values and gives RWT max values between 100 ms and 5049 ms. This patch also defines DIGITAL_ATR_RES_TO_WT, the maximum wt value in target mode, to 8. Signed-off-by: Thierry Escande <thierry.escande@collabora.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11NFC: digital: Free supervisor PDUsThierry Escande
This patch frees the RTOX resp sk_buff in initiator mode. It also makes use of the free_resp exit point for ATN supervisor PDUs in both initiator and target mode. Signed-off-by: Thierry Escande <thierry.escande@collabora.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>