Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"Namhyung Kim discovered a use after free bug. It has to do with adding
a pid filter to function tracing in an instance, and then freeing the
instance"
* tag 'trace-v4.11-rc5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix function pid filter on instances
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"This fixes the following problems:
- regression in new XTS/LRW code when used with async crypto
- long-standing bug in ahash API when used with certain algos
- bogus memory dereference in async algif_aead with certain algos"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: algif_aead - Fix bogus request dereference in completion function
crypto: ahash - Fix EINPROGRESS notification callback
crypto: lrw - Fix use-after-free on EINPROGRESS
crypto: xts - Fix use-after-free on EINPROGRESS
|
|
Like event pid filtering test, add function pid filtering test with the
new "function-fork" option. It also tests it on an instance directory
so that it can verify the bug related pid filtering on instances.
Link: http://lkml.kernel.org/r/20170417024430.21194-5-namhyung@kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
When a blkfront device is resized from dom0, emit a KOBJ_CHANGE uevent to
notify the guest about the change. This allows for custom udev rules, such
as automatically resizing a filesystem, when an event occurs.
With this patch you get these udev
KERNEL[577.206230] change /devices/vbd-51728/block/xvdb (block)
UDEV [577.226218] change /devices/vbd-51728/block/xvdb (block)
Signed-off-by: Marc Olson <marcolso@amazon.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
This fixes CVE-2017-7472.
Running the following program as an unprivileged user exhausts kernel
memory by leaking thread keyrings:
#include <keyutils.h>
int main()
{
for (;;)
keyctl_set_reqkey_keyring(KEY_REQKEY_DEFL_THREAD_KEYRING);
}
Fix it by only creating a new thread keyring if there wasn't one before.
To make things more consistent, make install_thread_keyring_to_cred()
and install_process_keyring_to_cred() both return 0 if the corresponding
keyring is already present.
Fixes: d84f4f992cbd ("CRED: Inaugurate COW credentials")
Cc: stable@vger.kernel.org # 2.6.29+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
This fixes CVE-2017-6951.
Userspace should not be able to do things with the "dead" key type as it
doesn't have some of the helper functions set upon it that the kernel
needs. Attempting to use it may cause the kernel to crash.
Fix this by changing the name of the type to ".dead" so that it's rejected
up front on userspace syscalls by key_get_type_from_user().
Though this doesn't seem to affect recent kernels, it does affect older
ones, certainly those prior to:
commit c06cfb08b88dfbe13be44a69ae2fdc3a7c902d81
Author: David Howells <dhowells@redhat.com>
Date: Tue Sep 16 17:36:06 2014 +0100
KEYS: Remove key_type::match in favour of overriding default by match_preparse
which went in before 3.18-rc1.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: stable@vger.kernel.org
|
|
This fixes CVE-2016-9604.
Keyrings whose name begin with a '.' are special internal keyrings and so
userspace isn't allowed to create keyrings by this name to prevent
shadowing. However, the patch that added the guard didn't fix
KEYCTL_JOIN_SESSION_KEYRING. Not only can that create dot-named keyrings,
it can also subscribe to them as a session keyring if they grant SEARCH
permission to the user.
This, for example, allows a root process to set .builtin_trusted_keys as
its session keyring, at which point it has full access because now the
possessor permissions are added. This permits root to add extra public
keys, thereby bypassing module verification.
This also affects kexec and IMA.
This can be tested by (as root):
keyctl session .builtin_trusted_keys
keyctl add user a a @s
keyctl list @s
which on my test box gives me:
2 keys in keyring:
180010936: ---lswrv 0 0 asymmetric: Build time autogenerated kernel key: ae3d4a31b82daa8e1a75b49dc2bba949fd992a05
801382539: --alswrv 0 0 user: a
Fix this by rejecting names beginning with a '.' in the keyctl.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
cc: linux-ima-devel@lists.sourceforge.net
cc: stable@vger.kernel.org
|
|
Prior to commit 2337d207288f ("powerpc/64: CONFIG_RELOCATABLE support for hmi
interrupts"), the branch from hmi_exception_early() to hmi_exception_realmode()
was just a bl hmi_exception_realmode, which the linker would turn into a bl to
the local entry point of hmi_exception_realmode. This was broken when
CONFIG_RELOCATABLE=y because hmi_exception_realmode() is not in the low part of
the kernel text that is copied down to 0x0.
But in fixing that, we added a new bug on little endian kernels. Because the
branch is now a bctrl when CONFIG_RELOCATABLE=y, we branch to the global entry
point of hmi_exception_realmode(). The global entry point must be called with
r12 containing the address of hmi_exception_realmode(), because it uses that
value to calculate the TOC value (r2).
This may manifest as a checkstop, because we take a junk value from r12 which
came from HSRR1, add a small constant to it and then use that as the TOC
pointer. The HSRR1 value will have 0x9 as the top nibble, which puts it above
RAM and somewhere in MMIO space.
Fix it by changing the BRANCH_LINK_TO_FAR() macro to always use r12 to load the
label we're branching to. This means r12 will be setup correctly on LE, fixing
this bug, and r12 is also volatile across function calls on BE so it's a good
choice anyway.
Fixes: 2337d207288f ("powerpc/64: CONFIG_RELOCATABLE support for hmi interrupts")
Reported-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
If we set a kprobe on a 'stdu' instruction on powerpc64, we see a kernel
OOPS:
Bad kernel stack pointer cd93c840 at c000000000009868
Oops: Bad kernel stack pointer, sig: 6 [#1]
...
GPR00: c000001fcd93cb30 00000000cd93c840 c0000000015c5e00 00000000cd93c840
...
NIP [c000000000009868] resume_kernel+0x2c/0x58
LR [c000000000006208] program_check_common+0x108/0x180
On a 64-bit system when the user probes on a 'stdu' instruction, the kernel does
not emulate actual store in emulate_step() because it may corrupt the exception
frame. So the kernel does the actual store operation in exception return code
i.e. resume_kernel().
resume_kernel() loads the saved stack pointer from memory using lwz, which only
loads the low 32-bits of the address, causing the kernel crash.
Fix this by loading the 64-bit value instead.
Fixes: be96f63375a1 ("powerpc: Split out instruction analysis part of emulate_step()")
Cc: stable@vger.kernel.org # v3.18+
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
[mpe: Change log massage, add stable tag]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The parsing of sadb_x_ipsecrequest is broken in a number of ways.
First of all we're not verifying sadb_x_ipsecrequest_len. This
is needed when the structure carries addresses at the end. Worse
we don't even look at the length when we parse those optional
addresses.
The migration code had similar parsing code that's better but
it also has some deficiencies. The length is overcounted first
of all as it includes the header itself. It also fails to check
the length before dereferencing the sa_family field.
This patch fixes those problems in parse_sockaddr_pair and then
uses it in parse_ipsecrequest.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fix from Helge Deller:
"One patch which fixes get_user() for 64-bit values on 32-bit kernels.
Up to now we lost the upper 32-bits of the returned 64-bit value"
* 'parisc-4.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix get_user() for 64-bit value on 32-bit kernel
|
|
commit 4fcd1813e640 ("Fix reconnect to not defer smb3 session reconnect
long after socket reconnect") added support for Negotiate requests to
be initiated by echo calls.
To avoid delays in calling echo after a reconnect, I added the patch
introduced by the commit b8c600120fc8 ("Call echo service immediately
after socket reconnect").
This has however caused a regression with cifs shares which do not have
support for echo calls to trigger Negotiate requests. On connections
which need to call Negotiation, the echo calls trigger an error which
triggers a reconnect which in turn triggers another echo call. This
results in a loop which is only broken when an operation is performed on
the cifs share. For an idle share, it can DOS a server.
The patch uses the smb_operation can_echo() for cifs so that it is
called only if connection has been already been setup.
kernel bz: 194531
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Tested-by: Jonathan Liu <net147@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
When function tracer has a pid filter, it adds a probe to sched_switch
to track if current task can be ignored. The probe checks the
ftrace_ignore_pid from current tr to filter tasks. But it misses to
delete the probe when removing an instance so that it can cause a crash
due to the invalid tr pointer (use-after-free).
This is easily reproducible with the following:
# cd /sys/kernel/debug/tracing
# mkdir instances/buggy
# echo $$ > instances/buggy/set_ftrace_pid
# rmdir instances/buggy
============================================================================
BUG: KASAN: use-after-free in ftrace_filter_pid_sched_switch_probe+0x3d/0x90
Read of size 8 by task kworker/0:1/17
CPU: 0 PID: 17 Comm: kworker/0:1 Tainted: G B 4.11.0-rc3 #198
Call Trace:
dump_stack+0x68/0x9f
kasan_object_err+0x21/0x70
kasan_report.part.1+0x22b/0x500
? ftrace_filter_pid_sched_switch_probe+0x3d/0x90
kasan_report+0x25/0x30
__asan_load8+0x5e/0x70
ftrace_filter_pid_sched_switch_probe+0x3d/0x90
? fpid_start+0x130/0x130
__schedule+0x571/0xce0
...
To fix it, use ftrace_clear_pids() to unregister the probe. As
instance_rmdir() already updated ftrace codes, it can just free the
filter safely.
Link: http://lkml.kernel.org/r/20170417024430.21194-2-namhyung@kernel.org
Fixes: 0c8916c34203 ("tracing: Add rmdir to remove multibuffer instances")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Daniel Borkmann says:
====================
Two BPF fixes
The set fixes cb_access and xdp_adjust_head bits in struct bpf_prog,
that are used for requirement checks on the program rather than f.e.
heuristics. Thus, for tail calls, we cannot make any assumptions and
are forced to set them.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 17bedab27231 ("bpf: xdp: Allow head adjustment in XDP prog")
added the xdp_adjust_head bit to the BPF prog in order to tell drivers
that the program that is to be attached requires support for the XDP
bpf_xdp_adjust_head() helper such that drivers not supporting this
helper can reject the program. There are also drivers that do support
the helper, but need to check for xdp_adjust_head bit in order to move
packet metadata prepended by the firmware away for making headroom.
For these cases, the current check for xdp_adjust_head bit is insufficient
since there can be cases where the program itself does not use the
bpf_xdp_adjust_head() helper, but tail calls into another program that
uses bpf_xdp_adjust_head(). As such, the xdp_adjust_head bit is still
set to 0. Since the first program has no control over which program it
calls into, we need to assume that bpf_xdp_adjust_head() helper is used
upon tail calls. Thus, for the very same reasons in cb_access, set the
xdp_adjust_head bit to 1 when the main program uses tail calls.
Fixes: 17bedab27231 ("bpf: xdp: Allow head adjustment in XDP prog")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit ff936a04e5f2 ("bpf: fix cb access in socket filter programs")
added a fix for socket filter programs such that in i) AF_PACKET the
20 bytes of skb->cb[] area gets zeroed before use in order to not leak
data, and ii) socket filter programs attached to TCP/UDP sockets need
to save/restore these 20 bytes since they are also used by protocol
layers at that time.
The problem is that bpf_prog_run_save_cb() and bpf_prog_run_clear_cb()
only look at the actual attached program to determine whether to zero
or save/restore the skb->cb[] parts. There can be cases where the
actual attached program does not access the skb->cb[], but the program
tail calls into another program which does access this area. In such
a case, the zero or save/restore is currently not performed.
Since the programs we tail call into are unknown at verification time
and can dynamically change, we need to assume that whenever the attached
program performs a tail call, that later programs could access the
skb->cb[], and therefore we need to always set cb_access to 1.
Fixes: ff936a04e5f2 ("bpf: fix cb access in socket filter programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We lack a saddr check for ::1. This causes security issues e.g. with acls
permitting connections from ::1 because of assumption that these originate
from local machine.
Assuming a source address of ::1 is local seems reasonable.
RFC4291 doesn't allow such a source address either, so drop such packets.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into clk-fixes
Pull Allwinner clock fixes for 4.11 from Maxime Ripard:
Two build errors fixes for the sunxi-ng drivers.
The two other patches fix random CPU crashes happening on the A33 since
CPUFreq has been enabled in 4.11.
* tag 'sunxi-clk-fixes-for-4.11-2-bis' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
clk: sunxi-ng: a33: gate then ungate PLL CPU clk after rate change
clk: sunxi-ng: Add clk notifier to gate then ungate PLL clocks
clk: sunxi-ng: fix build failure in ccu-sun9i-a80 driver
clk: sunxi-ng: fix build error without CONFIG_RESET_CONTROLLER
|
|
Sean Wang says:
====================
mediatek: Fix crash caused by reporting inconsistent skb->len to BQL
Changes since v1:
- fix inconsistent enumeration which easily causes the potential bug
The series fixes kernel BUG caused by inconsistent SKB length reported
into BQL. The reason for inconsistent length comes from hardware BUG which
results in different port number carried on the TXD within the lifecycle of
SKB. So patch 2) is proposed for use a software way to track which port
the SKB involving instead of hardware way. And patch 1) is given for another
issue I found which causes TXD and SKB inconsistency that is not expected
in the initial logic, so it is also being corrected it in the series.
The log for the kernel BUG caused by the issue is posted as below.
[ 120.825955] kernel BUG at ... lib/dynamic_queue_limits.c:26!
[ 120.837684] Internal error: Oops - BUG: 0 [#1] SMP ARM
[ 120.842778] Modules linked in:
[ 120.845811] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-rc1-191576-gdbcef47 #35
[ 120.853488] Hardware name: Mediatek Cortex-A7 (Device Tree)
[ 120.859012] task: c1007480 task.stack: c1000000
[ 120.863510] PC is at dql_completed+0x108/0x17c
[ 120.867915] LR is at 0x46
[ 120.870512] pc : [<c03c19c8>] lr : [<00000046>] psr: 80000113
[ 120.870512] sp : c1001d58 ip : c1001d80 fp : c1001d7c
[ 120.881895] r10: 0000003e r9 : df6b3400 r8 : 0ed86506
[ 120.887075] r7 : 00000001 r6 : 00000001 r5 : 0ed8654c r4 : df0135d8
[ 120.893546] r3 : 00000001 r2 : df016800 r1 : 0000fece r0 : df6b3480
[ 120.900018] Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 120.907093] Control: 10c5387d Table: 9e27806a DAC: 00000051
[ 120.912789] Process swapper/0 (pid: 0, stack limit = 0xc1000218)
[ 120.918744] Stack: (0xc1001d58 to 0xc1002000)
....
121.085331] 1fc0: 00000000 c0a52a28 00000000 c10855d4 c1003c58 c0a52a24 c100885c 8000406a
[ 121.093444] 1fe0: 410fc073 00000000 00000000 c1001ff8 8000807c c0a009cc 00000000 00000000
[ 121.101575] [<c03c19c8>] (dql_completed) from [<c04cb010>] (mtk_napi_tx+0x1d0/0x37c)
[ 121.109263] [<c04cb010>] (mtk_napi_tx) from [<c05e28cc>] (net_rx_action+0x24c/0x3b8)
[ 121.116951] [<c05e28cc>] (net_rx_action) from [<c010152c>] (__do_softirq+0xe4/0x35c)
[ 121.124638] [<c010152c>] (__do_softirq) from [<c012a624>] (irq_exit+0xe8/0x150)
[ 121.131895] [<c012a624>] (irq_exit) from [<c017750c>] (__handle_domain_irq+0x70/0xc4)
[ 121.139666] [<c017750c>] (__handle_domain_irq) from [<c0101404>] (gic_handle_irq+0x58/0x9c)
[ 121.147953] [<c0101404>] (gic_handle_irq) from [<c010e18c>] (__irq_svc+0x6c/0x90)
[ 121.155373] Exception stack(0xc1001ef8 to 0xc1001f40)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix port inconsistency on TXD due to hardware BUG that would cause
different port number is carried on the same TXD between tx_map()
and tx_unmap() with the iperf test. It would cause confusing BQL
logic which leads to kernel panic when dual GMAC runs concurrently.
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix inconsistency between the TXD descriptor and the used buffer that
would cause unexpected logic at mtk_tx_unmap() during skb housekeeping.
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Now the command:
ethtool --phy-statistics eth0
will cause system crash with meassage "Unable to handle kernel NULL pointer
dereference at virtual address 00000010" from:
(kszphy_get_stats) from [<c069f1d8>] (ethtool_get_phy_stats+0xd8/0x210)
(ethtool_get_phy_stats) from [<c06a0738>] (dev_ethtool+0x5b8/0x228c)
(dev_ethtool) from [<c06b5484>] (dev_ioctl+0x3fc/0x964)
(dev_ioctl) from [<c0679f7c>] (sock_ioctl+0x170/0x2c0)
(sock_ioctl) from [<c02419d4>] (do_vfs_ioctl+0xa8/0x95c)
(do_vfs_ioctl) from [<c02422c4>] (SyS_ioctl+0x3c/0x64)
(SyS_ioctl) from [<c0107d60>] (ret_fast_syscall+0x0/0x44)
The reason: phy_driver structure for KSZ9031 phy has no .probe() callback
defined. As result, struct phy_device *phydev->priv pointer will not be
initializes (null).
This issue will affect also following phys:
KSZ8795, KSZ886X, KSZ8873MLL, KSZ9031, KSZ9021, KSZ8061, KS8737
Fix it by:
- adding .probe() = kszphy_probe() callback to KSZ9031, KSZ9021
phys. The kszphy_probe() can be re-used as it doesn't do any phy specific
settings.
- removing statistic callbacks from other phys (KSZ8795, KSZ886X,
KSZ8873MLL, KSZ8061, KS8737) as they doesn't have corresponding
statistic counters.
Fixes: 2b2427d06426 ("phy: micrel: Add ethtool statistics counters")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Only need 1 l3mdev FIB rule. Fix setting NLM_F_EXCL in the nlmsghdr.
Fixes: 1aa6c4f6b8cd8 ("net: vrf: Add l3mdev rules on first device create")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add the PCI_SUBSYS_DEVID_81XX_RGX and use the same to set
the max bgx per node count.
This fixes the issue intoduced by following commit
78aacb6f6 net: thunderx: Fix invalid mac addresses for node1 interfaces
With this commit the max_bgx_per_node for 81xx is set as 2 instead of 3
because of which num_vfs is always calculated as zero.
Signed-off-by: George Cherian <george.cherian@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Syzkaller reported a use-after-free in ip_recv_error at line
info->ipi_ifindex = skb->dev->ifindex;
This function is called on dequeue from the error queue, at which
point the device pointer may no longer be valid.
Save ifindex on enqueue in __skb_complete_tx_timestamp, when the
pointer is valid or NULL. Store it in temporary storage skb->cb.
It is safe to reference skb->dev here, as called from device drivers
or dev_queue_xmit. The exception is when called from tcp_ack_tstamp;
in that case it is NULL and ifindex is set to 0 (invalid).
Do not return a pktinfo cmsg if ifindex is 0. This maintains the
current behavior of not returning a cmsg if skb->dev was NULL.
On dequeue, the ipv4 path will cast from sock_exterr_skb to
in_pktinfo. Both have ifindex as their first element, so no explicit
conversion is needed. This is by design, introduced in commit
0b922b7a829c ("net: original ingress device index in PKTINFO"). For
ipv6 ip6_datagram_support_cmsg converts to in6_pktinfo.
Fixes: 829ae9d61165 ("net-timestamp: allow reading recv cmsg on errqueue with origin tstamp")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Similar to commit 87e9f0315952
("ipv4: fix a potential deadlock in mcast getsockopt() path"),
there is a deadlock scenario for IP_ROUTER_ALERT too:
CPU0 CPU1
---- ----
lock(rtnl_mutex);
lock(sk_lock-AF_INET);
lock(rtnl_mutex);
lock(sk_lock-AF_INET);
Fix this by always locking RTNL first on all setsockopt() paths.
Note, after this patch ip_ra_lock is no longer needed either.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
For ease of management it would be nice for users to specify that the
device node for a nbd device is destroyed once it is disconnected and
there are no more users. Add a client flag and enable this operation to
happen.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
In order to support deleting the device on disconnect we need to
refcount the actual nbd_device struct. So add the refcounting framework
and change how we free the normal devices at rmmod time so we can catch
reference leaks.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Allow users to query the status of existing nbd devices. Right now this
only returns whether or not the device is connected, but could be
extended in the future to include more information.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Sometimes we like to upgrade our server without making all of our
clients freak out and reconnect. This patch provides a way to specify a
dead connection timeout to allow us to pause all requests and wait for
new connections to be opened. With this in place I can take down the
nbd server for less than the dead connection timeout time and bring it
back up and everything resumes gracefully.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
When running a disconnect torture test I noticed that sometimes we would
crash with a negative ref count on our queue. This was because we were
ending the same request twice. Turns out we were racing with
NBD_CLEAR_SOCK clearing the requests as well as the teardown of the
device clearing the requests. So instead make the ioctl only shutdown
the sockets and make it so that we only ever run nbd_clear_que from the
device teardown.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Provide a mechanism to notify userspace that there's been a link problem
on a NBD device. This will allow userspace to re-establish a connection
and provide the new socket to the device without disrupting the device.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We want to be able to reconnect dead connections to existing block
devices, so add a reconfigure netlink command. We will also allow users
to change their timeout on the fly, but everything else will require a
disconnect and reconnect. You won't be able to add more connections
either, simply replace dead connections with new more lively
connections.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The existing ioctl interface for configuring NBD devices is a bit
cumbersome and hard to extend. The other problem is we leave a
userspace app sitting in it's syscall until the device disconnects,
which is less than ideal.
This patch introduces a netlink interface for adding and disconnecting
nbd devices. This has the benefits of being easily extendable without
breaking older userspace applications, and allows us to configure a nbd
device without leaving a userspace app sitting waiting for the device to
disconnect.
With this interface we also gain the ability to configure more devices
than are preallocated at insmod time. We also have gained the ability
to not specify a particular device and be provided one for us so that
userspace doesn't need to find a free device to configure.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
In preparation for the upcoming netlink interface we need to not rely on
already having the bdev for the NBD device we are doing operations on.
Instead of passing the bdev around, just use it in places where we know
we already have the bdev.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
In order to properly refcount the various aspects of a NBD device we
need to separate out the configuration elements of the nbd device. The
configuration of a NBD device has a different lifetime from the actual
device, so it doesn't make sense to bundle these two concepts. Add a
config_refs to keep track of the configuration structure, that way we
can be sure that we never access it when we've torn down the device.
Add a new nbd_config structure to hold all of the transient
configuration information. Finally create this when we open the device
so that it is in place when we start to configure the device. This has
a nice side-effect of fixing a long standing problem where you could end
up with a half-configured nbd device that needed to be "disconnected" in
order to be usable again. Now once we close our device the
configuration will be discarded.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Currently if we have multiple connections and one of them goes down we will tear
down the whole device. However there's no reason we need to do this as we
could have other connections that are working fine. Deal with this by keeping
track of the state of the different connections, and if we lose one we mark it
as dead and send all IO destined for that socket to one of the other healthy
sockets. Any outstanding requests that were on the dead socket will timeout and
be re-submitted properly.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
When adding a new socket we look it up and then try to add it to our
configuration. If any of those steps fail we need to make sure we put
the socket so we don't leak them.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The number of rx queues is determined by the rss_cpus parameter
or the cpu topology. If that is higher than EFX_MAX_RX_QUEUES the
driver can corrupt state.
Fixes: 8ceee660aacb ("New driver "sfc" for Solarstorm SFC4000 controller.")
Signed-off-by: Bert Kenward <bkenward@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"Again, a batch that's been sitting a couple of weeks, mostly because
I anticipated a bit more material but it didn't show up -- which is
good.
These are all your garden variety fixes for ARM platforms.
The most visible issue fixed here is probably the SMP reset issue on
OMAP, the rest are minor stuff"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
arm64: allwinner: a64: add pmu0 regs for USB PHY
ARM: OMAP2+: omap_device: Sync omap_device and pm_runtime after probe defer
reset: add exported __reset_control_get, return NULL if optional
ARM: orion5x: only call into phylib when available
ARM: omap2+: Revert omap-smp.c changes resetting CPU1 during boot
ARM: dts: am335x-evmsk: adjust mmc2 param to allow suspend
ARM: dts: ti: fix PCI bus dtc warnings
ARM: dts: am335x-baltos: disable EEE for Atheros 8035 PHY
ARM: dts: OMAP3: Fix MFG ID EEPROM
ARM: sun8i: a33: add operating-points-v2 property to all nodes
ARM: sun8i: a33: remove highest OPP to fix CPU crashes
|
|
Pull block fixes from Jens Axboe:
"Four small fixes.
Three of them fix the same error in NVMe, in loop, fc, and rdma
respectively. The last fix from Ming fixes a regression in this
series, where our bvec gap logic was wrong and causes an oops on
NVMe for certain conditions"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix bio_will_gap() for first bvec with offset
nvme-fc: Fix sqsize wrong assignment based on ctrl MQES capability
nvme-rdma: Fix sqsize wrong assignment based on ctrl MQES capability
nvme-loop: Fix sqsize wrong assignment based on ctrl MQES capability
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into fixes
Regression fix for omap interconnect code for deferred probe.
Without this fix we can get PM related warnings for devices that
use deferred probe. If necessary, this fix can wait for the
v4.12 merge window no problem.
* tag 'omap-for-v4.11/fixes-rc6-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
ARM: OMAP2+: omap_device: Sync omap_device and pm_runtime after probe defer
ARM: omap2+: Revert omap-smp.c changes resetting CPU1 during boot
ARM: dts: am335x-evmsk: adjust mmc2 param to allow suspend
ARM: dts: ti: fix PCI bus dtc warnings
ARM: dts: am335x-baltos: disable EEE for Atheros 8035 PHY
ARM: dts: OMAP3: Fix MFG ID EEPROM
Signed-off-by: Olof Johansson <olof@lixom.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
"Unfortunately, the commit to fix the cgroup mount race in the previous
pull request can lead to hangs.
The original bug has been around for a while and isn't too likely to
be triggered in usual use cases. Revert the commit for now"
* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Revert "cgroup: avoid attaching a cgroup root to two different superblocks"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty fix from Greg KH:
"Here is a single tty core revert for a patch that was reported to
cause problems.
The original issue is one that we have lived with for decades, so
trying to scramble to fix the fix in time for 4.11-final does not make
sense due to the fragility of the tty ldisc layer. Just reverting it
makes sense for now"
* tag 'tty-4.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
Revert "tty: don't panic on OOM in tty_set_ldisc()"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"While rewriting the function probe code, I stumbled over a long
standing bug. This bug has been there sinc function tracing was added
way back when. But my new development depends on this bug being fixed,
and it should be fixed regardless as it causes ftrace to disable
itself when triggered, and a reboot is required to enable it again.
The bug is that the function probe does not disable itself properly if
there's another probe of its type still enabled. For example:
# cd /sys/kernel/debug/tracing
# echo schedule:traceoff > set_ftrace_filter
# echo do_IRQ:traceoff > set_ftrace_filter
# echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter
# echo do_IRQ:traceoff > set_ftrace_filter
The above registers two traceoff probes (one for schedule and one for
do_IRQ, and then removes do_IRQ.
But since there still exists one for schedule, it is not done
properly. When adding do_IRQ back, the breakage in the accounting is
noticed by the ftrace self tests, and it causes a warning and disables
ftrace"
* tag 'trace-v4.11-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix removing of second function probe
|
|
There were a bunch of places in pblk_lines_init() where we didn't set an
error code. And in pblk_writer_init() we accidentally return 1 instead
of a correct error code, which would result in a Oops later.
Fixes: 11a5d6fdf919 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
WARN_ON() takes a condition, not an error message. I slightly tweaked
some conditions so hopefully it's more clear.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
These labels are reversed so we could end up dereferencing an error
pointer or leaking.
Fixes: 7f347ba6bb3a ("lightnvm: physical block device (pblk) target")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This patch introduces pblk, a host-side translation layer for
Open-Channel SSDs to expose them like block devices. The translation
layer allows data placement decisions, and I/O scheduling to be
managed by the host, enabling users to optimize the SSD for their
specific workloads.
An open-channel SSD has a set of LUNs (parallel units) and a
collection of blocks. Each block can be read in any order, but
writes must be sequential. Writes may also fail, and if a block
requires it, must also be reset before new writes can be
applied.
To manage the constraints, pblk maintains a logical to
physical address (L2P) table, write cache, garbage
collection logic, recovery scheme, and logic to rate-limit
user I/Os versus garbage collection I/Os.
The L2P table is fully-associative and manages sectors at a
4KB granularity. Pblk stores the L2P table in two places, in
the out-of-band area of the media and on the last page of a
line. In the cause of a power failure, pblk will perform a
scan to recover the L2P table.
The user data is organized into lines. A line is data
striped across blocks and LUNs. The lines enable the host to
reduce the amount of metadata to maintain besides the user
data and makes it easier to implement RAID or erasure coding
in the future.
pblk implements multi-tenant support and can be instantiated
multiple times on the same drive. Each instance owns a
portion of the SSD - both regarding I/O bandwidth and
capacity - providing I/O isolation for each case.
Finally, pblk also exposes a sysfs interface that allows
user-space to peek into the internals of pblk. The interface
is available at /dev/block/*/pblk/ where * is the block
device name exposed.
This work also contains contributions from:
Matias Bjørling <matias@cnexlabs.com>
Simon A. F. Lund <slund@cnexlabs.com>
Young Tack Jin <youngtack.jin@gmail.com>
Huaicheng Li <huaicheng@cs.uchicago.edu>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|