Age | Commit message (Collapse) | Author |
|
If ip6_dst_lookup_tail has acquired a dst and fails the IPv4-mapped
check, release the dst before returning an error.
Fixes: ec5e3b0a1d41 ("ipv6: Inhibit IPv4-mapped src address on the wire.")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
"Two more bugfixes that came in during this week:
- a defconfig change to enable a vital driver used on some Qualcomm
based phones. This was already queued for 4.11, but the maintainer
asked to have it in 4.10 after all.
- a regression fix for the reset controller framework, this got
broken by a typo in the 4.10 merge window"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: multi_v7_defconfig: enable Qualcomm RPMCC
reset: fix shared reset triggered_count decrement on error
|
|
Pull ARM fixes from Russell King:
"A couple of fixes from Kees concerning problems he spotted with our
user access support"
* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 8658/1: uaccess: fix zeroing of 64-bit get_user()
ARM: 8657/1: uaccess: consistently check object sizes
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"Make the build clean by working around yet another GCC stupidity"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vm86: Fix unused variable warning if THP is disabled
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Thomas Gleixner:
"Move the futex init function to core initcall so user mode helper does
not run into an uninitialized futex syscall"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Move futex_init() to core_initcall
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two small fixes::
- Prevent deadlock on the tick broadcast lock. Found and fixed by
Mike.
- Stop using printk() in the timekeeping debug code to prevent a
deadlock against the scheduler"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Use deferred printk() in debug code
tick/broadcast: Prevent deadlock on tick_broadcast_lock
|
|
Pull networking fixes from David Miller:
1) Fix leak in dpaa_eth error paths, from Dan Carpenter.
2) Use after free when using IPV6_RECVPKTINFO, from Andrey Konovalov.
3) fanout_release() cannot be invoked from atomic contexts, from Anoob
Soman.
4) Fix bogus attempt at lockdep annotation in IRDA.
5) dev_fill_metadata_dst() can OOP on a NULL dst cache pointer, from
Paolo Abeni.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
irda: Fix lockdep annotations in hashbin_delete().
vxlan: fix oops in dev_fill_metadata_dst
dccp: fix freeing skb too early for IPV6_RECVPKTINFO
dpaa_eth: small leak on error
packet: Do not call fanout_release from atomic contexts
|
|
Use rcuidle console tracepoint because, apparently, it may be issued
from an idle CPU:
hw-breakpoint: Failed to enable monitor mode on CPU 0.
hw-breakpoint: CPU 0 failed to disable vector catch
===============================
[ ERR: suspicious RCU usage. ]
4.10.0-rc8-next-20170215+ #119 Not tainted
-------------------------------
./include/trace/events/printk.h:32 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
RCU used illegally from idle CPU!
rcu_scheduler_active = 2, debug_locks = 0
RCU used illegally from extended quiescent state!
2 locks held by swapper/0/0:
#0: (cpu_pm_notifier_lock){......}, at: [<c0237e2c>] cpu_pm_exit+0x10/0x54
#1: (console_lock){+.+.+.}, at: [<c01ab350>] vprintk_emit+0x264/0x474
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-rc8-next-20170215+ #119
Hardware name: Generic OMAP4 (Flattened Device Tree)
console_unlock
vprintk_emit
vprintk_default
printk
reset_ctrl_regs
dbg_cpu_pm_notify
notifier_call_chain
cpu_pm_exit
omap_enter_idle_coupled
cpuidle_enter_state
cpuidle_enter_state_coupled
do_idle
cpu_startup_entry
start_kernel
This RCU warning, however, is suppressed by lockdep_off() in printk().
lockdep_off() increments the ->lockdep_recursion counter and thus
disables RCU_LOCKDEP_WARN() and debug_lockdep_rcu_enabled(), which want
lockdep to be enabled "current->lockdep_recursion == 0".
Link: http://lkml.kernel.org/r/20170217015932.11898-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Tony Lindgren <tony@atomide.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Russell King <rmk@armlinux.org.uk>
Cc: <stable@vger.kernel.org> [3.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
trivial fix to spelling mistake in error message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
This patch enables the Qualcomm RPM based Clock Controller present on
A-family boards.
Signed-off-by: Andy Gross <andy.gross@linaro.org>
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
trivial fix to spelling mistake in error message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
commit 82e88ff1ea94 ("hrtimer: Revert CLOCK_MONOTONIC_RAW support") removed
unfortunately a sanity check in the hrtimer code which was part of that
MONOTONIC_RAW patch series.
It would have caught the bogus usage of CLOCK_MONOTONIC_RAW in the wireless
code. So bring it back.
It is way too easy to take any random clockid and feed it to the hrtimer
subsystem. At best, it gets mapped to a monotonic base, but it would be
better to just catch illegal values as early as possible.
Detect invalid clockids, map them to CLOCK_MONOTONIC and emit a warning.
[ tglx: Replaced the BUG by a WARN and gracefully map to CLOCK_MONOTONIC ]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Tomasz Nowicki <tn@semihalf.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Link: http://lkml.kernel.org/r/1452879670-16133-3-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Add maintainer information for bmips-cpufreq.c.
Signed-off-by: Markus Mayer <mmayer@broadcom.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Since commit 2d984ad132a8 (PM / QoS: Introcuce latency tolerance device
PM QoS type) we reassign "c" to point at qos->latency_tolerance before
freeing c->notifiers, but the notifiers field of latency_tolerance is
never used.
Restore the original behaviour of freeing the notifiers pointer on
qos->resume_latency, which is used, and fix the following kmemleak
warning.
unreferenced object 0xed9dba00 (size 64):
comm "kworker/0:1", pid 36, jiffies 4294670128 (age 15202.983s)
hex dump (first 32 bytes):
00 00 00 00 04 ba 9d ed 04 ba 9d ed 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<c06f6084>] kmemleak_alloc+0x74/0xb8
[<c011c964>] kmem_cache_alloc_trace+0x170/0x25c
[<c035f448>] dev_pm_qos_constraints_allocate+0x3c/0xe4
[<c035f574>] __dev_pm_qos_add_request+0x84/0x1a0
[<c035f6cc>] dev_pm_qos_add_request+0x3c/0x54
[<c03c3fc4>] usb_hub_create_port_device+0x110/0x2b8
[<c03b2a60>] hub_probe+0xadc/0xc80
[<c03bb050>] usb_probe_interface+0x1b4/0x260
[<c035773c>] driver_probe_device+0x198/0x40c
[<c0357b14>] __device_attach_driver+0x8c/0x98
[<c0355bbc>] bus_for_each_drv+0x8c/0x9c
[<c0357494>] __device_attach+0x98/0x138
[<c0357c64>] device_initial_probe+0x14/0x18
[<c03569dc>] bus_probe_device+0x30/0x88
[<c0354c54>] device_add+0x430/0x554
[<c03b92d8>] usb_set_configuration+0x660/0x6fc
Fixes: 2d984ad132a8 (PM / QoS: Introcuce latency tolerance device PM QoS type)
Signed-off-by: John Keeping <john@metanate.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
When passing "test_suspend=mem" to the kernel:
PM: can't test 'mem' suspend state
and the suspend test is not run.
Commit 406e79385f3223d8 ("PM / sleep: System sleep state selection
interface rework") changed pm_labels[] from a contiguous NULL-terminated
array to a sparse array (with the first element unpopulated), breaking
the assumptions of the iterator in setup_test_suspend().
Iterate from PM_SUSPEND_MIN to PM_SUSPEND_MAX - 1 to fix this.
Fixes: 406e79385f3223d8 (PM / sleep: System sleep state selection interface rework)
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
A nested lock depth was added to the hasbin_delete() code but it
doesn't actually work some well and results in tons of lockdep splats.
Fix the code instead to properly drop the lock around the operation
and just keep peeking the head of the hashbin queue.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Pull block layer fix from Jens Axboe:
"A single fix for a lockdep splat reported by Thomas and Gabriel"
* 'for-linus' of git://git.kernel.dk/linux-block:
cfq-iosched: don't call wbt_disable_default() with IRQs disabled
|
|
Since the commit 0c1d70af924b ("net: use dst_cache for vxlan device")
vxlan_fill_metadata_dst() calls vxlan_get_route() passing a NULL
dst_cache pointer, so the latter should explicitly check for
valid dst_cache ptr. Unfortunately the commit d71785ffc7e7 ("net: add
dst_cache to ovs vxlan lwtunnel") removed said check.
As a result is possible to trigger a null pointer access calling
vxlan_fill_metadata_dst(), e.g. with:
ovs-vsctl add-br ovs-br0
ovs-vsctl add-port ovs-br0 vxlan0 -- set interface vxlan0 \
type=vxlan options:remote_ip=192.168.1.1 \
options:key=1234 options:dst_port=4789 ofport_request=10
ip address add dev ovs-br0 172.16.1.2/24
ovs-vsctl set Bridge ovs-br0 ipfix=@i -- --id=@i create IPFIX \
targets=\"172.16.1.1:1234\" sampling=1
iperf -c 172.16.1.1 -u -l 1000 -b 10M -t 1 -p 1234
This commit addresses the issue passing to vxlan_get_route() the
dst_cache already available into the lwt info processed by
vxlan_fill_metadata_dst().
Fixes: d71785ffc7e7 ("net: add dst_cache to ovs vxlan lwtunnel")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
sk_page_frag_refill() allocates either a compound page or an order-0
page. We can use page_ref_inc() which is slightly faster than get_page()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
precision loss in window scaling
Prevent sending out a left-shifted sequence number from a Linux sender in
response to a peer's shrunk receive-window caused by losing least significant
bits in window-scaling.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Cheng Cui <Cheng.Cui@netapp.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Edward Cree says:
====================
sfc: misc. fixes
Three largely unrelated fixes to increase robustness in rare edge cases.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
efx_start_all can return without initialising queues as a reset is pending.
This means that when netif_device_attach is called, the kernel can start
sending traffic without having an initialised TX queue to send to.
This patch avoids this by not calling netif_device_attach if there is a
pending reset.
Fixes: e283546c0465 ("sfc:On MCDI timeout, issue an FLR (and mark MCDI to fail-fast)")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If the hw doesn't think they exist, we should defer to its authority.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
On EF10, hardware filter IDs are 13 bits, but in some places we store
32-bit "full filter IDs" in which higher order bits encode the filter
match-priority. This could cause a filter to have a full filter ID of
0xffff, which is also the value EFX_EF10_FILTER_ID_INVALID which we use
in 16-bit "short" filter IDs (without match-priority bits). This would
occur if the hardware filter ID was 0x1fff and the match-priority was 7.
Unfortunately, some code that checks for EFX_EF10_FILTER_ID_INVALID can
be called on full filter IDs, and will WARN_ON if this ever happens.
So, since we have plenty of spare bits in the full filter ID, this patch
shifts the priority bits left one bit when constructing the full filter
IDs, ensuring that the 0x2000 bit of a full filter ID will always be 0
and thus no full filter ID can ever equal EFX_EF10_FILTER_ID_INVALID.
This patch also replaces open-coded full<->short filter ID conversions
with calls to functions, thus keeping the definition of the full filter
ID format in one place.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
I got a warning about broken code on ARM64 with 64K pages:
drivers/net/vmxnet3/vmxnet3_drv.c: In function 'vmxnet3_rq_init':
drivers/net/vmxnet3/vmxnet3_drv.c:1679:29: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
rq->buf_info[0][i].len = PAGE_SIZE;
'len' here is a 16-bit integer, so this clearly won't work. I don't think
this driver is used much on anything other than x86, so there is no need
to fix this properly and we can work around it with a Kconfig dependency
to forbid known-broken configurations. qemu in theory supports it on
other architectures too, but presumably only for compatibility with x86
guests that also run on vmware.
CONFIG_PAGE_SIZE_64KB is used on hexagon, mips, sh and tile, the other
symbols are architecture-specific names for the same thing.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It is required to build it as a module.
Signed-off-by: Valentin Longchamp <valentin.longchamp@keymile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the function rds_ib_xmit_atomic, ib_ring is not allocated
successfully. As such, it is not necessary to unalloc it.
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use PCI_DEVICE_ID_NETRONOME_NFP*, defined in linux/pci_ids.h,
rather than replicating the same values in the NFP driver.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The qdisc_stab_lock is used in qdisc_get_stab and qdisc_put_stab.
These two functions are invoked in qdisc_create, qdisc_change, and
qdisc_destroy which run fully under RTNL.
So it already makes sure only one could access the qdisc_stab_list at
the same time. Then it is unnecessary to use qdisc_stab_lock now.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Change module filename from af-rxrpc.ko to rxrpc.ko so as to be consistent
with the other protocol drivers.
Also adjust the documentation to reflect this.
Further, there is no longer a standalone rxkad module, as it has been
merged into the rxrpc core, so get rid of references to that.
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Return the correct tx_errors stats in netvsc.
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Simon Xiao <sixiao@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When allocating rtnl dump messages, struct ifla_port_vsi is never dumped,
so we can save header plus payload in rtnl_port_size(). Infact, attribute
IFLA_PORT_VSI_TYPE and struct ifla_port_vsi are not used anywhere in
the kernel. We only need to keep the nla policy should applications in
user space be filling this out. Same NLA_BINARY issue exists as was fixed
in 364d5716a7ad ("rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY")
and others, but then again IFLA_PORT_VSI_TYPE is not used anywhere, so
just add a comment that it's unused.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We need to verify that the controller supports the security
commands before actually trying to issue them.
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
[hch: moved the check so that we don't call into the OPAL code if not
supported]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Insted of bloating the containing structure with it all the time this
allocates struct opal_dev dynamically. Additionally this allows moving
the definition of struct opal_dev into sed-opal.c. For this a new
private data field is added to it that is passed to the send/receive
callback. After that a lot of internals can be made private as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Scott Bauer <scott.bauer@intel.com>
Reviewed-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Not having OPAL or a sub-feature supported is an entirely normal
condition for many drives, so don't warn about it. Keep the messages,
but tone them down to debug only.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
For blk-mq with scheduling, we can potentially end up with ALL
driver tags assigned and sitting on the flush queues. If we
defer because of an inlfight data request, then we can deadlock
if that data request doesn't already have a tag assigned.
This fixes a deadlock with running the xfs/297 xfstest, where
thousands of syncs can cause the drive queue to stall.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
|
|
Usually we don't ask the scheduler for work, if we already have
leftovers on the dispatch list. This is done to leave work on
the scheduler side for as long as possible, for proper merging.
But if we do have work leftover but didn't dispatch anything,
then we should ask the scheduler since we could potentially
issue requests from that.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
|
|
The current request insertion machinery works just fine for
directly inserting flushes, so no need to special case
this anymore.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
|
|
If we are currently out of driver tags, we don't want to add a
new flush (without a tag) to the head of the requeue list. We
want to add it to the back, behind the others that are
potentially also waiting for a tag.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
|
|
Currently we're almost there, but if we dispatch nothing, then we
still return success.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
|
|
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
As I don't have the hardware, I'd be very pleased if
someone may test this patch.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
As I don't have the hardware, I'd be very pleased if
someone may test this patch.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
added_by_external_learn fdb entries are added and expired by
external entities like switchdev driver or external controllers.
ageing is already disabled for such entries. Hence, don't
indicate expiry for such fdb entries.
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Daniel Borkmann says:
====================
Misc BPF improvements
This last series for this window adds various misc
improvements to BPF, one is to mark registered map and
prog types as __ro_after_init, another one for removing
cBPF stubs in eBPF JITs and moving the stub to the core
and last also improving JITs is to make generated images
visible to the kernel and kallsyms, so they can be
seen in traces. For details, please have a look at the
individual patches.
Thanks a lot!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.
This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.
Before:
7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)
After:
7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Remove the dummy bpf_jit_compile() stubs for eBPF JITs and make
that a single __weak function in the core that can be overridden
similarly to the eBPF one. Also remove stale pr_err() mentions
of bpf_jit_compile.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
All map types and prog types are registered to the BPF core through
bpf_register_map_type() and bpf_register_prog_type() during init and
remain unchanged thereafter. As by design we don't (and never will)
have any pluggable code that can register to that at any later point
in time, lets mark all the existing bpf_{map,prog}_type_list objects
in the tree as __ro_after_init, so they can be moved to read-only
section from then onwards.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|