Age | Commit message (Collapse) | Author |
|
Ido Schimmel says:
====================
mlxsw: Various fixes
This patchset contains various small fixes for mlxsw.
Patch #1 fixes a warning generated by switchdev core when the driver
fails to insert an MDB entry in the commit phase.
Patches #2-#4 fix a warning in check_flush_dependency() that can be
triggered when a work item in a WQ_MEM_RECLAIM workqueue tries to flush
a non-WQ_MEM_RECLAIM workqueue.
It seems that the semantics of the WQ_MEM_RECLAIM flag are not very
clear [1] and that various patches have been sent to remove it from
various workqueues throughout the kernel [2][3][4] in order to silence
the warning.
These patches do the same for the workqueues created by mlxsw that
probably should not have been created with this flag in the first place.
Patch #5 fixes a regression where an IP address cannot be assigned to a
VRF upper due to erroneous MAC validation check. Patch #6 adds a test
case.
Patch #7 adjusts Spectrum-2 shared buffer configuration to be compatible
with Spectrum-1. The problem and fix are described in detail in the
commit message.
Please consider patches #1-#5 for 5.0.y. I verified they apply cleanly.
[1] https://patchwork.kernel.org/patch/10791315/
[2] Commit ce162bfbc0b6 ("mac80211_hwsim: don't use WQ_MEM_RECLAIM")
[3] Commit 39baf10310e6 ("IB/core: Fix use workqueue without WQ_MEM_RECLAIM")
[4] Commit 75215e5bb22c ("iwcm: Don't allocate iwcm workqueue with WQ_MEM_RECLAIM")
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In Spectrum-1, when a multicast packet is admitted to the shared buffer
it increases the quotas of all the ports and {port, TC} to which it is
forwarded to.
The above means that multicast packets are accounted multiple times in
the shared buffer and can therefore cause the associated shared buffer
pool to fill up very quickly.
To work around this issue, commit e83c045e53d7 ("mlxsw:
spectrum_buffers: Configure MC pool") added a dedicated multicast pool
in which multicast packets are accounted.
The issue is not present in Spectrum-2, but in order to be backward
compatible with Spectrum-1, its default behavior is to allow a multicast
packet to increase multiple egress quotas instead of one.
Until the new (non-backward compatible) mode is supported, configure a
dedicated multicast pool as in Spectrum-1.
Fixes: fe099bf682ab ("mlxsw: spectrum_buffers: Add Spectrum-2 shared buffer configuration")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Test that it is possible to set an IP address on a VRF and that it is
not vetoed.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 74bc99397438 ("mlxsw: spectrum_router: Veto unsupported RIF MAC
addresses") enabled the driver to veto router interface (RIF) MAC
addresses that it cannot support.
This check should only be performed for interfaces for which the driver
actually configures a RIF. A VRF upper is not one of them, so ignore it.
Without this patch it is not possible to set an IP address on the VRF
device and use it as a loopback.
Fixes: 74bc99397438 ("mlxsw: spectrum_router: Veto unsupported RIF MAC addresses")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Alexander Petrovskiy <alexpe@mellanox.com>
Tested-by: Alexander Petrovskiy <alexpe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The workqueue is used to periodically update the networking stack about
activity / statistics of various objects such as neighbours and TC
actions.
It should not be called as part of memory reclaim path, so remove the
WQ_MEM_RECLAIM flag.
Fixes: 3d5479e92087 ("mlxsw: core: Remove deprecated create_workqueue")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The ordered workqueue is used to offload various objects such as routes
and neighbours in the order they are notified.
It should not be called as part of memory reclaim path, so remove the
WQ_MEM_RECLAIM flag. This can also result in a warning [1], if a worker
tries to flush a non-WQ_MEM_RECLAIM workqueue.
[1]
[97703.542861] workqueue: WQ_MEM_RECLAIM mlxsw_core_ordered:mlxsw_sp_router_fib6_event_work [mlxsw_spectrum] is flushing !WQ_MEM_RECLAIM events:rht_deferred_worker
[97703.542884] WARNING: CPU: 1 PID: 32492 at kernel/workqueue.c:2605 check_flush_dependency+0xb5/0x130
...
[97703.542988] Hardware name: Mellanox Technologies Ltd. MSN3700C/VMOD0008, BIOS 5.11 10/10/2018
[97703.543049] Workqueue: mlxsw_core_ordered mlxsw_sp_router_fib6_event_work [mlxsw_spectrum]
[97703.543061] RIP: 0010:check_flush_dependency+0xb5/0x130
...
[97703.543071] RSP: 0018:ffffb3f08137bc00 EFLAGS: 00010086
[97703.543076] RAX: 0000000000000000 RBX: ffff96e07740ae00 RCX: 0000000000000000
[97703.543080] RDX: 0000000000000094 RSI: ffffffff82dc1934 RDI: 0000000000000046
[97703.543084] RBP: ffffb3f08137bc20 R08: ffffffff82dc18a0 R09: 00000000000225c0
[97703.543087] R10: 0000000000000000 R11: 0000000000007eec R12: ffffffff816e4ee0
[97703.543091] R13: ffff96e06f6a5c00 R14: ffff96e077ba7700 R15: ffffffff812ab0c0
[97703.543097] FS: 0000000000000000(0000) GS:ffff96e077a80000(0000) knlGS:0000000000000000
[97703.543101] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[97703.543104] CR2: 00007f8cd135b280 CR3: 00000001e860e003 CR4: 00000000003606e0
[97703.543109] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[97703.543112] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[97703.543115] Call Trace:
[97703.543129] __flush_work+0xbd/0x1e0
[97703.543137] ? __cancel_work_timer+0x136/0x1b0
[97703.543145] ? pwq_dec_nr_in_flight+0x49/0xa0
[97703.543154] __cancel_work_timer+0x136/0x1b0
[97703.543175] ? mlxsw_reg_trans_bulk_wait+0x145/0x400 [mlxsw_core]
[97703.543184] cancel_work_sync+0x10/0x20
[97703.543191] rhashtable_free_and_destroy+0x23/0x140
[97703.543198] rhashtable_destroy+0xd/0x10
[97703.543254] mlxsw_sp_fib_destroy+0xb1/0xf0 [mlxsw_spectrum]
[97703.543310] mlxsw_sp_vr_put+0xa8/0xc0 [mlxsw_spectrum]
[97703.543364] mlxsw_sp_fib_node_put+0xbf/0x140 [mlxsw_spectrum]
[97703.543418] ? mlxsw_sp_fib6_entry_destroy+0xe8/0x110 [mlxsw_spectrum]
[97703.543475] mlxsw_sp_router_fib6_event_work+0x6cd/0x7f0 [mlxsw_spectrum]
[97703.543484] process_one_work+0x1fd/0x400
[97703.543493] worker_thread+0x34/0x410
[97703.543500] kthread+0x121/0x140
[97703.543507] ? process_one_work+0x400/0x400
[97703.543512] ? kthread_park+0x90/0x90
[97703.543523] ret_from_fork+0x35/0x40
Fixes: a3832b31898f ("mlxsw: core: Create an ordered workqueue for FIB offload")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Semion Lisyansky <semionl@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The EMAD workqueue is used to handle retransmission of EMAD packets that
contain configuration data for the device's firmware.
Given the workers need to allocate these packets and that the code is
not called as part of memory reclaim path, remove the WQ_MEM_RECLAIM
flag.
Fixes: d965465b60ba ("mlxsw: core: Fix possible deadlock")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The driver cannot guarantee in the prepare phase that it will be able to
write an MDB entry to the device. In case the driver returned success
during the prepare phase, but then failed to add the entry in the commit
phase, a WARNING [1] will be generated by the switchdev core.
Fix this by doing the work in the prepare phase instead.
[1]
[ 358.544486] swp12s0: Commit of object (id=2) failed.
[ 358.550061] WARNING: CPU: 0 PID: 30 at net/switchdev/switchdev.c:281 switchdev_port_obj_add_now+0x9b/0xe0
[ 358.560754] CPU: 0 PID: 30 Comm: kworker/0:1 Not tainted 5.0.0-custom-13382-gf2449babf221 #1350
[ 358.570472] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[ 358.580582] Workqueue: events switchdev_deferred_process_work
[ 358.587001] RIP: 0010:switchdev_port_obj_add_now+0x9b/0xe0
...
[ 358.614109] RSP: 0018:ffffa6b900d6fe18 EFLAGS: 00010286
[ 358.619943] RAX: 0000000000000000 RBX: ffff8b00797ff000 RCX: 0000000000000000
[ 358.627912] RDX: ffff8b00b7a1d4c0 RSI: ffff8b00b7a152e8 RDI: ffff8b00b7a152e8
[ 358.635881] RBP: ffff8b005c3f5bc0 R08: 000000000000022b R09: 0000000000000000
[ 358.643850] R10: 0000000000000000 R11: ffffa6b900d6fcc8 R12: 0000000000000000
[ 358.651819] R13: dead000000000100 R14: ffff8b00b65a23c0 R15: 0ffff8b00b7a2200
[ 358.659790] FS: 0000000000000000(0000) GS:ffff8b00b7a00000(0000) knlGS:0000000000000000
[ 358.668820] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 358.675228] CR2: 00007f00aad90de0 CR3: 00000001ca80d000 CR4: 00000000001006f0
[ 358.683188] Call Trace:
[ 358.685918] switchdev_port_obj_add_deferred+0x13/0x60
[ 358.691655] switchdev_deferred_process+0x6b/0xf0
[ 358.696907] switchdev_deferred_process_work+0xa/0x10
[ 358.702548] process_one_work+0x1f5/0x3f0
[ 358.707022] worker_thread+0x28/0x3c0
[ 358.711099] ? process_one_work+0x3f0/0x3f0
[ 358.715768] kthread+0x10d/0x130
[ 358.719369] ? __kthread_create_on_node+0x180/0x180
[ 358.724815] ret_from_fork+0x35/0x40
Fixes: 3a49b4fde2a1 ("mlxsw: Adding layer 2 multicast support")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Alex Kushnarov <alexanderk@mellanox.com>
Tested-by: Alex Kushnarov <alexanderk@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The introduction of multipage bio vectors broke pblk's partial read
logic due to it not being prepared for multipage bio vectors.
Use bio vector iterators instead of direct bio vector indexing.
Fixes: 07173c3ec276 ("block: enable multipage bvecs")
Reported-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Updated description.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When a QP is put into error state, the send queue will be flushed.
This mechanism is implemented in both the first and the second leg
of the send engine. Since the second leg is only responsible for
data transactions in the KDETH space for the TID RDMA WRITE request,
it should not perform the flushing of the send queue.
This patch removes the flushing function of the second leg, but
still keeps the bailing out of the QP if it is put into error state.
Fixes: 70dcb2e3dc6a ("IB/hfi1: Add the TID second leg send packet builder")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Pull virtio fixes from Michael Tsirkin:
"Several fixes, add more reviewers to the list"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio: Honour 'may_reduce_num' in vring_create_virtqueue
MAiNTAINERS: add Paolo, Stefan for virtio blk/scsi
virtio_pci: fix a NULL pointer reference in vp_del_vqs
|
|
The Maximum Physical Memory 2GiB option for 64bit systems is currently
broken because kernel hangs at boot-time when this option is enabled
and the underlying system has more than 2GiB memory.
This issue can be easily reproduced on SiFive Unleashed board where
we have 8GiB of memory.
This patch fixes above issue by removing unusable memory region in
setup_bootmem().
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
|
|
wcd9335.c: undefined reference to 'devm_regmap_add_irq_chip'
Signed-off-by: Marc Gonzalez <marc.w.gonzalez@free.fr>
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Commit 7769db588384 ("drm/i915/dp: optimize eDP 1.4+ link config fast
and narrow") started to optize the eDP 1.4+ link config, both per spec
and as preparation for display stream compression support.
Sadly, we again face panels that flat out fail with parameters they
claim to support. Revert, and go back to the drawing board.
v2: Actually revert to max params instead of just wide-and-slow.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109959
Fixes: 7769db588384 ("drm/i915/dp: optimize eDP 1.4+ link config fast and narrow")
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Manasi Navare <manasi.d.navare@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matt Atwood <matthew.s.atwood@intel.com>
Cc: "Lee, Shawn C" <shawn.c.lee@intel.com>
Cc: Dave Airlie <airlied@gmail.com>
Cc: intel-gfx@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v5.0+
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Manasi Navare <manasi.d.navare@intel.com>
Tested-by: Albert Astals Cid <aacid@kde.org> # v5.0 backport
Tested-by: Emanuele Panigati <ilpanich@gmail.com> # v5.0 backport
Tested-by: Matteo Iervasi <matteoiervasi@gmail.com> # v5.0 backport
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190405075220.9815-1-jani.nikula@intel.com
(cherry picked from commit f11cb1c19ad0563b3c1ea5eb16a6bac0e401f428)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
Re-enable clock gating of DDI clocks.
v2: Fix the default ddi clk state for mipi-dsi (Imre)
Fixes: 1026bea00381 ("drm/i915/icl: Ungate DSI clocks")
Signed-off-by: Vandita Kulkarni <vandita.kulkarni@intel.com>
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1553513202-13863-2-git-send-email-vandita.kulkarni@intel.com
(cherry picked from commit 942d1cf48eae3fcd7e973cfb708d5c4860f0c713)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
IO enable sequencing needs ddi clocks enabled.
These clocks will be gated at a later point in
the enable sequence.
v2: Fix the commit header (Uma)
v3: Remove the redundant read (Ville)
Fixes: 949fc52af19e ("drm/i915/icl: add pll mapping for DSI")
Signed-off-by: Vandita Kulkarni <vandita.kulkarni@intel.com>
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1553513202-13863-1-git-send-email-vandita.kulkarni@intel.com
(cherry picked from commit c5b81a325263a891d5811aabe938c87e03db4c37)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
nvme_cancel_request() is used in error handler, and it is always
reliable to cancel request synchronously, and avoids possible race
in which request may be completed after real hw queue is destroyed.
One issue is reported by our customer on NVMe RDMA, in which freed ib
queue pair may be used in nvme_rdma_complete_rq().
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: James Smart <james.smart@broadcom.com>
Cc: linux-nvme@lists.infradead.org
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In NVMe's error handler, follows the typical steps of tearing down
hardware for recovering controller:
1) stop blk_mq hw queues
2) stop the real hw queues
3) cancel in-flight requests via
blk_mq_tagset_busy_iter(tags, cancel_request, ...)
cancel_request():
mark the request as abort
blk_mq_complete_request(req);
4) destroy real hw queues
However, there may be race between #3 and #4, because blk_mq_complete_request()
may run q->mq_ops->complete(rq) remotelly and asynchronously, and
->complete(rq) may be run after #4.
This patch introduces blk_mq_complete_request_sync() for fixing the
above race.
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: James Smart <james.smart@broadcom.com>
Cc: linux-nvme@lists.infradead.org
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
With commit 01396a374c3d ("s390/zcrypt: revisit ap device remove
procedure") the ap queue remove is now a two stage process. However,
a del_timer_sync() call may trigger the timer function which may
try to lock the very same spinlock as is held by the function
just initiating the del_timer_sync() call. This could end up in
a deadlock situation. Very unlikely but possible as you need to
remove an ap queue at the exact sime time when a timeout of a
request occurs.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reported-by: Pierre Morel <pmorel@linux.ibm.com>
Fixes: commit 01396a374c3d ("s390/zcrypt: revisit ap device remove procedure")
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
The spinlock in the raw3270_view structure is used by con3270, tty3270
and fs3270 in different ways. For con3270 the lock can be acquired in
irq context, for tty3270 and fs3270 the highest context is bh.
Lockdep sees the view->lock as a single class and if the 3270 driver
is used for the console the following message is generated:
WARNING: inconsistent lock state
5.1.0-rc3-05157-g5c168033979d #12 Not tainted
--------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
swapper/0/1 [HC0[0]:SC1[1]:HE1:SE0] takes:
(____ptrval____) (&(&view->lock)->rlock){?.-.}, at: tty3270_update+0x7c/0x330
Introduce a lockdep subclass for the view lock to distinguish bh from
irq locks.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
When tag_set->nr_maps is 1, the block layer limits the number of hw queues
by nr_cpu_ids. No matter how many hw queues are used by virtio-scsi, as it
has (tag_set->nr_maps == 1), it can use at most nr_cpu_ids hw queues.
In addition, specifically for pci scenario, when the 'num_queues' specified
by qemu is more than maxcpus, virtio-scsi would not be able to allocate
more than maxcpus vectors in order to have a vector for each queue. As a
result, it falls back into MSI-X with one vector for config and one shared
for queues.
Considering above reasons, this patch limits the number of hw queues used
by virtio-scsi by nr_cpu_ids.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When tag_set->nr_maps is 1, the block layer limits the number of hw queues
by nr_cpu_ids. No matter how many hw queues are used by virtio-blk, as it
has (tag_set->nr_maps == 1), it can use at most nr_cpu_ids hw queues.
In addition, specifically for pci scenario, when the 'num-queues' specified
by qemu is more than maxcpus, virtio-blk would not be able to allocate more
than maxcpus vectors in order to have a vector for each queue. As a result,
it falls back into MSI-X with one vector for config and one shared for
queues.
Considering above reasons, this patch limits the number of hw queues used
by virtio-blk by nr_cpu_ids.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The function bfq_bfqq_expire() invokes the function
__bfq_bfqq_expire(), and the latter may free the in-service bfq-queue.
If this happens, then no other instruction of bfq_bfqq_expire() must
be executed, or a use-after-free will occur.
Basing on the assumption that __bfq_bfqq_expire() invokes
bfq_put_queue() on the in-service bfq-queue exactly once, the queue is
assumed to be freed if its refcounter is equal to one right before
invoking __bfq_bfqq_expire().
But, since commit 9dee8b3b057e ("block, bfq: fix queue removal from
weights tree") this assumption is false. __bfq_bfqq_expire() may also
invoke bfq_weights_tree_remove() and, since commit 9dee8b3b057e
("block, bfq: fix queue removal from weights tree"), also
the latter function may invoke bfq_put_queue(). So __bfq_bfqq_expire()
may invoke bfq_put_queue() twice, and this is the actual case where
the in-service queue may happen to be freed.
To address this issue, this commit moves the check on the refcounter
of the queue right around the last bfq_put_queue() that may be invoked
on the queue.
Fixes: 9dee8b3b057e ("block, bfq: fix queue removal from weights tree")
Reported-by: Dmitrii Tcvetkov <demfloro@demfloro.ru>
Reported-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Dmitrii Tcvetkov <demfloro@demfloro.ru>
Tested-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
snd_hdac_display_power() doesn't handle the concurrent calls carefully
enough, and it may lead to the doubly get_power or put_power calls,
when a runtime PM and an async work get called in racy way.
This patch addresses it by reusing the bus->lock mutex that has been
used for protecting the link state change in ext bus code, so that it
can protect against racy display state changes. The initialization of
bus->lock was moved from snd_hdac_ext_bus_init() to
snd_hdac_bus_init() as well accordingly.
Testcase: igt/i915_pm_rpm/module-reload #glk-dsi
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Imre Deak <imre.deak@intel.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
To calculate a remaining time, it's required to subtract the current time
from the expiration time. In alarm_timer_remaining() the arguments of
ktime_sub are swapped.
Fixes: d653d8457c76 ("alarmtimer: Implement remaining callback")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190408041542.26338-1-avagin@gmail.com
|
|
The following commit:
a0b0fd53e1e6 ("locking/lockdep: Free lock classes that are no longer in use")
changed the behavior of lockdep_free_key_range() from
unconditionally zapping lock classes into only zapping lock classes if
debug_lock == true. Not zapping lock classes if debug_lock == false leaves
dangling pointers in several lockdep datastructures, e.g. lock_class::name
in the all_lock_classes list.
The shell command "cat /proc/lockdep" causes the kernel to iterate the
all_lock_classes list. Hence the "unable to handle kernel paging request" cash
that Shenghui encountered by running cat /proc/lockdep.
Since the new behavior can cause cat /proc/lockdep to crash, restore the
pre-v5.1 behavior.
This patch avoids that cat /proc/lockdep triggers the following crash
with debug_lock == false:
BUG: unable to handle kernel paging request at fffffbfff40ca448
RIP: 0010:__asan_load1+0x28/0x50
Call Trace:
string+0xac/0x180
vsnprintf+0x23e/0x820
seq_vprintf+0x82/0xc0
seq_printf+0x92/0xb0
print_name+0x34/0xb0
l_show+0x184/0x200
seq_read+0x59e/0x6c0
proc_reg_read+0x11f/0x170
__vfs_read+0x4d/0x90
vfs_read+0xc5/0x1f0
ksys_read+0xab/0x130
__x64_sys_read+0x43/0x50
do_syscall_64+0x71/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Reported-by: shenghui <shhuiw@foxmail.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: a0b0fd53e1e6 ("locking/lockdep: Free lock classes that are no longer in use") # v5.1-rc1.
Link: https://lkml.kernel.org/r/20190403233552.124673-1-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Handle error before returning when try_module_get() fails
to prevent inconsistent mutex lock/unlock.
Fixes: 52034add7 (ASoC: pcm: update module refcount if
module_get_upon_open is set)
Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Before commit c5459b829b71 ("LSM: Plumb visibility into optional "enabled"
state"), /sys/module/apparmor/parameters/enabled would show "Y" or "N"
since it was using the "bool" handler. After being changed to "int",
this switched to "1" or "0", breaking the userspace AppArmor detection
of dbus-broker. This restores the Y/N output while keeping the LSM
infrastructure happy.
Before:
$ cat /sys/module/apparmor/parameters/enabled
1
After:
$ cat /sys/module/apparmor/parameters/enabled
Y
Reported-by: David Rheinsberg <david.rheinsberg@gmail.com>
Reviewed-by: David Rheinsberg <david.rheinsberg@gmail.com>
Link: https://lkml.kernel.org/r/CADyDSO6k8vYb1eryT4g6+EHrLCvb68GAbHVWuULkYjcZcYNhhw@mail.gmail.com
Fixes: c5459b829b71 ("LSM: Plumb visibility into optional "enabled" state")
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: John Johansen <john.johansen@canonical.com>
|
|
When master clock is used, master clock rate is set exclusively.
Parent clocks of master clock cannot be changed after a call to
clk_set_rate_exclusive(). So the parent clock of SAI kernel clock
must be set before.
Ensure also that exclusive rate operations are balanced
in STM32 SAI driver.
Signed-off-by: Olivier Moysan <olivier.moysan@st.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Fix wrong setting on number of channels. The context wants to set
constraint to 2 channels instead of 4.
Signed-off-by: Tzung-Bi Shih <tzungbi@google.com>
Acked-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Spurious interrupt support was added to perf in the following commit, almost
a decade ago:
63e6be6d98e1 ("perf, x86: Catch spurious interrupts after disabling counters")
The two previous patches (resolving the race condition when disabling a
PMC and NMI latency mitigation) allow for the removal of this older
spurious interrupt support.
Currently in x86_pmu_stop(), the bit for the PMC in the active_mask bitmap
is cleared before disabling the PMC, which sets up a race condition. This
race condition was mitigated by introducing the running bitmap. That race
condition can be eliminated by first disabling the PMC, waiting for PMC
reset on overflow and then clearing the bit for the PMC in the active_mask
bitmap. The NMI handler will not re-enable a disabled counter.
If x86_pmu_stop() is called from the perf NMI handler, the NMI latency
mitigation support will guard against any unhandled NMI messages.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # 4.14.x-
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/Message-ID:
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
There's no include/dt-bindings/i3c/ directory, remove this F: entry
from the I3C file patterns.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joe Perches <joe@perches.com>
Reported-by: Joe Perches <joe@perches.com>
Fixes: 4f26d0666961 ("MAINTAINERS: Add myself as the I3C subsystem maintainer")
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
|
|
The controller was being disabled incorrectly. The correct way is to clear
the DEV_CTRL_ENABLE bit.
Fix this by clearing this bit.
Cc: Boris Brezillon <bbrezillon@kernel.org>
Cc: <stable@vger.kernel.org>
Fixes: 1dd728f5d4d4 ("i3c: master: Add driver for Synopsys DesignWare IP")
Signed-off-by: Vitor Soares <vitor.soares@synopsys.com>
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
|
|
The validation of random PID should be done by checking the
boardinfo->pid instead of info.pid which is empty.
Doing the change the info struture declaration is no longer necessary.
Cc: Boris Brezillon <bbrezillon@kernel.org>
Cc: <stable@vger.kernel.org>
Fixes: 3a379bbcea0a ("i3c: Add core I3C infrastructure")
Signed-off-by: Vitor Soares <vitor.soares@synopsys.com>
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
|
|
In commit da11b417583e ("libbpf: teach libbpf about log_level bit 2"),
the BPF_LOG_BUF_SIZE was increased to 16M. The XDP socket part of
libbpf allocated the log_buf on the stack, but for the new 16M buffer
size this is not going to work. Change the code so it uses a 16K buffer
instead.
Fixes: da11b417583e ("libbpf: teach libbpf about log_level bit 2")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
The issue is reported at https://github.com/libbpf/libbpf/issues/28.
Basically, per C standard, for
void *memcpy(void *dest, const void *src, size_t n)
if "dest" or "src" is NULL, regardless of whether "n" is 0 or not,
the result of memcpy is undefined. clang ubsan reported three such
instances in bpf.c with the following pattern:
memcpy(dest, 0, 0).
Although in practice, no known compiler will cause issues when
copy size is 0. Let us still fix the issue to silence ubsan
warnings.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Create a new prio in the FDB, it will be used when inserting steering rules
into the FDB from the RDMA side. We create a new PRIO so rules from the
net side and rules from the RDMA side won't be inserted to the same PRIO,
each side has it's own sandbox to play in.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
When creating the FDB prios, use the enum values already defined and not
the hardcoded values.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
The recent commit 8bc086899816 ("powerpc/mm: Only define
MAX_PHYSMEM_BITS in SPARSEMEM configurations") removed our definition
of MAX_PHYSMEM_BITS when SPARSEMEM is disabled.
This inadvertently broke some 64-bit FLATMEM using configs with eg:
arch/powerpc/include/asm/book3s/64/mmu-hash.h:584:6: error: "MAX_PHYSMEM_BITS" is not defined, evaluates to 0
#if (MAX_PHYSMEM_BITS > MAX_EA_BITS_PER_CONTEXT)
^~~~~~~~~~~~~~~~
Fix it by making sure we define MAX_PHYSMEM_BITS for all 64-bit
configs regardless of SPARSEMEM.
Fixes: 8bc086899816 ("powerpc/mm: Only define MAX_PHYSMEM_BITS in SPARSEMEM configurations")
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Badly-designed systems might have (for example) active-high wake pins
that default to high (e.g., because of external pull ups) until they
have an active firmware which starts driving it low. This can cause an
interrupt storm in the time between request_irq() and disable_irq().
We don't support shared interrupts here, so let's just pre-configure the
interrupt to avoid auto-enabling it.
Fixes: fd913ef7ce61 ("Bluetooth: btusb: Add out-of-band wakeup support")
Fixes: 5364a0b4f4be ("arm64: dts: rockchip: move QCA6174A wakeup pin into its USB node")
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add preprocessor macros for the important PRCI output clocks
that are needed by both the FU540 PRCI driver and DT data.
Details are available in the FU540 manual in Chapter 7 of
https://static.dev.sifive.com/FU540-C000-v1.0.pdf
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fixes from Paul Burton:
"A few minor MIPS fixes:
- Provide struct pt_regs * from get_irq_regs() to kgdb_nmicallback()
when handling an IPI triggered by kgdb_roundup_cpus(), matching the
behavior of other architectures & resolving kgdb issues for SMP
systems.
- Defer a pointer dereference until after a NULL check in the
irq_shutdown callback for SGI IP27 HUB interrupts.
- A defconfig update for the MSCC Ocelot to enable some necessary
drivers"
* tag 'mips_fixes_5.1_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: generic: Add switchdev, pinctrl and fit to ocelot_defconfig
MIPS: SGI-IP27: Fix use of unchecked pointer in shutdown_bridge_irq
MIPS: KGDB: fix kgdb support for SMP platforms.
|
|
Pull misc fixes from Al Viro:
"A few regression fixes from this cycle"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: use kmem_cache_free() instead of kfree()
iov_iter: Fix build error without CONFIG_CRYPTO
aio: Fix an error code in __io_submit_one()
|
|
Daniel Borkmann says:
====================
This series is a major rework of previously submitted libbpf
patches [0] in order to add global data support for BPF. The
kernel has been extended to add proper infrastructure that allows
for full .bss/.data/.rodata sections on BPF loader side based
upon feedback from LPC discussions [1]. Latter support is then
also added into libbpf in this series which allows for more
natural C-like programming of BPF programs. For more information
on loader, please refer to 'bpf, libbpf: support global data/bss/
rodata sections' patch in this series.
Thanks a lot!
v5 -> v6:
- Removed synchronize_rcu() from map freeze (Jann)
- Rest as-is
v4 -> v5:
- Removed index selection again for ldimm64 (Alexei)
- Adapted related test cases and added new ones to test
rejection of off != 0
v3 -> v4:
- Various fixes in BTF verification e.g. to disallow
Var and DataSec to be an intermediate type during resolve (Martin)
- More BTF test cases added
- Few cleanups in key-less BTF commit (Martin)
- Bump libbpf minor version from 2 to 3
- Renamed and simplified read-only locking
- Various minor improvements all over the place
v2 -> v3:
- Implement BTF support in kernel, libbpf, bpftool, add tests
- Fix idx + off conversion (Andrii)
- Document lower / higher bits for direct value access (Andrii)
- Add tests with small value size (Andrii)
- Add index selection into ldimm64 (Andrii)
- Fix missing fdput() (Jann)
- Reject invalid flags in BPF_F_*_PROG (Jakub)
- Complete rework of libbpf support, includes:
- Add objname to map name (Stanislav)
- Make .rodata map full read-only after setup (Andrii)
- Merge relocation handling into single one (Andrii)
- Store global maps into obj->maps array (Andrii, Alexei)
- Debug message when skipping section (Andrii)
- Reject non-static global data till we have
semantics for sharing them (Yonghong, Andrii, Alexei)
- More test cases and completely reworked prog test (Alexei)
- Fixes, cleanups, etc all over the set
- Not yet addressed:
- Make BTF mandatory for these maps (Alexei)
-> Waiting till BTF support for these lands first
v1 -> v2:
- Instead of 32-bit static data, implement full global
data support (Alexei)
[0] https://patchwork.ozlabs.org/cover/1040290/
[1] http://vger.kernel.org/lpc-bpf2018.html#session-3
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Extend test_btf with various positive and negative tests around
BTF verification of kind Var and DataSec. All passing as well:
# ./test_btf
[...]
BTF raw test[4] (global data test #1): OK
BTF raw test[5] (global data test #2): OK
BTF raw test[6] (global data test #3): OK
BTF raw test[7] (global data test #4, unsupported linkage): OK
BTF raw test[8] (global data test #5, invalid var type): OK
BTF raw test[9] (global data test #6, invalid var type (fwd type)): OK
BTF raw test[10] (global data test #7, invalid var type (fwd type)): OK
BTF raw test[11] (global data test #8, invalid var size): OK
BTF raw test[12] (global data test #9, invalid var size): OK
BTF raw test[13] (global data test #10, invalid var size): OK
BTF raw test[14] (global data test #11, multiple section members): OK
BTF raw test[15] (global data test #12, invalid offset): OK
BTF raw test[16] (global data test #13, invalid offset): OK
BTF raw test[17] (global data test #14, invalid offset): OK
BTF raw test[18] (global data test #15, not var kind): OK
BTF raw test[19] (global data test #16, invalid var referencing sec): OK
BTF raw test[20] (global data test #17, invalid var referencing var): OK
BTF raw test[21] (global data test #18, invalid var loop): OK
BTF raw test[22] (global data test #19, invalid var referencing var): OK
BTF raw test[23] (global data test #20, invalid ptr referencing var): OK
BTF raw test[24] (global data test #21, var included in struct): OK
BTF raw test[25] (global data test #22, array of var): OK
[...]
PASS:167 SKIP:0 FAIL:0
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add tests for libbpf relocation of static variable references
into the .data, .rodata and .bss sections of the ELF, also add
read-only test for .rodata. All passing:
# ./test_progs
[...]
test_global_data:PASS:load program 0 nsec
test_global_data:PASS:pass global data run 925 nsec
test_global_data_number:PASS:relocate .bss reference 925 nsec
test_global_data_number:PASS:relocate .data reference 925 nsec
test_global_data_number:PASS:relocate .rodata reference 925 nsec
test_global_data_number:PASS:relocate .bss reference 925 nsec
test_global_data_number:PASS:relocate .data reference 925 nsec
test_global_data_number:PASS:relocate .rodata reference 925 nsec
test_global_data_number:PASS:relocate .bss reference 925 nsec
test_global_data_number:PASS:relocate .bss reference 925 nsec
test_global_data_number:PASS:relocate .rodata reference 925 nsec
test_global_data_number:PASS:relocate .rodata reference 925 nsec
test_global_data_number:PASS:relocate .rodata reference 925 nsec
test_global_data_string:PASS:relocate .rodata reference 925 nsec
test_global_data_string:PASS:relocate .data reference 925 nsec
test_global_data_string:PASS:relocate .bss reference 925 nsec
test_global_data_string:PASS:relocate .data reference 925 nsec
test_global_data_string:PASS:relocate .bss reference 925 nsec
test_global_data_struct:PASS:relocate .rodata reference 925 nsec
test_global_data_struct:PASS:relocate .bss reference 925 nsec
test_global_data_struct:PASS:relocate .rodata reference 925 nsec
test_global_data_struct:PASS:relocate .data reference 925 nsec
test_global_data_rdonly:PASS:test .rodata read-only map 925 nsec
[...]
Summary: 229 PASSED, 0 FAILED
Note map helper signatures have been changed to avoid warnings
when passing in const data.
Joint work with Daniel Borkmann.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Extend test_verifier with various test cases around the two kernel
extensions, that is, {rd,wr}only map support as well as direct map
value access. All passing, one skipped due to xskmap not present
on test machine:
# ./test_verifier
[...]
#948/p XDP pkt read, pkt_meta' <= pkt_data, bad access 1 OK
#949/p XDP pkt read, pkt_meta' <= pkt_data, bad access 2 OK
#950/p XDP pkt read, pkt_data <= pkt_meta', good access OK
#951/p XDP pkt read, pkt_data <= pkt_meta', bad access 1 OK
#952/p XDP pkt read, pkt_data <= pkt_meta', bad access 2 OK
Summary: 1410 PASSED, 1 SKIPPED, 0 FAILED
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add the ability to bpftool to handle BTF Var and DataSec kinds
in order to dump them out of btf_dumper_type(). The value has a
single object with the section name, which itself holds an array
of variables it dumps. A single variable is an object by itself
printed along with its name. From there further type information
is dumped along with corresponding value information.
Example output from .rodata:
# ./bpftool m d i 150
[{
"value": {
".rodata": [{
"load_static_data.bar": 18446744073709551615
},{
"num2": 24
},{
"num5": 43947
},{
"num6": 171
},{
"str0": [97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,0,0,0,0,0,0
]
},{
"struct0": {
"a": 42,
"b": 4278120431,
"c": 1229782938247303441
}
},{
"struct2": {
"a": 0,
"b": 0,
"c": 0
}
}
]
}
}
]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This adds libbpf support for BTF Var and DataSec kinds. Main point
here is that libbpf needs to do some preparatory work before the
whole BTF object can be loaded into the kernel, that is, fixing up
of DataSec size taken from the ELF section size and non-static
variable offset which needs to be taken from the ELF's string section.
Upstream LLVM doesn't fix these up since at time of BTF emission
it is too early in the compilation process thus this information
isn't available yet, hence loader needs to take care of it.
Note, deduplication handling has not been in the scope of this work
and needs to be addressed in a future commit.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://reviews.llvm.org/D59441
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This work adds BPF loader support for global data sections
to libbpf. This allows to write BPF programs in more natural
C-like way by being able to define global variables and const
data.
Back at LPC 2018 [0] we presented a first prototype which
implemented support for global data sections by extending BPF
syscall where union bpf_attr would get additional memory/size
pair for each section passed during prog load in order to later
add this base address into the ldimm64 instruction along with
the user provided offset when accessing a variable. Consensus
from LPC was that for proper upstream support, it would be
more desirable to use maps instead of bpf_attr extension as
this would allow for introspection of these sections as well
as potential live updates of their content. This work follows
this path by taking the following steps from loader side:
1) In bpf_object__elf_collect() step we pick up ".data",
".rodata", and ".bss" section information.
2) If present, in bpf_object__init_internal_map() we add
maps to the obj's map array that corresponds to each
of the present sections. Given section size and access
properties can differ, a single entry array map is
created with value size that is corresponding to the
ELF section size of .data, .bss or .rodata. These
internal maps are integrated into the normal map
handling of libbpf such that when user traverses all
obj maps, they can be differentiated from user-created
ones via bpf_map__is_internal(). In later steps when
we actually create these maps in the kernel via
bpf_object__create_maps(), then for .data and .rodata
sections their content is copied into the map through
bpf_map_update_elem(). For .bss this is not necessary
since array map is already zero-initialized by default.
Additionally, for .rodata the map is frozen as read-only
after setup, such that neither from program nor syscall
side writes would be possible.
3) In bpf_program__collect_reloc() step, we record the
corresponding map, insn index, and relocation type for
the global data.
4) And last but not least in the actual relocation step in
bpf_program__relocate(), we mark the ldimm64 instruction
with src_reg = BPF_PSEUDO_MAP_VALUE where in the first
imm field the map's file descriptor is stored as similarly
done as in BPF_PSEUDO_MAP_FD, and in the second imm field
(as ldimm64 is 2-insn wide) we store the access offset
into the section. Given these maps have only single element
ldimm64's off remains zero in both parts.
5) On kernel side, this special marked BPF_PSEUDO_MAP_VALUE
load will then store the actual target address in order
to have a 'map-lookup'-free access. That is, the actual
map value base address + offset. The destination register
in the verifier will then be marked as PTR_TO_MAP_VALUE,
containing the fixed offset as reg->off and backing BPF
map as reg->map_ptr. Meaning, it's treated as any other
normal map value from verification side, only with
efficient, direct value access instead of actual call to
map lookup helper as in the typical case.
Currently, only support for static global variables has been
added, and libbpf rejects non-static global variables from
loading. This can be lifted until we have proper semantics
for how BPF will treat multi-object BPF loads. From BTF side,
libbpf will set the value type id of the types corresponding
to the ".bss", ".data" and ".rodata" names which LLVM will
emit without the object name prefix. The key type will be
left as zero, thus making use of the key-less BTF option in
array maps.
Simple example dump of program using globals vars in each
section:
# bpftool prog
[...]
6784: sched_cls name load_static_dat tag a7e1291567277844 gpl
loaded_at 2019-03-11T15:39:34+0000 uid 0
xlated 1776B jited 993B memlock 4096B map_ids 2238,2237,2235,2236,2239,2240
# bpftool map show id 2237
2237: array name test_glo.bss flags 0x0
key 4B value 64B max_entries 1 memlock 4096B
# bpftool map show id 2235
2235: array name test_glo.data flags 0x0
key 4B value 64B max_entries 1 memlock 4096B
# bpftool map show id 2236
2236: array name test_glo.rodata flags 0x80
key 4B value 96B max_entries 1 memlock 4096B
# bpftool prog dump xlated id 6784
int load_static_data(struct __sk_buff * skb):
; int load_static_data(struct __sk_buff *skb)
0: (b7) r6 = 0
; test_reloc(number, 0, &num0);
1: (63) *(u32 *)(r10 -4) = r6
2: (bf) r2 = r10
; int load_static_data(struct __sk_buff *skb)
3: (07) r2 += -4
; test_reloc(number, 0, &num0);
4: (18) r1 = map[id:2238]
6: (18) r3 = map[id:2237][0]+0 <-- direct addr in .bss area
8: (b7) r4 = 0
9: (85) call array_map_update_elem#100464
10: (b7) r1 = 1
; test_reloc(number, 1, &num1);
[...]
; test_reloc(string, 2, str2);
120: (18) r8 = map[id:2237][0]+16 <-- same here at offset +16
122: (18) r1 = map[id:2239]
124: (18) r3 = map[id:2237][0]+16
126: (b7) r4 = 0
127: (85) call array_map_update_elem#100464
128: (b7) r1 = 120
; str1[5] = 'x';
129: (73) *(u8 *)(r9 +5) = r1
; test_reloc(string, 3, str1);
130: (b7) r1 = 3
131: (63) *(u32 *)(r10 -4) = r1
132: (b7) r9 = 3
133: (bf) r2 = r10
; int load_static_data(struct __sk_buff *skb)
134: (07) r2 += -4
; test_reloc(string, 3, str1);
135: (18) r1 = map[id:2239]
137: (18) r3 = map[id:2235][0]+16 <-- direct addr in .data area
139: (b7) r4 = 0
140: (85) call array_map_update_elem#100464
141: (b7) r1 = 111
; __builtin_memcpy(&str2[2], "hello", sizeof("hello"));
142: (73) *(u8 *)(r8 +6) = r1 <-- further access based on .bss data
143: (b7) r1 = 108
144: (73) *(u8 *)(r8 +5) = r1
[...]
For Cilium use-case in particular, this enables migrating configuration
constants from Cilium daemon's generated header defines into global
data sections such that expensive runtime recompilations with LLVM can
be avoided altogether. Instead, the ELF file becomes effectively a
"template", meaning, it is compiled only once (!) and the Cilium daemon
will then rewrite relevant configuration data from the ELF's .data or
.rodata sections directly instead of recompiling the program. The
updated ELF is then loaded into the kernel and atomically replaces
the existing program in the networking datapath. More info in [0].
Based upon recent fix in LLVM, commit c0db6b6bd444 ("[BPF] Don't fail
for static variables").
[0] LPC 2018, BPF track, "ELF relocation for static data in BPF",
http://vger.kernel.org/lpc-bpf2018.html#session-3
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|