Age | Commit message (Collapse) | Author |
|
When shrink btree node cache from c->btree_cache in bch_mca_scan(),
no matter the selected node is reaped or not, it will be rotated from
the head to the tail of c->btree_cache list. But in bcache journal
code, when flushing the btree nodes with oldest journal entry, btree
nodes are iterated and slected from the tail of c->btree_cache list in
btree_flush_write(). The list_rotate_left() in bch_mca_scan() will
make btree_flush_write() iterate more nodes in c->btree_list in reverse
order.
This patch just reaps the selected btree node cache, and not move it
from the head to the tail of c->btree_cache list. Then bch_mca_scan()
will not mess up c->btree_cache list to btree_flush_write().
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In order to skip the most recently freed btree node cahce, currently
in bch_mca_scan() the first 3 caches in c->btree_cache_freeable list
are skipped when shrinking bcache node caches in bch_mca_scan(). The
related code in bch_mca_scan() is,
737 list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
738 if (nr <= 0)
739 goto out;
740
741 if (++i > 3 &&
742 !mca_reap(b, 0, false)) {
lines free cache memory
746 }
747 nr--;
748 }
The problem is, if virtual memory code calls bch_mca_scan() and
the calculated 'nr' is 1 or 2, then in the above loop, nothing will
be shunk. In such case, if slub/slab manager calls bch_mca_scan()
for many times with small scan number, it does not help to shrink
cache memory and just wasts CPU cycles.
This patch just selects btree node caches from tail of the
c->btree_cache_freeable list, then the newly freed host cache can
still be allocated by mca_alloc(), and at least 1 node can be shunk.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The member 'accessed' of struct btree is used in bch_mca_scan() when
shrinking btree node caches. The original idea is, if b->accessed is
set, clean it and look at next btree node cache from c->btree_cache
list, and only shrink the caches whose b->accessed is cleaned. Then
only cold btree node cache will be shrunk.
But when I/O pressure is high, it is very probably that b->accessed
of a btree node cache will be set again in bch_btree_node_get()
before bch_mca_scan() selects it again. Then there is no chance for
bch_mca_scan() to shrink enough memory back to slub or slab system.
This patch removes member accessed from struct btree, then once a
btree node ache is selected, it will be immediately shunk. By this
change, bch_mca_scan() may release btree node cahce more efficiently.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It's useful to dump written block and keys on btree write, this patch
add them into trace_bcache_btree_write.
Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
the commit 91be66e1318f ("bcache: performance improvement for
btree_flush_write()") was an effort to flushing btree node with oldest
btree node faster in following methods,
- Only iterate dirty btree nodes in c->btree_cache, avoid scanning a lot
of clean btree nodes.
- Take c->btree_cache as a LRU-like list, aggressively flushing all
dirty nodes from tail of c->btree_cache util the btree node with
oldest journal entry is flushed. This is to reduce the time of holding
c->bucket_lock.
Guoju Fang and Shuang Li reported that they observe unexptected extra
write I/Os on cache device after applying the above patch. Guoju Fang
provideed more detailed diagnose information that the aggressive
btree nodes flushing may cause 10x more btree nodes to flush in his
workload. He points out when system memory is large enough to hold all
btree nodes in memory, c->btree_cache is not a LRU-like list any more.
Then the btree node with oldest journal entry is very probably not-
close to the tail of c->btree_cache list. In such situation much more
dirty btree nodes will be aggressively flushed before the target node
is flushed. When slow SATA SSD is used as cache device, such over-
aggressive flushing behavior will cause performance regression.
After spending a lot of time on debug and diagnose, I find the real
condition is more complicated, aggressive flushing dirty btree nodes
from tail of c->btree_cache list is not a good solution.
- When all btree nodes are cached in memory, c->btree_cache is not
a LRU-like list, the btree nodes with oldest journal entry won't
be close to the tail of the list.
- There can be hundreds dirty btree nodes reference the oldest journal
entry, before flushing all the nodes the oldest journal entry cannot
be reclaimed.
When the above two conditions mixed together, a simply flushing from
tail of c->btree_cache list is really NOT a good idea.
Fortunately there is still chance to make btree_flush_write() work
better. Here is how this patch avoids unnecessary btree nodes flushing,
- Only acquire c->journal.lock when getting oldest journal entry of
fifo c->journal.pin. In rested locations check the journal entries
locklessly, so their values can be changed on other cores
in parallel.
- In loop list_for_each_entry_safe_reverse(), checking latest front
point of fifo c->journal.pin. If it is different from the original
point which we get with locking c->journal.lock, it means the oldest
journal entry is reclaim on other cores. At this moment, all selected
dirty nodes recorded in array btree_nodes[] are all flushed and clean
on other CPU cores, it is unncessary to iterate c->btree_cache any
longer. Just quit the list_for_each_entry_safe_reverse() loop and
the following for-loop will skip all the selected clean nodes.
- Find a proper time to quit the list_for_each_entry_safe_reverse()
loop. Check the refcount value of orignial fifo front point, if the
value is larger than selected node number of btree_nodes[], it means
more matching btree nodes should be scanned. Otherwise it means no
more matching btee nodes in rest of c->btree_cache list, the loop
can be quit. If the original oldest journal entry is reclaimed and
fifo front point is updated, the refcount of original fifo front point
will be 0, then the loop will be quit too.
- Not hold c->bucket_lock too long time. c->bucket_lock is also required
for space allocation for cached data, hold it for too long time will
block regular I/O requests. When iterating list c->btree_cache, even
there are a lot of maching btree nodes, in order to not holding
c->bucket_lock for too long time, only BTREE_FLUSH_NR nodes are
selected and to flush in following for-loop.
With this patch, only btree nodes referencing oldest journal entry
are flushed to cache device, no aggressive flushing for unnecessary
btree node any more. And in order to avoid blocking regluar I/O
requests, each time when btree_flush_write() called, at most only
BTREE_FLUSH_NR btree nodes are selected to flush, even there are more
maching btree nodes in list c->btree_cache.
At last, one more thing to explain: Why it is safe to read front point
of c->journal.pin without holding c->journal.lock inside the
list_for_each_entry_safe_reverse() loop ?
Here is my answer: When reading the front point of fifo c->journal.pin,
we don't need to know the exact value of front point, we just want to
check whether the value is different from the original front point
(which is accurate value because we get it while c->jouranl.lock is
held). For such purpose, it works as expected without holding
c->journal.lock. Even the front point is changed on other CPU core and
not updated to local core, and current iterating btree node has
identical journal entry local as original fetched fifo front point, it
is still safe. Because after holding mutex b->write_lock (with memory
barrier) this btree node can be found as clean and skipped, the loop
will quite latter when iterate on next node of list c->btree_cache.
Fixes: 91be66e1318f ("bcache: performance improvement for btree_flush_write()")
Reported-by: Guoju Fang <fangguoju@gmail.com>
Reported-by: Shuang Li <psymon@bonuscloud.io>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To explain the pages allocated from mempool state->pool can be
swapped in __btree_sort(), because state->pool is a page pool,
which allocates pages by alloc_pages() indeed.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The crc64_be() is declared in <linux/crc64.h> so include
this where the symbol is defined to avoid the following
warning:
lib/crc64.c:43:12: warning: symbol 'crc64_be' was not declared. Should it be static?
Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Avoid a pointless dependency on buffer heads in bcache by simply open
coding reading a single page. Also add a SB_OFFSET define for the
byte offset of the superblock instead of using magic numbers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This allows to properly build the superblock bio including the offset in
the page using the normal bio helpers. This fixes writing the superblock
for page sizes larger than 4k where the sb write bio would need an offset
in the bio_vec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Returning the properly typed actual data structure insteaf of the
containing struct page will save the callers some work going
forward.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Avoid an extra reference count roundtrip by transferring the sb_page
ownership to the lower level register helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The patch "bcache: rework error unwinding in register_bcache" introduces
a use-after-free regression in register_bcache(). Here are current code,
2510 out_free_path:
2511 kfree(path);
2512 out_module_put:
2513 module_put(THIS_MODULE);
2514 out:
2515 pr_info("error %s: %s", path, err);
2516 return ret;
If some error happens and the above code path is executed, at line 2511
path is released, but referenced at line 2515. Then KASAN reports a use-
after-free error message.
This patch changes line 2515 in the following way to fix the problem,
2515 pr_info("error %s: %s", path?path:"", err);
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Patch "bcache: rework error unwinding in register_bcache" from
Christoph Hellwig changes the local variables 'path' and 'err'
in undefined initial state. If the code in register_bcache() jumps
to label 'out:' or 'out_module_put:' by goto, these two variables
might be reference with undefined value by the following line,
out_module_put:
module_put(THIS_MODULE);
out:
pr_info("error %s: %s", path, err);
return ret;
Therefore this patch initializes these two local variables properly
in register_bcache() to avoid such issue.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Split the successful and error return path, and use one goto label for each
resource to unwind. This also fixes some small errors like leaking the
module reference count in the reboot case (which seems entirely harmless)
or printing the wrong warning messages for early failures.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Split out an on-disk version struct cache_sb with the proper endianness
annotations. This fixes a fair chunk of sparse warnings, but there are
some left due to the way the checksum is defined.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Same as cache device, the buffer page needs to be put while
freeing cached_dev. Otherwise a page would be leaked every
time a cached_dev is stopped.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In commit 9f79b78ef744 ("Convert filldir[64]() from __put_user() to
unsafe_put_user()") I changed filldir to not do individual __put_user()
accesses, but instead use unsafe_put_user() surrounded by the proper
user_access_begin/end() pair.
That make them enormously faster on modern x86, where the STAC/CLAC
games make individual user accesses fairly heavy-weight.
However, the user_access_begin() range was not really the exact right
one, since filldir() has the unfortunate problem that it needs to not
only fill out the new directory entry, it also needs to fix up the
previous one to contain the proper file offset.
It's unfortunate, but the "d_off" field in "struct dirent" is _not_ the
file offset of the directory entry itself - it's the offset of the next
one. So we end up backfilling the offset in the previous entry as we
walk along.
But since x86 didn't really care about the exact range, and used to be
the only architecture that did anything fancy in user_access_begin() to
begin with, the filldir[64]() changes did something lazy, and even
commented on it:
/*
* Note! This range-checks 'previous' (which may be NULL).
* The real range was checked in getdents
*/
if (!user_access_begin(dirent, sizeof(*dirent)))
goto efault;
and it all worked fine.
But now 32-bit ppc is starting to also implement user_access_begin(),
and the fact that we faked the range to only be the (possibly not even
valid) previous directory entry becomes a problem, because ppc32 will
actually be using the range that is passed in for more than just "check
that it's user space".
This is a complete rewrite of Christophe's original patch.
By saving off the record length of the previous entry instead of a
pointer to it in the filldir data structures, we can simplify the range
check and the writing of the previous entry d_off field. No need for
any conditionals in the user accesses themselves, although we retain the
conditional EINTR checking for the "was this the first directory entry"
signal handling latency logic.
Fixes: 9f79b78ef744 ("Convert filldir[64]() from __put_user() to unsafe_put_user()")
Link: https://lore.kernel.org/lkml/a02d3426f93f7eb04960a4d9140902d278cab0bb.1579697910.git.christophe.leroy@c-s.fr/
Link: https://lore.kernel.org/lkml/408c90c4068b00ea8f1c41cca45b84ec23d4946b.1579783936.git.christophe.leroy@c-s.fr/
Reported-and-tested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 8a23eb804ca4 ("Make filldir[64]() verify the directory entry
filename is valid") added some minimal validity checks on the directory
entries passed to filldir[64](). But they really were pretty minimal.
This fleshes out at least the name length check: we used to disallow
zero-length names, but really, negative lengths or oevr-long names
aren't ok either. Both could happen if there is some filesystem
corruption going on.
Now, most filesystems tend to use just an "unsigned char" or similar for
the length of a directory entry name, so even with a corrupt filesystem
you should never see anything odd like that. But since we then use the
name length to create the directory entry record length, let's make sure
it actually is half-way sensible.
Note how POSIX states that the size of a path component is limited by
NAME_MAX, but we actually use PATH_MAX for the check here. That's
because while NAME_MAX is generally the correct maximum name length
(it's 255, for the same old "name length is usually just a byte on
disk"), there's nothing in the VFS layer that really cares.
So the real limitation at a VFS layer is the total pathname length you
can pass as a filename: PATH_MAX.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Should work properly with the latest sbios on 5.5 and newer
kernels.
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Set d0 and d1 pin directions for spi0 and spi1 as per their pinmux.
Signed-off-by: Raag Jadav <raagjadav@gmail.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
|
|
|
|
When submitting v2 of "fou: Support binding FoU socket" (1713cb37bf67),
I accidentally sent the wrong version of the patch and one fix was
missing. In the initial version of the patch, as well as the version 2
that I submitted, I incorrectly used ".type" for the two V6-attributes.
The correct is to use ".len".
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 1713cb37bf67 ("fou: Support binding FoU socket")
Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers
Kalle Valo says:
====================
wireless-drivers fixes for v5.5
Second set of fixes for v5.5. There are quite a few patches,
especially on iwlwifi, due to me being on a long break. Libertas also
has a security fix and mt76 a build fix.
iwlwifi
* don't send the PPAG command when PPAG is disabled, since it can cause problems
* a few fixes for a HW bug
* a fix for RS offload;
* a fix for 3168 devices where the NVM tables where the wrong tables were being read
* fix a couple of potential memory leaks in TXQ code
* disable L0S states in all hardware since our hardware doesn't
officially support them anymore (and older versions of the hardware
had instability in these states)
* remove lar_disable parameter since it has been causing issues for
some people who erroneously disable it
* force the debug monitor HW to stop also when debug is disabled,
since it sometimes stays on and prevents low system power states
* don't send IWL_MVM_RXQ_NSSN_SYNC notification due to DMA problems
libertas
* fix two buffer overflows
mt76
* build fix related to CONFIG_MT76_LEDS
* fix off by one in bitrates handling
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
Add NPCM Peripheral SPI reset binding documentation,
Removing unnecessary aliases use.
Signed-off-by: Tomer Maimon <tmaimon77@gmail.com>
Acked-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20200115162301.235926-4-tmaimon77@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
There are spelling mistakes in dev_err messages. Fix them.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
Link: https://lore.kernel.org/r/20200122093818.2800743-1-colin.king@canonical.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
There is a spelling mistake in a dev_err message. Fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20200122235237.2830344-1-colin.king@canonical.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
If both IFF_NAPI_FRAGS mode and XDP are enabled, and the XDP program
consumes the skb, we need to clear the napi.skb (or risk
a use-after-free) and release the mutex (or risk a deadlock)
WARNING: lock held when returning to user space!
5.5.0-rc6-syzkaller #0 Not tainted
------------------------------------------------
syz-executor.0/455 is leaving the kernel with locks still held!
1 lock held by syz-executor.0/455:
#0: ffff888098f6e748 (&tfile->napi_mutex){+.+.}, at: tun_get_user+0x1604/0x3fc0 drivers/net/tun.c:1835
Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Petar Penkov <ppenkov@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
During reload (or module unload), the router block is de-initialized.
Among other things, this results in the removal of a default multicast
route from each active virtual router (VRF). These default routes are
configured during initialization to trap packets to the CPU. In
Spectrum-2, unlike Spectrum-1, multicast routes are implemented using
ACL rules.
Since the router block is de-initialized before the ACL block, it is
possible that the ACL rules corresponding to the default routes are
deleted while being accessed by the ACL delayed work that queries rules'
activity from the device. This can result in a rare use-after-free [1].
Fix this by protecting the rules list accessed by the delayed work with
a lock. We cannot use a spinlock as the activity read operation is
blocking.
[1]
[ 123.331662] ==================================================================
[ 123.339920] BUG: KASAN: use-after-free in mlxsw_sp_acl_rule_activity_update_work+0x330/0x3b0
[ 123.349381] Read of size 8 at addr ffff8881f3bb4520 by task kworker/0:2/78
[ 123.357080]
[ 123.358773] CPU: 0 PID: 78 Comm: kworker/0:2 Not tainted 5.5.0-rc5-custom-33108-gf5df95d3ef41 #2209
[ 123.368898] Hardware name: Mellanox Technologies Ltd. MSN3700C/VMOD0008, BIOS 5.11 10/10/2018
[ 123.378456] Workqueue: mlxsw_core mlxsw_sp_acl_rule_activity_update_work
[ 123.385970] Call Trace:
[ 123.388734] dump_stack+0xc6/0x11e
[ 123.392568] print_address_description.constprop.4+0x21/0x340
[ 123.403236] __kasan_report.cold.8+0x76/0xb1
[ 123.414884] kasan_report+0xe/0x20
[ 123.418716] mlxsw_sp_acl_rule_activity_update_work+0x330/0x3b0
[ 123.444034] process_one_work+0xb06/0x19a0
[ 123.453731] worker_thread+0x91/0xe90
[ 123.467348] kthread+0x348/0x410
[ 123.476847] ret_from_fork+0x24/0x30
[ 123.480863]
[ 123.482545] Allocated by task 73:
[ 123.486273] save_stack+0x19/0x80
[ 123.490000] __kasan_kmalloc.constprop.6+0xc1/0xd0
[ 123.495379] mlxsw_sp_acl_rule_create+0xa7/0x230
[ 123.500566] mlxsw_sp2_mr_tcam_route_create+0xf6/0x3e0
[ 123.506334] mlxsw_sp_mr_tcam_route_create+0x5b4/0x820
[ 123.512102] mlxsw_sp_mr_table_create+0x3b5/0x690
[ 123.517389] mlxsw_sp_vr_get+0x289/0x4d0
[ 123.521797] mlxsw_sp_fib_node_get+0xa2/0x990
[ 123.526692] mlxsw_sp_router_fib4_event_work+0x54c/0x2d60
[ 123.532752] process_one_work+0xb06/0x19a0
[ 123.537352] worker_thread+0x91/0xe90
[ 123.541471] kthread+0x348/0x410
[ 123.545103] ret_from_fork+0x24/0x30
[ 123.549113]
[ 123.550795] Freed by task 518:
[ 123.554231] save_stack+0x19/0x80
[ 123.557958] __kasan_slab_free+0x125/0x170
[ 123.562556] kfree+0xd7/0x3a0
[ 123.565895] mlxsw_sp_acl_rule_destroy+0x63/0xd0
[ 123.571081] mlxsw_sp2_mr_tcam_route_destroy+0xd5/0x130
[ 123.576946] mlxsw_sp_mr_tcam_route_destroy+0xba/0x260
[ 123.582714] mlxsw_sp_mr_table_destroy+0x1ab/0x290
[ 123.588091] mlxsw_sp_vr_put+0x1db/0x350
[ 123.592496] mlxsw_sp_fib_node_put+0x298/0x4c0
[ 123.597486] mlxsw_sp_vr_fib_flush+0x15b/0x360
[ 123.602476] mlxsw_sp_router_fib_flush+0xba/0x470
[ 123.607756] mlxsw_sp_vrs_fini+0xaa/0x120
[ 123.612260] mlxsw_sp_router_fini+0x137/0x384
[ 123.617152] mlxsw_sp_fini+0x30a/0x4a0
[ 123.621374] mlxsw_core_bus_device_unregister+0x159/0x600
[ 123.627435] mlxsw_devlink_core_bus_device_reload_down+0x7e/0xb0
[ 123.634176] devlink_reload+0xb4/0x380
[ 123.638391] devlink_nl_cmd_reload+0x610/0x700
[ 123.643382] genl_rcv_msg+0x6a8/0xdc0
[ 123.647497] netlink_rcv_skb+0x134/0x3a0
[ 123.651904] genl_rcv+0x29/0x40
[ 123.655436] netlink_unicast+0x4d4/0x700
[ 123.659843] netlink_sendmsg+0x7c0/0xc70
[ 123.664251] __sys_sendto+0x265/0x3c0
[ 123.668367] __x64_sys_sendto+0xe2/0x1b0
[ 123.672773] do_syscall_64+0xa0/0x530
[ 123.676892] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 123.682552]
[ 123.684238] The buggy address belongs to the object at ffff8881f3bb4500
[ 123.684238] which belongs to the cache kmalloc-128 of size 128
[ 123.698261] The buggy address is located 32 bytes inside of
[ 123.698261] 128-byte region [ffff8881f3bb4500, ffff8881f3bb4580)
[ 123.711303] The buggy address belongs to the page:
[ 123.716682] page:ffffea0007ceed00 refcount:1 mapcount:0 mapping:ffff888236403500 index:0x0
[ 123.725958] raw: 0200000000000200 dead000000000100 dead000000000122 ffff888236403500
[ 123.734646] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[ 123.743315] page dumped because: kasan: bad access detected
[ 123.749562]
[ 123.751241] Memory state around the buggy address:
[ 123.756620] ffff8881f3bb4400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 123.764716] ffff8881f3bb4480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 123.772812] >ffff8881f3bb4500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 123.780904] ^
[ 123.785697] ffff8881f3bb4580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 123.793793] ffff8881f3bb4600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 123.801883] ==================================================================
Fixes: cf7221a4f5a5 ("mlxsw: spectrum_router: Add Multicast routing support for Spectrum-2")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 0034d395f89d ("powerpc/mm/hash64: Map all the kernel regions in
the same 0xc range") has a bug in the definition of MIN_USER_CONTEXT.
The result is that the context id used for the vmemmap and the lowest
context id handed out to userspace are the same. The context id is
essentially the process identifier as far as the first stage of the
MMU translation is concerned.
This can result in multiple SLB entries with the same VSID (Virtual
Segment ID), accessible to the kernel and some random userspace
process that happens to get the overlapping id, which is not expected
eg:
07 c00c000008000000 40066bdea7000500 1T ESID= c00c00 VSID= 66bdea7 LLP:100
12 0002000008000000 40066bdea7000d80 1T ESID= 200 VSID= 66bdea7 LLP:100
Even though the user process and the kernel use the same VSID, the
permissions in the hash page table prevent the user process from
reading or writing to any kernel mappings.
It can also lead to SLB entries with different base page size
encodings (LLP), eg:
05 c00c000008000000 00006bde0053b500 256M ESID=c00c00000 VSID= 6bde0053b LLP:100
09 0000000008000000 00006bde0053bc80 256M ESID= 0 VSID= 6bde0053b LLP: 0
Such SLB entries can result in machine checks, eg. as seen on a G5:
Oops: Machine check, sig: 7 [#1]
BE PAGE SIZE=64K MU-Hash SMP NR_CPUS=4 NUMA Power Mac
NIP: c00000000026f248 LR: c000000000295e58 CTR: 0000000000000000
REGS: c0000000erfd3d70 TRAP: 0200 Tainted: G M (5.5.0-rcl-gcc-8.2.0-00010-g228b667d8ea1)
MSR: 9000000000109032 <SF,HV,EE,ME,IR,DR,RI> CR: 24282048 XER: 00000000
DAR: c00c000000612c80 DSISR: 00000400 IRQMASK: 0
...
NIP [c00000000026f248] .kmem_cache_free+0x58/0x140
LR [c088000008295e58] .putname 8x88/0xa
Call Trace:
.putname+0xB8/0xa
.filename_lookup.part.76+0xbe/0x160
.do_faccessat+0xe0/0x380
system_call+0x5c/ex68
This happens with 256MB segments and 64K pages, as the duplicate VSID
is hit with the first vmemmap segment and the first user segment, and
older 32-bit userspace maps things in the first user segment.
On other CPUs a machine check is not seen. Instead the userspace
process can get stuck continuously faulting, with the fault never
properly serviced, due to the kernel not understanding that there is
already a HPTE for the address but with inaccessible permissions.
On machines with 1T segments we've not seen the bug hit other than by
deliberately exercising it. That seems to be just a matter of luck
though, due to the typical layout of the user virtual address space
and the ranges of vmemmap that are typically populated.
To fix it we add 2 to MIN_USER_CONTEXT. This ensures the lowest
context given to userspace doesn't overlap with the VMEMMAP context,
or with the context for INVALID_REGION_ID.
Fixes: 0034d395f89d ("powerpc/mm/hash64: Map all the kernel regions in the same 0xc range")
Cc: stable@vger.kernel.org # v5.2+
Reported-by: Christian Marillat <marillat@debian.org>
Reported-by: Romain Dolbeau <romain@dolbeau.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Account for INVALID_REGION_ID, mostly rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200123102547.11623-1-mpe@ellerman.id.au
|
|
Hayes Wang says:
====================
r8152: serial fixes
v3:
1. Fix the typos for patch #5 and #6.
2. Modify the commit message of patch #9.
v2:
For patch #2, move declaring the variable "ocp_data".
v1:
These patches are used to fix some issues for RTL8153.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When enabling this, the device would wait an internal signal which
wouldn't be triggered. Then, the device couldn't enter P3 mode, so
the power consumption is increased.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Avoid the MCU to clear the lanwake after suspending. It may cause the
WOL fail. Disable LANWAKE_CLR_EN before suspending. Besides,enable it
and reset the lanwake status when resuming or initializing.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
For certain platforms, it causes USB reset periodically.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
For RTL8153B with QFN32, disable test IO. Otherwise, it may cause
abnormal behavior for the device randomly.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
PLA MCU clock speed down could only be enabled when tx/rx are disabled.
Otherwise, the packet loss may occur.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Enable U2P3 may miss zero packet for bulk-in.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Initailization would reset runtime suspend by tp->saved_wolopts, so
the tp->saved_wolopts should be set before initializing.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When linking ON, the patch of flow control has to be reset. This
makes sure the patch works normally.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix the runtime resume doesn't work normally for linking change.
1. Reset the settings and status of runtime suspend.
2. Sync the linking status.
3. Poll the linking change.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A malicious user could use RAW sockets and fool
GTP using them as standard SOCK_DGRAM UDP sockets.
BUG: KMSAN: uninit-value in udp_tunnel_encap_enable include/net/udp_tunnel.h:174 [inline]
BUG: KMSAN: uninit-value in setup_udp_tunnel_sock+0x45e/0x6f0 net/ipv4/udp_tunnel.c:85
CPU: 0 PID: 11262 Comm: syz-executor613 Not tainted 5.5.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1c9/0x220 lib/dump_stack.c:118
kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
__msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
udp_tunnel_encap_enable include/net/udp_tunnel.h:174 [inline]
setup_udp_tunnel_sock+0x45e/0x6f0 net/ipv4/udp_tunnel.c:85
gtp_encap_enable_socket+0x37f/0x5a0 drivers/net/gtp.c:827
gtp_encap_enable drivers/net/gtp.c:844 [inline]
gtp_newlink+0xfb/0x1e50 drivers/net/gtp.c:666
__rtnl_newlink net/core/rtnetlink.c:3305 [inline]
rtnl_newlink+0x2973/0x3920 net/core/rtnetlink.c:3363
rtnetlink_rcv_msg+0x1153/0x1570 net/core/rtnetlink.c:5424
netlink_rcv_skb+0x451/0x650 net/netlink/af_netlink.c:2477
rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5442
netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
netlink_unicast+0xf9e/0x1100 net/netlink/af_netlink.c:1328
netlink_sendmsg+0x1248/0x14d0 net/netlink/af_netlink.c:1917
sock_sendmsg_nosec net/socket.c:639 [inline]
sock_sendmsg net/socket.c:659 [inline]
____sys_sendmsg+0x12b6/0x1350 net/socket.c:2330
___sys_sendmsg net/socket.c:2384 [inline]
__sys_sendmsg+0x451/0x5f0 net/socket.c:2417
__do_sys_sendmsg net/socket.c:2426 [inline]
__se_sys_sendmsg+0x97/0xb0 net/socket.c:2424
__x64_sys_sendmsg+0x4a/0x70 net/socket.c:2424
do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:296
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x441359
Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fff1cd0ac28 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441359
RDX: 0000000000000000 RSI: 0000000020000100 RDI: 0000000000000003
RBP: 00000000006cb018 R08: 00000000004002c8 R09: 00000000004002c8
R10: 00000000004002c8 R11: 0000000000000246 R12: 00000000004020d0
R13: 0000000000402160 R14: 0000000000000000 R15: 0000000000000000
Uninit was created at:
kmsan_save_stack_with_flags+0x3c/0x90 mm/kmsan/kmsan.c:144
kmsan_internal_alloc_meta_for_pages mm/kmsan/kmsan_shadow.c:307 [inline]
kmsan_alloc_page+0x12a/0x310 mm/kmsan/kmsan_shadow.c:336
__alloc_pages_nodemask+0x57f2/0x5f60 mm/page_alloc.c:4800
alloc_pages_current+0x67d/0x990 mm/mempolicy.c:2207
alloc_pages include/linux/gfp.h:534 [inline]
alloc_slab_page+0x111/0x12f0 mm/slub.c:1511
allocate_slab mm/slub.c:1656 [inline]
new_slab+0x2bc/0x1130 mm/slub.c:1722
new_slab_objects mm/slub.c:2473 [inline]
___slab_alloc+0x1533/0x1f30 mm/slub.c:2624
__slab_alloc mm/slub.c:2664 [inline]
slab_alloc_node mm/slub.c:2738 [inline]
slab_alloc mm/slub.c:2783 [inline]
kmem_cache_alloc+0xb23/0xd70 mm/slub.c:2788
sk_prot_alloc+0xf2/0x620 net/core/sock.c:1597
sk_alloc+0xf0/0xbe0 net/core/sock.c:1657
inet_create+0x7c7/0x1370 net/ipv4/af_inet.c:321
__sock_create+0x8eb/0xf00 net/socket.c:1420
sock_create net/socket.c:1471 [inline]
__sys_socket+0x1a1/0x600 net/socket.c:1513
__do_sys_socket net/socket.c:1522 [inline]
__se_sys_socket+0x8d/0xb0 net/socket.c:1520
__x64_sys_socket+0x4a/0x70 net/socket.c:1520
do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:296
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 459aa660eb1d ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira <pablo@netfilter.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
rtnl_create_link() needs to apply dev->min_mtu and dev->max_mtu
checks that we apply in do_setlink()
Otherwise malicious users can crash the kernel, for example after
an integer overflow :
BUG: KASAN: use-after-free in memset include/linux/string.h:365 [inline]
BUG: KASAN: use-after-free in __alloc_skb+0x37b/0x5e0 net/core/skbuff.c:238
Write of size 32 at addr ffff88819f20b9c0 by task swapper/0/0
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x197/0x210 lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
__kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
kasan_report+0x12/0x20 mm/kasan/common.c:639
check_memory_region_inline mm/kasan/generic.c:185 [inline]
check_memory_region+0x134/0x1a0 mm/kasan/generic.c:192
memset+0x24/0x40 mm/kasan/common.c:108
memset include/linux/string.h:365 [inline]
__alloc_skb+0x37b/0x5e0 net/core/skbuff.c:238
alloc_skb include/linux/skbuff.h:1049 [inline]
alloc_skb_with_frags+0x93/0x590 net/core/skbuff.c:5664
sock_alloc_send_pskb+0x7ad/0x920 net/core/sock.c:2242
sock_alloc_send_skb+0x32/0x40 net/core/sock.c:2259
mld_newpack+0x1d7/0x7f0 net/ipv6/mcast.c:1609
add_grhead.isra.0+0x299/0x370 net/ipv6/mcast.c:1713
add_grec+0x7db/0x10b0 net/ipv6/mcast.c:1844
mld_send_cr net/ipv6/mcast.c:1970 [inline]
mld_ifc_timer_expire+0x3d3/0x950 net/ipv6/mcast.c:2477
call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1404
expire_timers kernel/time/timer.c:1449 [inline]
__run_timers kernel/time/timer.c:1773 [inline]
__run_timers kernel/time/timer.c:1740 [inline]
run_timer_softirq+0x6c3/0x1790 kernel/time/timer.c:1786
__do_softirq+0x262/0x98c kernel/softirq.c:292
invoke_softirq kernel/softirq.c:373 [inline]
irq_exit+0x19b/0x1e0 kernel/softirq.c:413
exiting_irq arch/x86/include/asm/apic.h:536 [inline]
smp_apic_timer_interrupt+0x1a3/0x610 arch/x86/kernel/apic/apic.c:1137
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829
</IRQ>
RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
Code: 98 6b ea f9 eb 8a cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 44 1c 60 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 34 1c 60 00 fb f4 <c3> cc 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 4e 5d 9a f9 e8 79
RSP: 0018:ffffffff89807ce8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff13266ae RBX: ffffffff8987a1c0 RCX: 0000000000000000
RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffffffff8987aa54
RBP: ffffffff89807d18 R08: ffffffff8987a1c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
R13: ffffffff8a799980 R14: 0000000000000000 R15: 0000000000000000
arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:690
default_idle_call+0x84/0xb0 kernel/sched/idle.c:94
cpuidle_idle_call kernel/sched/idle.c:154 [inline]
do_idle+0x3c8/0x6e0 kernel/sched/idle.c:269
cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:361
rest_init+0x23b/0x371 init/main.c:451
arch_call_rest_init+0xe/0x1b
start_kernel+0x904/0x943 init/main.c:784
x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:490
x86_64_start_kernel+0x77/0x7b arch/x86/kernel/head64.c:471
secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:242
The buggy address belongs to the page:
page:ffffea00067c82c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
raw: 057ffe0000000000 ffffea00067c82c8 ffffea00067c82c8 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88819f20b880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88819f20b900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88819f20b980: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88819f20ba00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88819f20ba80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Fixes: 61e84623ace3 ("net: centralize net_device min/max MTU checking")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The driver for Cisco Aironet 4500 and 4800 series cards (airo.c),
implements AIROOLDIOCTL/SIOCDEVPRIVATE in airo_ioctl().
The ioctl handler copies an aironet_ioctl struct from userspace, which
includes a command. Some of the commands are handled in readrids(),
where the user controlled command is converted into a driver-internal
value called "ridcode".
There are two command values, AIROGWEPKTMP and AIROGWEPKNV, which
correspond to ridcode values of RID_WEP_TEMP and RID_WEP_PERM
respectively. These commands both have checks that the user has
CAP_NET_ADMIN, with the comment that "Only super-user can read WEP
keys", otherwise they return -EPERM.
However there is another command value, AIRORRID, that lets the user
specify the ridcode value directly, with no other checks. This means
the user can bypass the CAP_NET_ADMIN check on AIROGWEPKTMP and
AIROGWEPKNV.
Fix it by moving the CAP_NET_ADMIN check out of the command handling
and instead do it later based on the ridcode. That way regardless of
whether the ridcode is set via AIROGWEPKTMP or AIROGWEPKNV, or passed
in using AIRORID, we always do the CAP_NET_ADMIN check.
Found by Ilja by code inspection, not tested as I don't have the
required hardware.
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The driver for Cisco Aironet 4500 and 4800 series cards (airo.c),
implements AIROOLDIOCTL/SIOCDEVPRIVATE in airo_ioctl().
The ioctl handler copies an aironet_ioctl struct from userspace, which
includes a command and a length. Some of the commands are handled in
readrids(), which kmalloc()'s a buffer of RIDSIZE (2048) bytes.
That buffer is then passed to PC4500_readrid(), which has two cases.
The else case does some setup and then reads up to RIDSIZE bytes from
the hardware into the kmalloc()'ed buffer.
Here len == RIDSIZE, pBuf is the kmalloc()'ed buffer:
// read the rid length field
bap_read(ai, pBuf, 2, BAP1);
// length for remaining part of rid
len = min(len, (int)le16_to_cpu(*(__le16*)pBuf)) - 2;
...
// read remainder of the rid
rc = bap_read(ai, ((__le16*)pBuf)+1, len, BAP1);
PC4500_readrid() then returns to readrids() which does:
len = comp->len;
if (copy_to_user(comp->data, iobuf, min(len, (int)RIDSIZE))) {
Where comp->len is the user controlled length field.
So if the "rid length field" returned by the hardware is < 2048, and
the user requests 2048 bytes in comp->len, we will leak the previous
contents of the kmalloc()'ed buffer to userspace.
Fix it by kzalloc()'ing the buffer.
Found by Ilja by code inspection, not tested as I don't have the
required hardware.
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The optee driver uses specific page table types to verify if a memory
region is normal. These types are not defined in nommu systems. Trying
to compile the driver in these systems results in a build error:
linux/drivers/tee/optee/call.c: In function ‘is_normal_memory’:
linux/drivers/tee/optee/call.c:533:26: error: ‘L_PTE_MT_MASK’ undeclared
(first use in this function); did you mean ‘PREEMPT_MASK’?
return (pgprot_val(p) & L_PTE_MT_MASK) == L_PTE_MT_WRITEALLOC;
^~~~~~~~~~~~~
PREEMPT_MASK
linux/drivers/tee/optee/call.c:533:26: note: each undeclared identifier is
reported only once for each function it appears in
linux/drivers/tee/optee/call.c:533:44: error: ‘L_PTE_MT_WRITEALLOC’ undeclared
(first use in this function)
return (pgprot_val(p) & L_PTE_MT_MASK) == L_PTE_MT_WRITEALLOC;
^~~~~~~~~~~~~~~~~~~
Make the optee driver depend on MMU to fix the compilation issue.
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
[jw: update commit title]
Signed-off-by: Jens Wiklander <jens.wiklander@linaro.org>
|
|
phylink and phylib are interconnected. It makes sense for phylib and
phy driver patches to be also reviewed by the phylink maintainer.
So add Russell King as a designed reviewer of phylib.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|