btrfs_zone_activate() checks if block_group->alloc_offset ==
block_group->zone_capacity every time it iterates the loop. However,
the check does not depend on the loop index, so move it out of the loop
and do it only once.
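For illustration, the pattern is roughly this (a simplified sketch, not
the exact driver code):

  /* Before: the loop-invariant condition is re-checked per stripe */
  for (i = 0; i < map->num_stripes; i++) {
      if (block_group->alloc_offset == block_group->zone_capacity)
          return true;
      /* per-stripe activation work */
  }

  /* After: check the invariant once, before the loop */
  if (block_group->alloc_offset == block_group->zone_capacity)
      return true;
  for (i = 0; i < map->num_stripes; i++) {
      /* per-stripe activation work */
  }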
Fixes: f9a912a3c45f ("btrfs: zoned: make zone activation multi stripe capable")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
For a 4K sector size btrfs with the v1 cache enabled that has only ever
been mounted on systems with 4K page size, mounting it on a subpage
(64K page size) system can cause the following warnings on the v1 space
cache:
BTRFS error (device dm-1): csum mismatch on free space cache
BTRFS warning (device dm-1): failed to load free space cache for block group 84082688, rebuilding it now
Although this is not a big deal, since the kernel can rebuild the cache
without problems, the warning will bother end users, especially if they
want to move the same btrfs seamlessly between systems with different
page sizes.
[CAUSE]
The v1 free space cache still uses a fixed PAGE_SIZE for various bitmap
parameters, like BITS_PER_BITMAP.
Such hard-coded PAGE_SIZE usage causes various mismatches, from the v1
cache size to the checksum.
Thus the kernel always rejects a v1 cache written with a different
PAGE_SIZE, reporting a csum mismatch.
[FIX]
Although we should fix the v1 cache, it is already going to be marked
deprecated soon.
And we have the v2 cache, based on metadata, which is already fully
subpage compatible and superior to the v1 cache in almost every way.
So just force subpage mounts to use the v2 cache.
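A rough sketch of the idea at mount option handling time (the exact
placement and condition in the patch may differ):

  /* v1 space cache hard-codes PAGE_SIZE; on subpage systems force the
   * metadata based v2 cache (free space tree) instead */
  if (fs_info->sectorsize < PAGE_SIZE)
      btrfs_set_opt(fs_info->mount_opt, FREE_SPACE_TREE);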
Reported-by: Matt Corallo <blnxfsl@bluematt.me>
CC: stable@vger.kernel.org # 5.15+
Link: https://lore.kernel.org/linux-btrfs/61aa27d1-30fc-c1a9-f0f4-9df544395ec3@bluematt.me/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are 3 places where the cpu and node masks of the top cpuset can
be initialized in the order they are executed:
1) start_kernel -> cpuset_init()
2) start_kernel -> cgroup_init() -> cpuset_bind()
3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()
The first cpuset_init() call just sets all the bits in the masks.
The second cpuset_bind() call sets cpus_allowed and mems_allowed to the
default v2 values. The third cpuset_init_smp() call sets them back to
v1 values.
For systems with cgroup v2 setup, cpuset_bind() is called once. As a
result, cpu and memory node hot add may fail to update the cpu and node
masks of the top cpuset to include the newly added cpu or node in a
cgroup v2 environment.
For systems with cgroup v1 setup, cpuset_bind() is called again by
rebind_subsystems() when the v1 cpuset filesystem is mounted as shown
in the dmesg log below with an instrumented kernel.
[ 2.609781] cpuset_bind() called - v2 = 1
[ 3.079473] cpuset_init_smp() called
[ 7.103710] cpuset_bind() called - v2 = 0
smp_init() is called after the first two init functions. So we don't
have a complete list of active cpus and memory nodes until later in
cpuset_init_smp() which is the right time to set up effective_cpus
and effective_mems.
To fix this cgroup v2 mask setup problem, the potentially incorrect
cpus_allowed and mems_allowed settings in cpuset_init_smp() are removed.
For cgroup v2 systems, the initial cpuset_bind() call will set the masks
correctly. For cgroup v1 systems, the second call to cpuset_bind()
will do the right setup.
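A sketch of cpuset_init_smp() after the fix (remaining hotplug setup
omitted); only the effective masks are touched here:

  void __init cpuset_init_smp(void)
  {
      /* cpus_allowed/mems_allowed keep the values from cpuset_bind() */
      cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
      top_cpuset.effective_mems = node_states[N_MEMORY];
      /* remaining hotplug initialization unchanged */
  }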
cc: stable@vger.kernel.org
Signed-off-by: Waiman Long <longman@redhat.com>
Tested-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The ice_for_each_vf macros have comments describing the implementation.
One of the argument descriptions ends with a period, which is not our
typical style. Remove the unnecessary period.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
This function definition was missing a comment describing its
implementation. Add one.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The comment explaining ice_reset_vf has an extraneous "the" in "if the
resets are disabled". Remove it.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Since commit fe99d1c06c16 ("ice: make ice_reset_all_vfs void"), the
ice_reset_all_vfs function has not returned anything. The function comment
still indicated it did. Fix this.
While here, also add a line to clarify that the function resets all VFs at once
in response to hardware resets such as a PF reset.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The ice_get_vf_vsi function can return NULL in some cases, such as if
handling messages during a reset where the VSI is being removed and
recreated.
Several places throughout the driver do not bother to check whether this
VSI pointer is valid. Static analysis tools may report issues because
they detect paths where a potentially NULL pointer could be dereferenced.
Fix this by checking the return value of ice_get_vf_vsi everywhere.
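The fix follows the usual pattern, e.g. (sketch; the exact error
handling depends on the call site):

  struct ice_vsi *vsi = ice_get_vf_vsi(vf);

  if (!vsi)
      return -EINVAL;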
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The debug print in ice_vf_fdir_dump_info does not end in a newline. This
can look confusing when reading the kernel log, as the next print will
immediately continue on the same line.
Fix this by adding the forgotten newline.
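For example (illustrative message text only):

  /* before: the next print continues on the same line */
  dev_dbg(dev, "FDIR dump info");
  /* after */
  dev_dbg(dev, "FDIR dump info\n");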
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The switch id should be the same for each netdevice on a device. The
id must be unique between devices on the same system, but does not need
to be unique between devices on different systems.
The switch id is used to locate ports on a switch and to know whether
aggregated ports belong to the same switch.
To meet these requirements, use pci_get_dsn() as the switch id value,
as it is unique for each device on the same system.
Implementing the switch id is needed by automatic tooling for
Kubernetes.
Set the switch id by filling the devlink port attributes and calling
devlink_port_attrs_set() while creating the PF (uplink) and VF
(representor) devlink ports.
To get switch id (in switchdev mode):
cat /sys/class/net/$PF0/phys_switch_id
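A minimal sketch of the attribute setup for the uplink port (error
handling and the remaining attributes omitted):

  struct devlink_port_attrs attrs = {};
  u64 dsn = pci_get_dsn(pf->pdev);

  attrs.flavour = DEVLINK_PORT_FLAVOUR_PHYSICAL;
  attrs.switch_id.id_len = sizeof(dsn);
  put_unaligned_be64(dsn, attrs.switch_id.id);
  devlink_port_attrs_set(devlink_port, &attrs);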
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
When the number of words exceeds ICE_MAX_CHAIN_WORDS, -ENOSPC should
be returned, not -EINVAL. Do not overwrite this error code in
ice_add_tc_flower_adv_fltr.
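The fix is simply to propagate the inner return value (sketch with a
hypothetical callee name):

  ret = ice_fill_lookup_words(fltr);  /* hypothetical; may return -ENOSPC */
  if (ret)
      return ret;  /* keep -ENOSPC, do not overwrite with -EINVAL */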
Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com>
Suggested-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Both ice_idc.c and ice_virtchnl.c carry their own implementation of a
helper function that looks up a given VSI based on a provided vsi_num.
Their functionality is the same, so let's introduce a common function
in ice.h that both of the mentioned sites will use.
This is strictly a cleanup; no functionality is changed.
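A sketch of such a shared helper (names approximate):

  static inline struct ice_vsi *ice_find_vsi(struct ice_pf *pf, u16 vsi_num)
  {
      int i;

      ice_for_each_vsi(pf, i)
          if (pf->vsi[i] && pf->vsi[i]->vsi_num == vsi_num)
              return pf->vsi[i];
      return NULL;
  }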
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Fix the following coccicheck warning:
./drivers/net/ethernet/intel/ice/ice_gnss.c:79:26-27: WARNING opportunity for min()
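The coccinelle rule flags an open-coded minimum, e.g.:

  /* before */
  n = (a < b) ? a : b;
  /* after */
  n = min(a, b);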
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Heiko Carstens:
- Disable -Warray-bounds warning for gcc12, since the only known way to
workaround false positive warnings on lowcore accesses would result
in worse code on fast paths.
- Avoid lockdep_assert_held() warning in kvm vm memop code.
- Reduce overhead within gmap_rmap code to get rid of long latencies
when e.g. shutting down 2nd level guests.
* tag 's390-5.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
KVM: s390: vsie/gmap: reduce gmap_rmap overhead
KVM: s390: Fix lockdep issue in vm memop
s390: disable -Warray-bounds
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fix from Thomas Bogendoerfer:
"Extend R4000/R4400 CPU erratum workaround to all revisions"
* tag 'mips-fixes_5.18_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: Fix CP0 counter erratum detection for R4k CPUs
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from can, rxrpc and wireguard.
Previous releases - regressions:
- igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter()
- mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter()
- rds: acquire netns refcount on TCP sockets
- rxrpc: enable IPv6 checksums on transport socket
- nic: hinic: fix bug of wq out of bound access
- nic: thunder: don't use pci_irq_vector() in atomic context
- nic: bnxt_en: fix possible bnxt_open() failure caused by wrong RFS
flag
- nic: mlx5e:
- lag, fix use-after-free in fib event handler
- fix deadlock in sync reset flow
Previous releases - always broken:
- tcp: fix insufficient TCP source port randomness
- can: grcan: grcan_close(): fix deadlock
- nfc: reorder destructive operations to avoid bugs
Misc:
- wireguard: improve selftests reliability"
* tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits)
NFC: netlink: fix sleep in atomic bug when firmware download timeout
selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer
tcp: drop the hash_32() part from the index calculation
tcp: increase source port perturb table to 2^16
tcp: dynamically allocate the perturb table used by source ports
tcp: add small random increments to the source port
tcp: resalt the secret every 10 seconds
tcp: use different parts of the port_offset for index and offset
secure_seq: use the 64 bits of the siphash for port offset calculation
wireguard: selftests: set panic_on_warn=1 from cmdline
wireguard: selftests: bump package deps
wireguard: selftests: restore support for ccache
wireguard: selftests: use newer toolchains to fill out architectures
wireguard: selftests: limit parallelism to $(nproc) tests at once
wireguard: selftests: make routing loop test non-fatal
net/mlx5: Fix matching on inner TTC
net/mlx5: Avoid double clear or set of sync reset requested
net/mlx5: Fix deadlock in sync reset flow
net/mlx5e: Fix trust state reset in reload
net/mlx5e: Avoid checking offload capability in post_parse action
...
|
|
kmap() is being deprecated and these usages are all local to the thread
so there is no reason kmap_local_page() can't be used.
Replace kmap() calls with kmap_local_page().
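The conversion is mechanical (sketch):

  /* before */
  addr = kmap(page);
  memcpy(dst, addr + offset, len);
  kunmap(page);

  /* after: the mapping stays local to this thread */
  addr = kmap_local_page(page);
  memcpy(dst, addr + offset, len);
  kunmap_local(addr);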
Signed-off-by: Alaa Mohamed <eng.alaamohamedsoliman.am@gmail.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The module_param allow_unsupported_sfp should be a boolean to match the
type in the ixgbe_hw struct.
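That is (sketch):

  static bool allow_unsupported_sfp;
  module_param(allow_unsupported_sfp, bool, 0);
  MODULE_PARM_DESC(allow_unsupported_sfp,
                   "Allow unsupported and untested SFP+ modules");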
Signed-off-by: Jeff Daly <jeffd@silicom-usa.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Handle adding and removing MDB entries for the host.
Signed-off-by: Casper Andersson <casper.casan@gmail.com>
Link: https://lore.kernel.org/r/20220503093922.1630804-1-casper.casan@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The fwnode of a GPIO IRQ must be set to the controller's own fwnode,
not the fwnode of the parent IRQ. Therefore, set the GPIO IRQ's fwnode
to the controller's own fwnode instead of the parent IRQ's fwnode.
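In probe, the assignment becomes something like (sketch):

  girq->fwnode = dev_fwnode(dev);  /* not the parent IRQ's fwnode */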
Fixes: 2ad74f40dacc ("gpio: visconti: Add Toshiba Visconti GPIO support")
Signed-off-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
|
|
A kernel hang can be observed when running setserial in a loop on a kernel
with force threaded interrupts. The sequence of events is:
setserial
open("/dev/ttyXXX")
request_irq()
do_stuff()
-> serial interrupt
-> wake(irq_thread)
desc->threads_active++;
close()
free_irq()
kthread_stop(irq_thread)
synchronize_irq() <- hangs because desc->threads_active != 0
The thread is created in request_irq() and woken up, but does not get on a
CPU to reach the actual thread function, which would handle the pending
wake-up. kthread_stop() sets the should stop condition which makes the
thread immediately exit, which in turn leaves the stale threads_active
count around.
This problem was introduced with commit 519cc8652b3a, which addressed an
interrupt sharing issue in the PCIe code.
Before that commit free_irq() invoked synchronize_irq(), which waits for
the hard interrupt handler and also for associated threads to complete.
To address the PCIe issue synchronize_irq() was replaced with
__synchronize_hardirq(), which only waits for the hard interrupt handler to
complete, but not for threaded handlers.
This was done under the assumption that the interrupt thread has
already reached the thread function and waits for a wake-up, which is
guaranteed to be handled before acting on the stop condition. The
problematic case, where the thread never reaches the thread function,
was obviously overlooked.
Make sure that the interrupt thread is really started and reaches
thread_fn() before returning from __setup_irq().
This utilizes the existing wait queue in the interrupt descriptor. The
wait queue is unused for non-shared interrupts. For shared interrupts the
usage might cause a spurious wake-up of a waiter in synchronize_irq() or the
completion of a threaded handler might cause a spurious wake-up of the
waiter for the ready flag. Both are harmless and have no functional impact.
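In sketch form (flag and wait queue names as in the patch, simplified):

  /* __setup_irq(): start the thread and wait until it is ready */
  wake_up_process(action->thread);
  wait_event(desc->wait_for_threads,
             test_bit(IRQTF_READY, &action->thread_flags));

  /* irq_thread(): signal readiness before entering the handler loop */
  set_bit(IRQTF_READY, &action->thread_flags);
  wake_up(&desc->wait_for_threads);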
[ tglx: Amended changelog ]
Fixes: 519cc8652b3a ("genirq: Synchronize only with single thread on free_irq()")
Signed-off-by: Thomas Pfaff <tpfaff@pcs.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/552fe7b4-9224-b183-bb87-a8f36d335690@pcs.com
|
|
My git tree has become the de facto main GPIO tree. Update the
MAINTAINERS file to reflect that.
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Reported-by: Baruch Siach <baruch@tkos.co.il>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
|
|
There is a sleep-in-atomic bug that could cause a kernel panic during
the firmware download process. The root cause is that nlmsg_new() with
the GFP_KERNEL parameter is called in fw_dnld_timeout(), which is a
timer handler. The call trace is shown below:
BUG: sleeping function called from invalid context at include/linux/sched/mm.h:265
Call Trace:
kmem_cache_alloc_node
__alloc_skb
nfc_genl_fw_download_done
call_timer_fn
__run_timers.part.0
run_timer_softirq
__do_softirq
...
nlmsg_new() with the GFP_KERNEL parameter may sleep during memory
allocation, and the timer handler runs as the result of a "software
interrupt", which must not call any function that could sleep.
This patch changes the allocation mode of the netlink message from
GFP_KERNEL to GFP_ATOMIC in order to prevent the sleep-in-atomic bug;
the GFP_ATOMIC flag makes the memory allocation safe in atomic context.
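That is, in the timer path (sketch):

  /* timer handlers run in softirq context and must not sleep */
  msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
  if (!msg)
      return;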
Fixes: 9674da8759df ("NFC: Add firmware upload netlink command")
Fixes: 9ea7187c53f6 ("NFC: netlink: Rename CMD_FW_UPLOAD to CMD_FW_DOWNLOAD")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220504055847.38026-1-duoming@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Reading 100KB chunks from a big file (e.g. dd bs=100K) leads to poor
readahead behaviour. Studying the traces in detail, I noticed two
problems.
The first is that we were setting the readahead flag on the folio which
contains the last byte read from the block. This is wrong because we
will trigger readahead at the end of the read without waiting to see
if a subsequent read is going to use the pages we just read. Instead,
we need to set the readahead flag on the first folio _after_ the one
which contains the last byte that we're reading.
The second is that we were looking for the index of the folio with the
readahead flag set to exactly match the start + size - async_size.
If we've rounded this, either down (as previously) or up (as now),
we'll think we hit a folio marked as readahead by a different read,
and try to read the wrong pages. So round the expected index to the
order of the folio we hit.
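In sketch form (field names from struct file_ra_state; the surrounding
logic is simplified):

  /* round to the folio's order so that a large folio marked by a
   * different read is not mistaken for our own readahead marker */
  expected = round_up(ra->start + ra->size - ra->async_size,
                      1UL << folio_order(folio));
  if (index == expected) {
      /* hit our own marker: kick off the next readahead batch */
  }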
Reported-by: Guo Xuenan <guoxuenan@huawei.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
It is unsafe to call folio_next() on a folio unless you hold a reference
on it that prevents it from being split or freed. After returning
from the iterator, iomap calls folio_end_writeback() which may drop
the last reference to the page, or allow the page to be split. If that
happens, the iterator will not advance far enough through the bio_vec,
leading to assertion failures like the BUG() in folio_end_writeback()
that checks we're not trying to end writeback on a page not currently
under writeback. Other assertion failures were also seen, but they're
all explained by this one bug.
Fix the bug by remembering where the next folio starts before returning
from the iterator. There are other ways of fixing this bug, but this
seems the simplest.
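The fix boils down to this pattern (sketch):

  /* remember where the next folio starts while we still hold a
   * reference; folio_end_writeback() may free or split this one */
  struct folio *next = folio_next(folio);

  folio_end_writeback(folio);
  folio = next;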
Reported-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Reported-by: Brian Foster <bfoster@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
Vladimir Oltean says:
====================
Ocelot VCAP cleanups
This is a series of minor code cleanups brought to the Ocelot switch
driver logic for VCAP filters.
- don't use list_for_each_safe() in ocelot_vcap_filter_add_to_block
- don't use magic numbers for OCELOT_POLICER_DISCARD
====================
Link: https://lore.kernel.org/r/20220503120150.837233-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
OCELOT_POLICER_DISCARD helps "kill dropped packets dead" since a
PERMIT/DENY mask mode with a port mask of 0 isn't enough to stop the CPU
port from receiving packets removed from the forwarding path.
The hardcoded initialization done for it in ocelot_vcap_init() is
confusing. All we need from it is to have a rate and a burst size of 0.
Reuse qos_policer_conf_set() for that purpose.
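Conceptually (a rough sketch; the struct fields and signature follow
the driver's internal qos_policer_conf API and may differ in detail):

  struct qos_policer_conf pp = {
      .mode = MSCC_QOS_RATE_MODE_DATA,
      .pir = 0,   /* rate 0 */
      .pbs = 0,   /* burst size 0 */
  };

  qos_policer_conf_set(ocelot, 0, OCELOT_POLICER_DISCARD, &pp);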
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The "port" argument is used for nothing else except printing on the
error path. Print errors on behalf of the policer index, which is less
confusing anyway.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Unify the code paths for adding to an empty list and to a list with
elements by keeping a "pos" list_head element that indicates where to
insert. Initialize "pos" with the list head itself in case
list_for_each_entry() doesn't iterate over any element.
Note that list_for_each_safe() isn't needed because no element is
removed from the list while iterating.
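The unified pattern looks like this (sketch with approximate names):

  struct list_head *pos = &block->rules;  /* covers the empty list and
                                           * append-at-tail cases */
  struct ocelot_vcap_filter *tmp;

  list_for_each_entry(tmp, &block->rules, list) {
      if (filter->prio < tmp->prio) {
          pos = &tmp->list;
          break;
      }
  }
  list_add_tail(&filter->list, pos);  /* inserts before "pos" */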
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This makes no functional difference but helps in minimizing the delta
for a future change.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
list_add(..., pos->prev) and list_add_tail(..., pos) are equivalent; use
the latter form to unify with the empty-list case later.
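That is, with the generic list API:

  /* before */
  list_add(&new->list, pos->prev);
  /* after: equivalent, inserts before "pos" */
  list_add_tail(&new->list, pos);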
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In commit 4fdabd509df3 ("dt-bindings: net: lan966x: remove PHY reset")
the PHY reset was removed, but I failed to remove it from the example.
Fix it.
Fixes: 4fdabd509df3 ("dt-bindings: net: lan966x: remove PHY reset")
Reported-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Walle <michael@walle.cc>
Acked-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20220503132038.2714128-1-michael@walle.cc
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As discussed here with Ido Schimmel:
https://patchwork.kernel.org/project/netdevbpf/patch/20220224102908.5255-2-jianbol@nvidia.com/
the default conform-exceed action is "reclassify", for a reason we don't
really understand.
The point is that hardware can't offload that police action, so not
specifying "conform-exceed" was always wrong, even though the command
used to work in hardware (but not in software) until the kernel started
adding validation for it.
Fix the command used by the selftest by making the policer drop on
exceed, and pass the packet to the next action (goto) on conform.
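The fixed policer action looks roughly like this (illustrative command;
the selftest's actual match and chain numbers differ):

  tc filter add dev $swp1 ingress chain 0 pref 1 flower skip_sw \
      action police rate 10mbit burst 64k conform-exceed drop/pipe \
      action goto chain 1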
Fixes: 8cd6b020b644 ("selftests: ocelot: add some example VCAP IS1, IS2 and ES0 tc offloads")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20220503121428.842906-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Willy Tarreau says:
====================
insufficient TCP source port randomness
In a not-yet published paper, Moshe Kol, Amit Klein, and Yossi Gilad
report being able to accurately identify a client by forcing it to emit
only 40 times more connections than the number of entries in the
table_perturb[] table, which is indexed by hashing the connection tuple.
The current 2^8 setting allows them to perform that attack with only 10k
connections, which is not hard to achieve in a few seconds.
Eric, Amit and I have been working on this for a few weeks now imagining,
testing and eliminating a number of approaches that Amit and his team were
still able to break or that were found to be too risky or too expensive,
and ended up with the simple improvements in this series that resists to
the attack, doesn't degrade the performance, and preserves a reliable port
selection algorithm to avoid connection failures, including the odd/even
port selection preference that allows bind() to always find a port quickly
even under strong connect() stress.
The approach relies on several factors:
- resalting the hash secret that's used to choose the table_perturb[]
entry every 10 seconds to eliminate slow attacks and force the
attacker to forget everything that was learned after this delay.
This already eliminates most of the problem because if a client
stays silent for more than 10 seconds there's no link between the
previous and the next patterns, and 10s is still infrequent enough
not to cause overly frequent repetition of the same port, which might
induce a connection failure;
- adding small random increments to the source port. Previously, a
random 0 or 1 was added every 16 ports. Now a random 0 to 7 is
added after each port. This means that with the default 32768-60999
range, a worst case rollover happens after 1764 connections, and
an average of 3137. This doesn't stop statistical attacks but
requires significantly more iterations of the same attack to
confirm a guess.
- increasing the table_perturb[] size from 2^8 to 2^16, which Amit
says will require 2.6 million connections to be attacked with the
changes above, making it pointless to get a fingerprint that will
only last 10 seconds. Due to the size, the table was made dynamic.
- a few minor improvements on the bits used from the hash, to eliminate
some unfortunate correlations that may possibly have been exploited
to design future attack models.
These changes were tested under the most extreme conditions, up to
1.1 million connections per second to one and a few targets, showing no
performance regression, and only 2 connection failures within 13 billion,
which is less than 2^-32 and perfectly within usual values.
The series is split into small reviewable changes and was already reviewed
by Amit and Eric.
====================
Link: https://lore.kernel.org/r/20220502084614.24123-1-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In commit 190cc82489f4 ("tcp: change source port randomizarion at
connect() time"), the table_perturb[] array was introduced and an
index was taken from the port_offset via hash_32(). But it turns
out that hash_32() performs a multiplication while the input here
comes from the output of SipHash in secure_seq, which is well
distributed enough to avoid the need for yet another hash.
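The change is essentially (sketch; constant names approximate):

  /* before */
  index = hash_32(port_offset, INET_TABLE_PERTURB_SHIFT);
  /* after: the siphash output is already uniformly distributed */
  index = port_offset & (INET_TABLE_PERTURB_SIZE - 1);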
Suggested-by: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Moshe Kol, Amit Klein, and Yossi Gilad reported being able to accurately
identify a client by forcing it to emit only 40 times more connections
than there are entries in the table_perturb[] table. The previous two
improvements, consisting of resalting the secret every 10s and adding
randomness to each port selection only slightly improved the situation,
and the current value of 2^8 was too small as it's not very difficult
to make a client emit 10k connections in less than 10 seconds.
Thus we're increasing the perturb table from 2^8 to 2^16 so that the
same precision now requires 2.6M connections, which is more difficult in
this time frame and harder to hide as a background activity. The impact
is that the table now uses 256 kB instead of 1 kB, which could mostly
affect devices making frequent outgoing connections. However such
components usually target a small set of destinations (load balancers,
database clients, perf assessment tools), and in practice only a few
entries will be visited, like before.
A live test at 1 million connections per second showed no performance
difference from the previous value.
Reported-by: Moshe Kol <moshe.kol@mail.huji.ac.il>
Reported-by: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We'll need to further increase the size of this table and it's likely
that at some point its size will not be suitable anymore for a static
table. Let's allocate it on boot from inet_hashinfo2_init(), which is
called from tcp_init().
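In sketch form (the allocation primitive is approximate):

  /* boot time, in inet_hashinfo2_init() */
  table_perturb = kcalloc(INET_TABLE_PERTURB_SIZE,
                          sizeof(*table_perturb), GFP_KERNEL);
  if (!table_perturb)
      panic("TCP: failed to alloc table_perturb");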
Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Here we add a random increment between 0 and 7 to the selected source
port in order to add some noise to the source port selection, making
the next port less predictable.
With the default port range of 32768-60999 this means a worst case
reuse scenario of 14116/8=1764 connections between two consecutive
uses of the same port, with an average of 14116/4.5=3137. This code
was stressed at more than 800000 connections per second to a fixed
target with all connections closed by the client using RSTs (worst
condition) and only 2 connections failed among 13 billion, despite
the hash being reseeded every 10 seconds, indicating a perfectly
safe situation.
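Roughly (sketch; the actual random primitive may differ):

  /* advance by a random 0..7 step after each candidate port */
  i += get_random_u32() & 7;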
Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In order to limit the ability for an observer to recognize the source
ports sequence used to contact a set of destinations, we should
periodically shuffle the secret. 10 seconds looks effective enough
without causing particular issues.
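One way to achieve this, as done here, is to fold a coarse time epoch
into the hash input so the mapping changes every period (sketch):

  #define EPHEMERAL_PORT_SHUFFLE_PERIOD (10 * HZ)

  return siphash_4u32((__force u32)saddr, (__force u32)daddr,
                      (__force u16)dport,
                      jiffies / EPHEMERAL_PORT_SHUFFLE_PERIOD,
                      &net_secret);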
Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Amit Klein suggests that we use different parts of port_offset for the
table's index and the port offset so that there is no direct relation
between them.
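For example (one possible split; sketch):

  /* low bits pick the table entry, high bits perturb the port */
  index = port_offset & (INET_TABLE_PERTURB_SIZE - 1);
  offset = READ_ONCE(table_perturb[index]) + (port_offset >> 32);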
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit
7cd23e5300c1 ("secure_seq: use SipHash in place of MD5"), but the output
remained truncated to 32-bit only. In order to exploit more bits from the
hash, let's make the functions return the full 64 bits of siphash_3u32().
We also make sure the port offset calculation in __inet_hash_connect()
remains done on 32-bit to avoid the need for div_u64_rem() and an extra
cost on 32-bit systems.
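The signature change is essentially (sketch):

  u64 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport)
  {
      net_secret_init();
      return siphash_3u32((__force u32)saddr, (__force u32)daddr,
                          (__force u16)dport, &net_secret);
  }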
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Creating a new netdevice allocates at least ~50Kb of memory for various
kernel objects, but only ~5Kb of them are accounted to memcg. As a result,
creating an unlimited number of netdevices inside a memcg-limited container
does not fall within memcg restrictions, consumes a significant part
of the host's memory, can cause global OOM and lead to random kills of
host processes.
The main consumers of non-accounted memory are:
~10Kb 80+ kernfs nodes
~6Kb ipv6_add_dev() allocations
6Kb __register_sysctl_table() allocations
4Kb neigh_sysctl_register() allocations
4Kb __devinet_sysctl_register() allocations
4Kb __addrconf_sysctl_register() allocations
Accounting these objects increases the share of memcg-accounted memory
to 60-70% (~38Kb accounted vs ~54Kb total for a dummy netdevice on a
typical VM with a default Fedora 35 kernel), and this should be enough
to somewhat protect the host from misuse inside a container.
Other related objects are quite small and may be left unaccounted to
minimize the expected performance degradation.
The ~300 bytes of percpu allocation for struct ipstats_mib in
snmp6_alloc_dev() deserve a separate mention: on huge multi-cpu nodes
this can become the main consumer of memory.
This patch does not enable kernfs accounting, as that affects other
parts of the kernel and should be discussed separately. However, even
without kernfs, this patch significantly improves the current situation
and accounts for more than half of all netdevice allocations.
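The changes are all of this shape (sketch):

  /* before */
  p = kzalloc(size, GFP_KERNEL);
  /* after: charge the allocation to the current memory cgroup */
  p = kzalloc(size, GFP_KERNEL_ACCOUNT);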
Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/354a0a5f-9ec3-a25c-3215-304eab2157bc@openvz.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Jason A. Donenfeld says:
====================
wireguard patches for 5.18-rc6
In working on some other problems, I wound up leaning on the WireGuard
CI more than usual and uncovered a few small issues with reliability.
These are fairly low key changes, since they don't impact kernel code
itself.
One change does stick out in particular, though, which is the "make
routing loop test non-fatal" commit. I'm not thrilled about doing this,
but currently [1] remains unsolved, and I'm still working on a real
solution to that (hopefully for 5.19 or 5.20 if I can come up with a
good idea...), so for now that test just prints a big red warning
instead.
[1] https://lore.kernel.org/netdev/YmszSXueTxYOC41G@zx2c4.com/
====================
Link: https://lore.kernel.org/r/20220504202920.72908-1-Jason@zx2c4.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Rather than setting this once init is running, set panic_on_warn from
the kernel command line, so that it catches splats from WireGuard
initialization code and the various crypto selftests.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use newer, more reliable package dependencies. These should hopefully
reduce flakes. However, we keep the old iputils package, as newer
versions accumulated bugs that resulted in flakes on slow machines.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When moving to non-system toolchains, we inadvertently killed the
ability to use ccache. So instead, build ccache support into the test
harness directly.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Rather than relying on the system to have cross toolchains available,
simply download musl.cc's ones and use that libc.so, and then we use it
to fill in a few missing platforms, such as riscv64, powerpc64,
and s390x.
Since riscv doesn't have a second serial port in its device description,
we have to use virtio's vport. This is actually the same situation on
ARM, but we were previously hacking QEMU up to work around this, which
required a custom QEMU. Instead just do the vport trick on ARM too.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The parallel tests were added to catch queueing issues from multiple
cores. But what happens in reality when testing tons of processes is
that these separate threads wind up fighting with the scheduler, and we
wind up with contention in places we don't care about that decrease the
chances of hitting a bug. So just do a test with the number of CPU
cores, rather than trying to scale up arbitrarily.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
I hate to do this, but I still do not have a good solution to actually
fix this bug across architectures. So just disable it for now, so that
the CI can still deliver actionable results. This commit adds a large
red warning, so that at least the failure isn't lost forever, and
hopefully this can be revisited down the line.
Link: https://lore.kernel.org/netdev/CAHmME9pv1x6C4TNdL6648HydD8r+txpV4hTUXOBVkrapBXH4QQ@mail.gmail.com/
Link: https://lore.kernel.org/netdev/YmszSXueTxYOC41G@zx2c4.com/
Link: https://lore.kernel.org/wireguard/CAHmME9rNnBiNvBstb7MPwK-7AmAN0sOfnhdR=eeLrowWcKxaaQ@mail.gmail.com/
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The FPU usage related to task FPU management is either protected by
disabling interrupts (switch_to, return to user) or via fpregs_lock() which
is a wrapper around local_bh_disable(). When kernel code wants to use
the FPU, it has to check whether that is possible by calling
irq_fpu_usable().
But the condition in irq_fpu_usable() is wrong. It allows FPU to be used
when:
!in_interrupt() || interrupted_user_mode() || interrupted_kernel_fpu_idle()
The latter checks whether some other context already uses the FPU in
the kernel; if not, it allows the FPU to be used unconditionally, even
if the calling context interrupted an fpregs_lock() critical region. If
that happens, the FPU state of the interrupted context becomes
corrupted.
Allow in-kernel FPU usage only when no other context already uses the
FPU in the kernel, and either the calling context is not hard interrupt
context or the hard interrupt did not interrupt a region with bottom
halves disabled locally.
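A condensed sketch of the corrected check (per-CPU flag name as used by
the x86 FPU code; simplified):

  bool irq_fpu_usable(void)
  {
      if (WARN_ON_ONCE(in_nmi()))
          return false;

      /* Some other context already uses the FPU in the kernel */
      if (this_cpu_read(in_kernel_fpu))
          return false;

      /* Task or softirq context: fpregs_lock() disables bottom halves,
       * so this cannot be inside a critical region */
      if (!in_hardirq())
          return true;

      /* Hard interrupt: only safe if it did not interrupt a
       * BH-disabled, potentially fpregs_lock()'ed, region */
      return !softirq_count();
  }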
It's hard to find a proper Fixes tag as the condition was broken in one way
or the other for a very long time and the eager/lazy FPU changes caused a
lot of churn. Picked something remotely connected from the history.
This survived undetected for quite some time as FPU usage in interrupt
context is rare, but the recent changes to the random code unearthed it at
least on a kernel which had FPU debugging enabled. There is probably a
higher rate of silent corruption as not all issues can be detected by the
FPU debugging code. This will be addressed in a subsequent change.
Fixes: 5d2bd7009f30 ("x86, fpu: decouple non-lazy/eager fpu restore from xsave")
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220501193102.588689270@linutronix.de
|