Age | Commit message (Collapse) | Author |
|
During MU initialization, there maybe pending GSR and RSR pending
interrupt, clear them to avoid unexpected kernel dump when requesting
mailbox channel
Reviewed-by: Jacky Bai <ping.bai@nxp.com>
Reviewed-by: Ye Li <ye.li@nxp.com>
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
|
|
The two parameters of 'res' and 'cflags' are swapped, so fix it.
Without this fix, 'ublk del' hangs forever.
Cc: Pavel Begunkov <asml.silence@gmail.com>
Fixes: de23077eda61f ("io_uring: set completion results upfront")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220803120757.1668278-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The section name of Rel and Rela starts with ".rel" and ".rela"
respectively (but, I do not know whether this is specification or
convention).
For example, ".rela.text" holds relocation entries applied to the
".text" section.
So, the code chops the ".rel" or ".rela" prefix to get the name of
the section to which the relocation applies.
However, I do not like to skip 4 or 5 bytes blindly because it is
potential memory overrun.
The ELF specification provides a more reliable way to do this.
- The sh_info field holds extra information, whose interpretation
depends on the section type
- If the section type is SHT_REL or SHT_RELA, the sh_info field holds
the section header index of the section to which the relocation
applies.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
The section index is always positive, so the argument, secindex, should
be unsigned.
Also, inserted the array range check.
If sym->st_shndx is a special section index (between SHN_LORESERVE and
SHN_HIRESERVE), there is no corresponding section header.
For example, if a symbol specifies an absolute value, sym->st_shndx is
SHN_ABS (=0xfff1).
The current users do not cause the out-of-range access of
info->sechddrs[], but it is better to avoid such a pitfall.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
SPECIAL() is only used in get_secindex(). Squash it.
Make the code more readable with more comments.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
Swap the order of 'mkdir' and 'trap' just in case the subshell is
interrupted between 'mkdir' and 'trap' although the effect might be
subtle.
This does not intend to make the cleanup perfect. There are more cases
that miss to remove the tmp directory, for example:
- When interrupted, dash does not invoke the EXIT trap (bash does)
- 'rm' command might be interrupted before removing the directory
I am not addressing all the cases since the tmp directory is harmless
after all.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
|
|
Since the user can control the arguments of the ioctl() from the user
space, under special arguments that may result in a divide-by-zero bug.
If the user provides an improper 'pixclock' value that makes the argumet
of i740_calc_vclk() less than 'I740_RFREQ_FIX', it will cause a
divide-by-zero bug in:
drivers/video/fbdev/i740fb.c:353 p_best = min(15, ilog2(I740_MAX_VCO_FREQ / (freq / I740_RFREQ_FIX)));
The following log can reveal it:
divide error: 0000 [#1] PREEMPT SMP KASAN PTI
RIP: 0010:i740_calc_vclk drivers/video/fbdev/i740fb.c:353 [inline]
RIP: 0010:i740fb_decode_var drivers/video/fbdev/i740fb.c:646 [inline]
RIP: 0010:i740fb_set_par+0x163f/0x3b70 drivers/video/fbdev/i740fb.c:742
Call Trace:
fb_set_var+0x604/0xeb0 drivers/video/fbdev/core/fbmem.c:1034
do_fb_ioctl+0x234/0x670 drivers/video/fbdev/core/fbmem.c:1110
fb_ioctl+0xdd/0x130 drivers/video/fbdev/core/fbmem.c:1189
Fix this by checking the argument of i740_calc_vclk() first.
Signed-off-by: Zheyu Ma <zheyuma97@gmail.com>
Signed-off-by: Helge Deller <deller@gmx.de>
|
|
Since the user can control the arguments of the ioctl() from the user
space, under special arguments that may result in a divide-by-zero bug
in:
drivers/video/fbdev/arkfb.c:784: ark_set_pixclock(info, (hdiv * info->var.pixclock) / hmul);
with hdiv=1, pixclock=1 and hmul=2 you end up with (1*1)/2 = (int) 0.
and then in:
drivers/video/fbdev/arkfb.c:504: rv = dac_set_freq(par->dac, 0, 1000000000 / pixclock);
we'll get a division-by-zero.
The following log can reveal it:
divide error: 0000 [#1] PREEMPT SMP KASAN PTI
RIP: 0010:ark_set_pixclock drivers/video/fbdev/arkfb.c:504 [inline]
RIP: 0010:arkfb_set_par+0x10fc/0x24c0 drivers/video/fbdev/arkfb.c:784
Call Trace:
fb_set_var+0x604/0xeb0 drivers/video/fbdev/core/fbmem.c:1034
do_fb_ioctl+0x234/0x670 drivers/video/fbdev/core/fbmem.c:1110
fb_ioctl+0xdd/0x130 drivers/video/fbdev/core/fbmem.c:1189
Fix this by checking the argument of ark_set_pixclock() first.
Fixes: 681e14730c73 ("arkfb: new framebuffer driver for ARK Logic cards")
Signed-off-by: Zheyu Ma <zheyuma97@gmail.com>
Signed-off-by: Helge Deller <deller@gmx.de>
|
|
RSB fill sequence does not have any protection for miss-prediction of
conditional branch at the end of the sequence. CPU can speculatively
execute code immediately after the sequence, while RSB filling hasn't
completed yet.
#define __FILL_RETURN_BUFFER(reg, nr, sp) \
mov $(nr/2), reg; \
771: \
ANNOTATE_INTRA_FUNCTION_CALL; \
call 772f; \
773: /* speculation trap */ \
UNWIND_HINT_EMPTY; \
pause; \
lfence; \
jmp 773b; \
772: \
ANNOTATE_INTRA_FUNCTION_CALL; \
call 774f; \
775: /* speculation trap */ \
UNWIND_HINT_EMPTY; \
pause; \
lfence; \
jmp 775b; \
774: \
add $(BITS_PER_LONG/8) * 2, sp; \
dec reg; \
jnz 771b; <----- CPU can miss-predict here.
Before RSB is filled, RETs that come in program order after this macro
can be executed speculatively, making them vulnerable to RSB-based
attacks.
Mitigate it by adding an LFENCE after the conditional branch to prevent
speculation while RSB is being filled.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
|
|
This function always returns 0, and ignores the nofail boolean. Drop the
nofail argument, make the function void return and fix up the callers.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Commit 9ad21c3f3ecf ("kbuild: try harder to find symbol names in
modpost") added Elf_Sword (in a wrong way), but did not use it at all.
BTW, the current code looks weird.
The fix for the 32-bit part would be:
Elf64_Sword --> Elf32_Sword
(inconsistet prefix, Elf32_ vs Elf64_)
The fix for the 64-bit part would be:
Elf64_Sxword --> Elf64_Sword
(the size is different between Sword and Sxword)
Note:
Elf32_Sword == Elf64_Sword == int32_t
Elf32_Sxword == Elf64_Sxword == int64_t
Anyway, let's drop unused code instead of fixing it.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
|
|
There's been an ongoing mission to re-enable the -Wformat warning for
Clang. A previous attempt at enabling the warning showed that there were
many instances of this warning throughout the codebase. The sheer amount
of these warnings really polluted builds and thus -Wno-format was added
to _temporarily_ toggle them off.
After many patches the warning has largely been eradicated for x86,
x86_64, arm, and arm64 on a variety of configs. The time to enable the
warning has never been better as it seems for the first time we are
ahead of them and can now solve them as they appear rather than tackling
from a backlog.
As to the root cause of this large backlog of warnings, Clang seems to
pickup on some more nuanced cases of format warnings caused by implicit
integer conversion as well as default argument promotions from
printf-like functions.
Link: https://github.com/ClangBuiltLinux/linux/issues/378
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
GCC-12 started triggering a new warning:
arch/x86/mm/numa.c: In function ‘cpumask_of_node’:
arch/x86/mm/numa.c:916:39: warning: the comparison will always evaluate as ‘false’ for the address of ‘node_to_cpumask_map’ will never be NULL [-Waddress]
916 | if (node_to_cpumask_map[node] == NULL) {
| ^~
node_to_cpumask_map is of type cpumask_var_t[].
When CONFIG_CPUMASK_OFFSTACK is set, cpumask_var_t is typedef'd to a
pointer for dynamic allocation, else to an array of one element. The
"wicked game" can be checked on line 700 of include/linux/cpumask.h.
The original code in debug_cpumask_set_cpu() and cpumask_of_node() were
probably written by the original authors with CONFIG_CPUMASK_OFFSTACK=y
(i.e. dynamic allocation) in mind, checking if the cpumask was available
via a direct NULL check.
When CONFIG_CPUMASK_OFFSTACK is not set, GCC gives the above warning
while compiling the kernel.
Fix that by using cpumask_available(), which does the NULL check when
CONFIG_CPUMASK_OFFSTACK is set, otherwise returns true. Use it wherever
such checks are made.
Conditional definitions of cpumask_available() can be found along with
the definition of cpumask_var_t. Check the cpumask.h reference mentioned
above.
Fixes: c032ef60d1aa ("cpumask: convert node_to_cpumask_map[] to cpumask_var_t")
Fixes: de2d9445f162 ("x86: Unify node_to_cpumask_map handling between 32 and 64bit")
Signed-off-by: Siddh Raman Pant <code@siddh.me>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220731160913.632092-1-code@siddh.me
|
|
tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as
documented for RET instructions after VM exits. Mitigate it with a new
one-entry RSB stuffing mechanism and a new LFENCE.
== Background ==
Indirect Branch Restricted Speculation (IBRS) was designed to help
mitigate Branch Target Injection and Speculative Store Bypass, i.e.
Spectre, attacks. IBRS prevents software run in less privileged modes
from affecting branch prediction in more privileged modes. IBRS requires
the MSR to be written on every privilege level change.
To overcome some of the performance issues of IBRS, Enhanced IBRS was
introduced. eIBRS is an "always on" IBRS, in other words, just turn
it on once instead of writing the MSR on every privilege level change.
When eIBRS is enabled, more privileged modes should be protected from
less privileged modes, including protecting VMMs from guests.
== Problem ==
Here's a simplification of how guests are run on Linux' KVM:
void run_kvm_guest(void)
{
// Prepare to run guest
VMRESUME();
// Clean up after guest runs
}
The execution flow for that would look something like this to the
processor:
1. Host-side: call run_kvm_guest()
2. Host-side: VMRESUME
3. Guest runs, does "CALL guest_function"
4. VM exit, host runs again
5. Host might make some "cleanup" function calls
6. Host-side: RET from run_kvm_guest()
Now, when back on the host, there are a couple of possible scenarios of
post-guest activity the host needs to do before executing host code:
* on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not
touched and Linux has to do a 32-entry stuffing.
* on eIBRS hardware, VM exit with IBRS enabled, or restoring the host
IBRS=1 shortly after VM exit, has a documented side effect of flushing
the RSB except in this PBRSB situation where the software needs to stuff
the last RSB entry "by hand".
IOW, with eIBRS supported, host RET instructions should no longer be
influenced by guest behavior after the host retires a single CALL
instruction.
However, if the RET instructions are "unbalanced" with CALLs after a VM
exit as is the RET in #6, it might speculatively use the address for the
instruction after the CALL in #3 as an RSB prediction. This is a problem
since the (untrusted) guest controls this address.
Balanced CALL/RET instruction pairs such as in step #5 are not affected.
== Solution ==
The PBRSB issue affects a wide variety of Intel processors which
support eIBRS. But not all of them need mitigation. Today,
X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates
PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e.,
eIBRS systems which enable legacy IBRS explicitly.
However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT
and most of them need a new mitigation.
Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE
which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT.
The lighter-weight mitigation performs a CALL instruction which is
immediately followed by a speculative execution barrier (INT3). This
steers speculative execution to the barrier -- just like a retpoline
-- which ensures that speculation can never reach an unbalanced RET.
Then, ensure this CALL is retired before continuing execution with an
LFENCE.
In other words, the window of exposure is opened at VM exit where RET
behavior is troublesome. While the window is open, force RSB predictions
sampling for RET targets to a dead end at the INT3. Close the window
with the LFENCE.
There is a subset of eIBRS systems which are not vulnerable to PBRSB.
Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB.
Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO.
[ bp: Massage, incorporate review comments from Andy Cooper. ]
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
|
|
There are several symbols defined in kernel/sched/sched.h but get wrapped
in CONFIG_CGROUP_SCHED, even though dummy versions get built in rt.c and
therefore trigger Sparse warnings:
kernel/sched/rt.c:309:6: warning: symbol 'unregister_rt_sched_group' was not declared. Should it be static?
kernel/sched/rt.c:311:6: warning: symbol 'free_rt_sched_group' was not declared. Should it be static?
kernel/sched/rt.c:313:5: warning: symbol 'alloc_rt_sched_group' was not declared. Should it be static?
Fix this by moving them outside the CONFIG_CGROUP_SCHED block.
[ mingo: Refreshed to the latest scheduler tree, tweaked changelog. ]
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220721145155.358366-1-ben-linux@fluff.org
|
|
The AVR32 architecture does no longer exist in the Linux kernel, hence
remove a reference to also non-existing Linux BSP CD from Atmel.
Signed-off-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
|
|
The AVR32 architecture has been removed from the kernel in commit
26202873bb51fafdaa51be3e8de7aab9beb49f70, hence clean out the
atmel,at32ap-lcdc parts in the atmel_lcdfb.c video driver.
AVR32 architecture never supported device tree, hence this code was not
used by anybody.
Signed-off-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
|
|
The AVR32 architecture does no longer exist in the Linux kernel, hence
remove a reference to it in Kconfig help text to avoid confusion.
Signed-off-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
|
|
The AVR32 architecture does no longer exist in the Linux kernel, hence
remove a reference to it in Kconfig help text to avoid confusion.
Signed-off-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
|
|
The AVR32 architecture has been removed from the kernel in commit
26202873bb51fafdaa51be3e8de7aab9beb49f70, hence clean out the
cdns,at32ap7000-macb compatible entry in Cadence macb Ethernet driver.
AVR32 architecture never supported device tree, hence this code was not
used by anybody.
Updated documentation to match the default entry, no users of
cdns,at32ap7000-macb in the kernel tree.
Signed-off-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
|
|
I have changed my overall maintainer email address to the samfundet.no
domain, hence update the atmel-ssc module to reflect that.
Also remove the AVR32 reference, since the AVR32 architecture no longer
exist in the Linux kernel.
Signed-off-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
|
|
The AVR32 architecture does no longer exist in the Linux kernel, hence
remove a reference to it in comments to avoid confusion.
Signed-off-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
|
|
The AVR32 architecture does no longer exist in the Linux kernel, hence
remove a reference to it in comments to avoid confusion.
Signed-off-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
|
|
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating
that the cpuset will just use the effective CPUs of its parent. So
cpuset_can_attach() can call task_can_attach() with an empty mask.
This can lead to cpumask_any_and() returns nr_cpu_ids causing the call
to dl_bw_of() to crash due to percpu value access of an out of bound
CPU value. For example:
[80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
:
[80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
:
[80468.207946] Call Trace:
[80468.208947] cpuset_can_attach+0xa0/0x140
[80468.209953] cgroup_migrate_execute+0x8c/0x490
[80468.210931] cgroup_update_dfl_csses+0x254/0x270
[80468.211898] cgroup_subtree_control_write+0x322/0x400
[80468.212854] kernfs_fop_write_iter+0x11c/0x1b0
[80468.213777] new_sync_write+0x11f/0x1b0
[80468.214689] vfs_write+0x1eb/0x280
[80468.215592] ksys_write+0x5f/0xe0
[80468.216463] do_syscall_64+0x5c/0x80
[80468.224287] entry_SYSCALL_64_after_hwframe+0x44/0xae
Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
to be used by tasks within the cpuset anyway.
Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
reflect the change. In addition, a check is added to task_can_attach()
to guard against the possibility that cpumask_any_and() may return a
value >= nr_cpu_ids.
Fixes: 7f51412a415d ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
|
|
Going forward, I'll be using my kernel.org for upstream work.
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Lee Jones <lee@kernel.org>
|
|
Conflicts:
net/ax25/af_ax25.c
d7c4c9e075f8c ("ax25: fix incorrect dev_tracker usage")
d62607c3fe459 ("net: rename reference+tracking helpers")
drivers/net/netdevsim/fib.c
180a6a3ee60a ("netdevsim: fib: Fix reference count leak on route deletion failure")
012ec02ae441 ("netdevsim: convert driver to use unlocked devlink API during init/fini")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When building ppc64_book3e_allmodconfig the build fails with:
arch/powerpc/kexec/file_load_64.c:1063:14: error: implicit declaration of function ‘firmware_has_feature’
1063 | if (!firmware_has_feature(FW_FEATURE_LPAR))
| ^~~~~~~~~~~~~~~~~~~~
Add a direct include of asm/firmware.h to fix the error.
Fixes: b1fc44eaa9ba ("pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220803063152.1249270-1-mpe@ellerman.id.au
|
|
When we multiply an unsigned int by a u32 we still end up with an
unsigned int. That means we should specify "%u" not "%lu" in the
format code.
NOTE: this fix was chosen instead of somehow promoting the value to
"unsigned long" since the max baud rate from the earlier call to
uart_get_baud_rate() is 4000000 and the max sampling rate is 32.
4000000 * 32 = 0x07a12000, not even close to overflowing 32-bits.
Fixes: c474c775716e ("tty: serial: qcom-geni-serial: Fix get_clk_div_rate() which otherwise could return a sub-optimal clock rate.")
Reported-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20220802132250.1.Iea061e14157a17e114dbe2eca764568a02d6b889@changeid
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The commit in Fixes: has changed a .txt file into a .yaml file. Update the
documentation accordingly.
While at it add some `` around some file names to improve the output.
Fixes: 70991f1e6858 ("dt-bindings: net: convert sff,sfp to dtschema")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/be3c7e87ca7f027703247eccfe000b8e34805094.1659247114.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.20-2022-07-29:
amdgpu:
- Misc spelling and grammar fixes
- DC whitespace cleanups
- ACP smatch fix
- GFX 11.0 updates
- PSP 13.0 updates
- VCN 4.0 updates
- DC FP fix for PPC64
- Misc bug fixes
amdkfd:
- SVM fixes
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220729202742.6636-1-alexander.deucher@amd.com
|
|
This fixes a race between changing the ext4 superblock uuid and operations
like mounting, resizing, changing features, etc.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jeremy Bongio <bongiojp@gmail.com>
Link: https://lore.kernel.org/r/20220721224422.438351-1-bongiojp@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch avoids an attempt to resize the filesystem to an
unaligned cluster boundary. An online resize to a size that is not
integral to cluster size results in the last iteration attempting to
grow the fs by a negative amount, which trips a BUG_ON and leaves the fs
with a corrupted in-memory superblock.
Signed-off-by: Oleg Kiselev <okiselev@amazon.com>
Link: https://lore.kernel.org/r/0E92A0AB-4F16-4F1A-94B7-702CC6504FDE@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch avoids doing an O(n**2)-complexity walk through every flex group.
Instead, it uses the already computed overhead information for the newly
allocated space, and simply adds it to the previously calculated
overhead stored in the superblock. This drastically reduces the time
taken to resize very large bigalloc filesystems (from 3+ hours for a
64TB fs down to milliseconds).
Signed-off-by: Oleg Kiselev <okiselev@amazon.com>
Link: https://lore.kernel.org/r/CE4F359F-4779-45E6-B6A9-8D67FDFF5AE2@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Following process will fail assertion 'jh->b_frozen_data == NULL' in
jbd2_journal_dirty_metadata():
jbd2_journal_commit_transaction
unlink(dir/a)
jh->b_transaction = trans1
jh->b_jlist = BJ_Metadata
journal->j_running_transaction = NULL
trans1->t_state = T_COMMIT
unlink(dir/b)
handle->h_trans = trans2
do_get_write_access
jh->b_modified = 0
jh->b_frozen_data = frozen_buffer
jh->b_next_transaction = trans2
jbd2_journal_dirty_metadata
is_handle_aborted
is_journal_aborted // return false
--> jbd2 abort <--
while (commit_transaction->t_buffers)
if (is_journal_aborted)
jbd2_journal_refile_buffer
__jbd2_journal_refile_buffer
WRITE_ONCE(jh->b_transaction,
jh->b_next_transaction)
WRITE_ONCE(jh->b_next_transaction, NULL)
__jbd2_journal_file_buffer(jh, BJ_Reserved)
J_ASSERT_JH(jh, jh->b_frozen_data == NULL) // assertion failure !
The reproducer (See detail in [Link]) reports:
------------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:1629!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 584 Comm: unlink Tainted: G W
5.19.0-rc6-00115-g4a57a8400075-dirty #697
RIP: 0010:jbd2_journal_dirty_metadata+0x3c5/0x470
RSP: 0018:ffffc90000be7ce0 EFLAGS: 00010202
Call Trace:
<TASK>
__ext4_handle_dirty_metadata+0xa0/0x290
ext4_handle_dirty_dirblock+0x10c/0x1d0
ext4_delete_entry+0x104/0x200
__ext4_unlink+0x22b/0x360
ext4_unlink+0x275/0x390
vfs_unlink+0x20b/0x4c0
do_unlinkat+0x42f/0x4c0
__x64_sys_unlink+0x37/0x50
do_syscall_64+0x35/0x80
After journal aborting, __jbd2_journal_refile_buffer() is executed with
holding @jh->b_state_lock, we can fix it by moving 'is_handle_aborted()'
into the area protected by @jh->b_state_lock.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216251
Fixes: 470decc613ab20 ("[PATCH] jbd2: initial copy of files from jbd")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20220715125152.4022726-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Block range to free is validated in ext4_free_blocks() using
ext4_inode_block_valid() and then it's passed to ext4_mb_clear_bb().
However in some situations on bigalloc file system the range might be
adjusted after the validation in ext4_free_blocks() which can lead to
troubles on corrupted file systems such as one found by syzkaller that
resulted in the following BUG
kernel BUG at fs/ext4/ext4.h:3319!
PREEMPT SMP NOPTI
CPU: 28 PID: 4243 Comm: repro Kdump: loaded Not tainted 5.19.0-rc6+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
RIP: 0010:ext4_free_blocks+0x95e/0xa90
Call Trace:
<TASK>
? lock_timer_base+0x61/0x80
? __es_remove_extent+0x5a/0x760
? __mod_timer+0x256/0x380
? ext4_ind_truncate_ensure_credits+0x90/0x220
ext4_clear_blocks+0x107/0x1b0
ext4_free_data+0x15b/0x170
ext4_ind_truncate+0x214/0x2c0
? _raw_spin_unlock+0x15/0x30
? ext4_discard_preallocations+0x15a/0x410
? ext4_journal_check_start+0xe/0x90
? __ext4_journal_start_sb+0x2f/0x110
ext4_truncate+0x1b5/0x460
? __ext4_journal_start_sb+0x2f/0x110
ext4_evict_inode+0x2b4/0x6f0
evict+0xd0/0x1d0
ext4_enable_quotas+0x11f/0x1f0
ext4_orphan_cleanup+0x3de/0x430
? proc_create_seq_private+0x43/0x50
ext4_fill_super+0x295f/0x3ae0
? snprintf+0x39/0x40
? sget_fc+0x19c/0x330
? ext4_reconfigure+0x850/0x850
get_tree_bdev+0x16d/0x260
vfs_get_tree+0x25/0xb0
path_mount+0x431/0xa70
__x64_sys_mount+0xe2/0x120
do_syscall_64+0x5b/0x80
? do_user_addr_fault+0x1e2/0x670
? exc_page_fault+0x70/0x170
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fdf4e512ace
Fix it by making sure that the block range is properly validated before
used every time it changes in ext4_free_blocks() or ext4_mb_clear_bb().
Link: https://syzkaller.appspot.com/bug?id=5266d464285a03cee9dbfda7d2452a72c3c2ae7c
Reported-by: syzbot+15cd994e273307bf5cfa@syzkaller.appspotmail.com
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Link: https://lore.kernel.org/r/20220714165903.58260-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Use the fact that entries with elevated refcount are not removed from
the hash and just move removal of the entry from the hash to the entry
freeing time. When doing this we also change the generic code to hold
one reference to the cache entry, not two of them, which makes code
somewhat more obvious.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-10-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Nobody uses mb_cache_entry_delete() anymore. Remove it.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-9-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently when we decide to reuse xattr block we detect the case when
the last reference to xattr block is being dropped at the same time and
cancel the reuse attempt. Convert ext2 to a new scheme when as soon as
matching mbcache entry is found, we wait with dropping the last xattr
block reference until mbcache entry reference is dropped (meaning either
the xattr block reference is increased or we decided not to reuse the
block).
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-8-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Replace one else in ext2_xattr_set() with a goto. This makes following
code changes simpler to follow. No functional changes.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220712105436.32204-7-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Free of xattr block reference is opencode in two places. Factor it out
into a separate function and use it.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-6-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When ext4_xattr_block_set() decides to remove xattr block the following
race can happen:
CPU1 CPU2
ext4_xattr_block_set() ext4_xattr_release_block()
new_bh = ext4_xattr_block_cache_find()
lock_buffer(bh);
ref = le32_to_cpu(BHDR(bh)->h_refcount);
if (ref == 1) {
...
mb_cache_entry_delete();
unlock_buffer(bh);
ext4_free_blocks();
...
ext4_forget(..., bh, ...);
jbd2_journal_revoke(..., bh);
ext4_journal_get_write_access(..., new_bh, ...)
do_get_write_access()
jbd2_journal_cancel_revoke(..., new_bh);
Later the code in ext4_xattr_block_set() finds out the block got freed
and cancels reusal of the block but the revoke stays canceled and so in
case of block reuse and journal replay the filesystem can get corrupted.
If the race works out slightly differently, we can also hit assertions
in the jbd2 code.
Fix the problem by making sure that once matching mbcache entry is
found, code dropping the last xattr block reference (or trying to modify
xattr block in place) waits until the mbcache entry reference is
dropped. This way code trying to reuse xattr block is protected from
someone trying to drop the last reference to xattr block.
Reported-and-tested-by: Ritesh Harjani <ritesh.list@gmail.com>
CC: stable@vger.kernel.org
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Remove unnecessary else (and thus indentation level) from a code block
in ext4_xattr_block_set(). It will also make following code changes
easier. No functional changes.
CC: stable@vger.kernel.org
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently we remove EA inode from mbcache as soon as its xattr refcount
drops to zero. However there can be pending attempts to reuse the inode
and thus refcount handling code has to handle the situation when
refcount increases from zero anyway. So save some work and just keep EA
inode in mbcache until it is getting evicted. At that moment we are sure
following iget() of EA inode will fail anyway (or wait for eviction to
finish and load things from the disk again) and so removing mbcache
entry at that moment is fine and simplifies the code a bit.
CC: stable@vger.kernel.org
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Add function mb_cache_entry_delete_or_get() to delete mbcache entry if
it is unused and also add a function to wait for entry to become unused
- mb_cache_entry_wait_unused(). We do not share code between the two
deleting function as one of them will go away soon.
CC: stable@vger.kernel.org
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Do not reclaim entries that are currently used by somebody from a
shrinker. Firstly, these entries are likely useful. Secondly, we will
need to keep such entries to protect pending increment of xattr block
refcount.
CC: stable@vger.kernel.org
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4_append() must always allocate a new block, otherwise we run the
risk of overwriting existing directory block corrupting the directory
tree in the process resulting in all manner of problems later on.
Add a sanity check to see if the logical block is already allocated and
error out if it is.
Cc: stable@kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220704142721.157985-2-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently ext4 directory handling code implicitly assumes that the
directory blocks are always within the i_size. In fact ext4_append()
will attempt to allocate next directory block based solely on i_size and
the i_size is then appropriately increased after a successful
allocation.
However, for this to work it requires i_size to be correct. If, for any
reason, the directory inode i_size is corrupted in a way that the
directory tree refers to a valid directory block past i_size, we could
end up corrupting parts of the directory tree structure by overwriting
already used directory blocks when modifying the directory.
Fix it by catching the corruption early in __ext4_read_dirblock().
Addresses Red-Hat-Bugzilla: #2070205
CVE: CVE-2022-1184
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220704142721.157985-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Add support to display the mb_optimize_scan value in
/proc/fs/ext4/<dev>/options file. The option is only
displayed when the value is non default.
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20220704054603.21462-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Now if check directoy entry is corrupted, ext4_empty_dir may return true
then directory will be removed when file system mounted with "errors=continue".
In order not to make things worse just return false when directory is corrupted.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220622090223.682234-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|