summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-07-11selftests/x86: build sysret_rip.c with clangJohn Hubbard
When building with clang, via: make LLVM=1 -C tools/testing/selftests ...the build fails because clang's inline asm doesn't support all of the features that are used in the asm() snippet in sysret_rip.c. Fix this by moving the asm code into the clang_helpers_64.S file, where it can be built with the assembler's full set of features. Acked-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/x86: build fsgsbase_restore.c with clangJohn Hubbard
When building with clang, via: make LLVM=1 -C tools/testing/selftests Fix this by moving the inline asm to "pure" assembly, in two new files: clang_helpers_32.S, clang_helpers_64.S. As a bonus, the pure asm avoids the need for ifdefs, and is now very simple and easy on the eyes. Acked-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests: x86: test_FISTTP: use fisttps instead of ambiguous fisttpMuhammad Usama Anjum
Use fisttps instead of fisttp to specify correctly that the output variable is of size short. test_FISTTP.c:28:3: error: ambiguous instructions require an explicit suffix (could be 'fisttps', or 'fisttpl') 28 | " fisttp res16""\n" | ^ <inline asm>:3:2: note: instantiated into assembly here 3 | fisttp res16 | ^ ...followed by three more cases of the same warning for other lines. [jh: removed a bit of duplication from the warnings report, above, and fixed a typo in the title] Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/x86: fix Makefile dependencies to work with clangJohn Hubbard
When building with clang, via: make LLVM=1 -C tools/testing/selftests ...the following build failure occurs in selftests/x86: clang: error: cannot specify -o when generating multiple output files This happens because, although gcc doesn't complain if you invoke it like this: gcc file1.c header2.h ...clang won't accept that form--it rejects the .h file(s). Also, the above approach is inaccurate anyway, because file.c includes header2.h in this case, and the inclusion of header2.h on the invocation is an artifact of the Makefile's desire to maintain dependencies. In Makefiles of this type, a better way to do it is to use Makefile dependencies to trigger the appropriate incremental rebuilds, and separately use file lists (see EXTRA_FILES in this commit) to track what to pass to the compiler. This commit splits those concepts up, by setting up both EXTRA_FILES and the Makefile dependencies with a single call to the new Makefile function extra-files. That fixes the build failure, while still providing the correct dependencies in all cases. Acked-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/timers: remove unused irqcount variableJohn Hubbard
When building with clang, via: make LLVM=1 -C tools/testing/selftest ...clang warns about an unused irqcount variable. clang is correct: the variable is incremented and then ignored. Fix this by deleting the irqcount variable. Signed-off-by: John Hubbard <jhubbard@nvidia.com> Acked-by: John Stultz <jstultz@google.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests: Add information about TAP conformance in testsMuhammad Usama Anjum
Although "TAP" word is being used already in documentation, but it hasn't been defined in informative way for developers that how to write TAP conformant tests and what are the benefits. Write a short brief about it. Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Remove test name comparing from write_bm_pid_to_resctrl()Ilpo Järvinen
write_bm_pid_to_resctrl() uses resctrl_val to check test name which is not a good interface generic resctrl FS functions should provide. Tests define mongrp when needed. Remove the test name check in write_bm_pid_to_resctrl() to only rely on the mongrp parameter being non-NULL. Remove write_bm_pid_to_resctrl() resctrl_val parameter and resctrl_val member from the struct resctrl_val_param that are not used anymore. Similarly, remove the test name constants that are no longer used. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Remove mongrp from CMT testIlpo Järvinen
The CMT selftest instantiates a monitor group to read LLC occupancy. Since the test also creates a control group, it is unnecessary to create another one for monitoring because control groups already provide monitoring too. Remove the unnecessary monitor group from the CMT selftest. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Remove mongrp from MBA testIlpo Järvinen
Nothing during MBA test uses mongrp even if it has been defined ever since the introduction of the MBA test in the commit 01fee6b4d1f9 ("selftests/resctrl: Add MBA test"). Remove the mongrp from MBA test. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Convert ctrlgrp & mongrp to pointersIlpo Järvinen
The struct resctrl_val_param has control and monitor groups as char arrays but they are not supposed to be mutated within resctrl_val(). Convert the ctrlgrp and mongrp char array within resctrl_val_param to plain const char pointers and adjust the strlen() based checks to check NULL instead. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Make some strings passed to resctrlfs functions constIlpo Järvinen
Control group, monitor group and resctrl_val are not mutated and should not be mutated within resctrlfs.c functions. Mark this by using const char * for the arguments. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Simplify bandwidth report type handlingIlpo Järvinen
bw_report is only needed for selecting the correct value from the values IMC measured. It is a member in the resctrl_val_param struct and is always set to "reads". The value is then checked in resctrl_val() using validate_bw_report_request() that besides validating the input, assumes it can mutate the string which is questionable programming practice. Simplify handling bw_report: - Convert validate_bw_report_request() into get_bw_report_type() that inputs and returns const char *. Use NULL to indicate error. - Validate the report types inside measure_mem_bw(), not in resctrl_val(). - Pass bw_report to measure_mem_bw() from ->measure() hook because resctrl_val() no longer needs bw_report for anything. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Add ->init() callback into resctrl_val_paramIlpo Järvinen
The struct resctrl_val_param is there to customize behavior inside resctrl_val() which is currently not used to full extent and there are number of strcmp()s for test name in resctrl_val done by resctrl_val(). Create ->init() hook into the struct resctrl_val_param to cleanly do per test initialization. Remove also unused branches to setup paths and the related #defines for CMT test. While touching kerneldoc, make the adjacent line consistent with the newly added form (callback vs call back). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Add ->measure() callback to resctrl_val_paramIlpo Järvinen
The measurement done in resctrl_val() varies depending on test type. The decision for how to measure is decided based on the string compare to test name which is quite inflexible. Add ->measure() callback into the struct resctrl_val_param to allow each test to provide necessary code as a function which simplifies what resctrl_val() has to do. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Simplify mem bandwidth file code for MBA & MBM testsIlpo Järvinen
initialize_mem_bw_resctrl() and set_mbm_path() contain complicated set of conditions, each yielding different file to be opened to measure memory bandwidth through resctrl FS. In practice, only two of them are used. For MBA test, ctrlgrp is always provided, and for MBM test both ctrlgrp and mongrp are set. The file used differ between MBA/MBM test, however, MBM test unnecessarily create monitor group because resctrl FS already provides monitoring interface underneath any ctrlgrp too, which is what the MBA selftest uses. Consolidate memory bandwidth file used to the one used by the MBA selftest. Remove all unused branches opening other files to simplify the code. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Rename measure_vals() to measure_mem_bw_vals() & documentIlpo Järvinen
measure_vals() is awfully generic name so rename it to measure_mem_bw() to describe better what it does and document the function parameters. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Cleanup bm_pid and ppid usage & limit scopeIlpo Järvinen
'bm_pid' and 'ppid' are global variables. As they are used by different processes and in signal handler, they cannot be entirely converted into local variables. The scope of those variables can still be reduced into resctrl_val.c only. As PARENT_EXIT() macro is using 'ppid', make it a function in resctrl_val.c and pass ppid to it as an argument because it is easier to understand than using the global variable directly. Pass 'bm_pid' into measure_vals() instead of relying on the global variable which helps to make the call signatures of measure_vals() and measure_llc_resctrl() more similar to each other. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Use correct type for pidsIlpo Järvinen
A few functions receive PIDs through int arguments. PIDs variables should be of type pid_t, not int. Convert pid arguments from int to pid_t. Before printing PID, match the type to %d by casting to int which is enough for Linux (standard would allow using a longer integer type but generalizing for that would complicate the code unnecessarily, the selftest code does not need to be portable). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Consolidate get_domain_id() into resctrl_val()Ilpo Järvinen
Both initialize_mem_bw_resctrl() and initialize_llc_occu_resctrl() that are called from resctrl_val() need to determine domain ID to construct resctrl fs related paths. Both functions do it by taking CPU ID which neither needs for any other purpose than determining the domain ID. Consolidate determining the domain ID into resctrl_val() and pass the domain ID instead of CPU ID to initialize_mem_bw_resctrl() and initialize_llc_occu_resctrl(). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Make "bandwidth" consistent in comments & printsIlpo Järvinen
Resctrl selftests refer to "bandwidth" currently in two other forms in the code ("B/W" and "band width"). Use "bandwidth" consistently everywhere. While at it, fix also one "over flow" -> "overflow" on a line that is touched by the change. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Calculate resctrl FS derived mem bw over sleep(1) onlyIlpo Järvinen
For MBM/MBA tests, measure_vals() calls get_mem_bw_imc() that performs the measurement over a duration of sleep(1) call. The memory bandwidth numbers from IMC are derived over this duration. The resctrl FS derived memory bandwidth, however, is calculated inside measure_vals() and only takes delta between the previous value and the current one which besides the actual test, also samples inter-test noise. Rework the logic in measure_vals() and get_mem_bw_imc() such that the resctrl FS memory bandwidth section covers much shorter duration closely matching that of the IMC perf counters to improve measurement accuracy. For the second read after rewind() to return a fresh value, also newline has to be consumed by the fscanf(). Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/resctrl: Fix closing IMC fds on error and open-code R+W instead of ↵Ilpo Järvinen
loops The imc perf fd close() calls are missing from all error paths. In addition, get_mem_bw_imc() handles fds in a for loop but close() is based on two fixed indexes READ and WRITE. Open code inner for loops to READ+WRITE entries for clarity and add a function to close() IMC fds properly in all cases. Fixes: 7f4d257e3a2a ("selftests/resctrl: Add callback to start a benchmark") Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/sched: fix code format issuesaigourensheng
There are extra spaces in the middle of #define. It is recommended to delete the spaces to make the code look more comfortable. Signed-off-by: aigourensheng <shechenglong001@gmail.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11selftests/lib.mk: silence some clang warnings that gcc already ignoresJohn Hubbard
gcc defaults to silence (off) for the following warnings, but clang defaults to the opposite. The warnings are not useful for the kernel itself, which is why they have remained disabled in gcc for the main kernel build. And it is only due to including kernel data structures in the selftests, that we get the warnings from clang. -Waddress-of-packed-member -Wgnu-variable-sized-type-not-at-end In other words, the warnings are not unique to the selftests: there is nothing that the selftests' code does that triggers these warnings, other than the act of including the kernel's data structures. Therefore, silence them for the clang builds as well. This eliminates warnings for the net/ and user_events/ kselftest subsystems, in these files: ./net/af_unix/scm_rights.c ./net/timestamping.c ./net/ipsec.c ./user_events/perf_test.c Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-07-11i40e: correct i40e_addr_to_hkey() name in kdocSimon Horman
Correct name of i40e_addr_to_hkey() in it's kdoc. kernel-doc -none reports: drivers/net/ethernet/intel/i40e/i40e.h:739: warning: expecting prototype for i40e_mac_to_hkey(). Prototype was for i40e_addr_to_hkey() instead Signed-off-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-07-11net: intel: Remove MODULE_AUTHORsTony Nguyen
We are moving away from the Sourceforge email address. Rather than removing or updating the email for the affected entries, remove the MODULE_AUTHOR altogether as its usage is incorrect [1]. Link: https://lore.kernel.org/netdev/20200626115236.7f36d379@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/ [1] Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Acked-by: Alexander Lobakin <aleksander.lobakin@intel.com> # libeth, libie Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-07-11net: pse-pd: pd692x0: Fix spelling mistake "availables" -> "available"Colin Ian King
There is a spelling mistake in a dev_err message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Acked-by: Kory Maincent <Kory.maincent@bootlin.com> Link: https://patch.msgid.link/20240709105222.168306-1-colin.i.king@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11ice: Add tracepoint for adding and removing switch rulesMarcin Szycik
Track the number of rules and recipes added to switch. Add a tracepoint to ice_aq_sw_rules(), which shows both rule and recipe count. This information can be helpful when designing a set of rules to program to the hardware, as it shows where the practical limit is. Actual limits are known (64 recipes, 32k rules), but it's hard to translate these values to how many rules the *user* can actually create, because of extra metadata being implicitly added, and recipe/rule chaining. Chaining combines several recipes/rules to create a larger recipe/rule, so one large rule added by the user might actually consume multiple rules from hardware perspective. Rule counter is simply incremented/decremented in ice_aq_sw_rules(), since all rules are added or removed via it. Counting recipes is harder, as recipes can't be removed (only overwritten). Recipes added via ice_aq_add_recipe() could end up being unused, when there is an error in later stages of rule creation. Instead, track the allocation and freeing of recipes, which should reflect the actual usage of recipes (if something fails after recipe(s) were created, caller should free them). Also, a number of recipes are loaded from NVM by default - initialize the recipe counter with the number of these recipes on switch initialization. Example configuration: cd /sys/kernel/tracing echo function > current_tracer echo ice_aq_sw_rules > set_ftrace_filter echo ice_aq_sw_rules > set_event echo 1 > tracing_on cat trace Example output: tc-4097 [069] ...1. 787.595536: ice_aq_sw_rules <-ice_rem_adv_rule tc-4097 [069] ..... 787.595705: ice_aq_sw_rules: rules=9 recipes=15 tc-4098 [057] ...1. 787.652033: ice_aq_sw_rules <-ice_add_adv_rule tc-4098 [057] ..... 787.652201: ice_aq_sw_rules: rules=10 recipes=16 Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-07-11ice: Remove unused members from switch APIMarcin Szycik
Remove several members of struct ice_sw_recipe and struct ice_prot_lkup_ext. Remove struct ice_recp_grp_entry and struct ice_pref_recipe_group, since they are now unused as well. All of the deleted members were only written to and never read, so it's pointless to keep them. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-07-11ice: Optimize switch recipe creationMarcin Szycik
Currently when creating switch recipes, switch ID is always added as the first word in every recipe. There are only 5 words in a recipe, so one word is always wasted. This is also true for the last recipe, which stores result indexes (in case of chain recipes). Therefore the maximum usable length of a chain recipe is 4 * 4 = 16 words. 4 words in a recipe, 4 recipes that can be chained (using a 5th one for result indexes). Current max size chained recipe: 0: smmmm 1: smmmm 2: smmmm 3: smmmm 4: srrrr Where: s - switch ID m - regular match (e.g. ipv4 src addr, udp dst port, etc.) r - result index Switch ID does not actually need to be present in every recipe, only in one of them (in case of chained recipe). This frees up to 8 extra words: 3 from recipes in the middle (because first recipe still needs to have switch ID), and 5 from one extra recipe (because now the last recipe also does not have switch ID, so it can chain 1 more recipe). Max size chained recipe after changes: 0: smmmm 1: Mmmmm 2: Mmmmm 3: Mmmmm 4: MMMMM 5: Rrrrr Extra usable words available after this change are highlighted with capital letters. Changing how switch ID is added is not straightforward, because it's not a regular lookup. Its FV index and mask can't be determined based on protocol + offset pair read from package and instead need to be added manually. Additionally, change how result indexes are added. Currently they are always inserted in a new recipe at the end. Example for 13 words, (with above optimization, switch ID being one of the words): 0: smmmm 1: mmmmm 2: mmmxx 3: rrrxx Where: x - unused word In this and some other cases, the result indexes can be moved just after last matches because there are unused words, saving one recipe. Example for 13 words after both optimizations: 0: smmmm 1: mmmmm 2: mmmrr Note how one less result index is needed in this case, because the last recipe does not need to "link" to itself. There are cases when adding an additional recipe for result indexes cannot be avoided. In that cases result indexes are all put in the last recipe. Example for 14 words after both optimizations: 0: smmmm 1: mmmmm 2: mmmmx 3: rrrxx With these two changes, recipes/rules are more space efficient, allowing more to be created in total. Co-developed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-07-11Merge tag 'net-6.10-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf and netfilter. Current release - regressions: - core: fix rc7's __skb_datagram_iter() regression Current release - new code bugs: - eth: bnxt: fix crashes when reducing ring count with active RSS contexts Previous releases - regressions: - sched: fix UAF when resolving a clash - skmsg: skip zero length skb in sk_msg_recvmsg2 - sunrpc: fix kernel free on connection failure in xs_tcp_setup_socket - tcp: avoid too many retransmit packets - tcp: fix incorrect undo caused by DSACK of TLP retransmit - udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port(). - eth: ks8851: fix deadlock with the SPI chip variant - eth: i40e: fix XDP program unloading while removing the driver Previous releases - always broken: - bpf: - fix too early release of tcx_entry - fail bpf_timer_cancel when callback is being cancelled - bpf: fix order of args in call to bpf_map_kvcalloc - netfilter: nf_tables: prefer nft_chain_validate - ppp: reject claimed-as-LCP but actually malformed packets - wireguard: avoid unaligned 64-bit memory accesses" * tag 'net-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (33 commits) net, sunrpc: Remap EPERM in case of connection failure in xs_tcp_setup_socket net/sched: Fix UAF when resolving a clash net: ks8851: Fix potential TX stall after interface reopen udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port(). netfilter: nf_tables: prefer nft_chain_validate netfilter: nfnetlink_queue: drop bogus WARN_ON ethtool: netlink: do not return SQI value if link is down ppp: reject claimed-as-LCP but actually malformed packets selftests/bpf: Add timer lockup selftest net: ethernet: mtk-star-emac: set mac_managed_pm when probing e1000e: fix force smbus during suspend flow tcp: avoid too many retransmit packets bpf: Defer work in bpf_timer_cancel_and_free bpf: Fail bpf_timer_cancel when callback is being cancelled bpf: fix order of args in call to bpf_map_kvcalloc net: ethernet: lantiq_etop: fix double free in detach i40e: Fix XDP program unloading while removing the driver net: fix rc7's __skb_datagram_iter() net: ks8851: Fix deadlock with the SPI chip variant octeontx2-af: Fix incorrect value output on error path in rvu_check_rsrc_availability() ...
2024-07-11Merge tag 'vfs-6.10-rc8.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "cachefiles: - Export an existing and add a new cachefile helper to be used in filesystems to fix reference count bugs - Use the newly added fscache_ty_get_volume() helper to get a reference count on an fscache_volume to handle volumes that are about to be removed cleanly - After withdrawing a fscache_cache via FSCACHE_CACHE_IS_WITHDRAWN wait for all ongoing cookie lookups to complete and for the object count to reach zero - Propagate errors from vfs_getxattr() to avoid an infinite loop in cachefiles_check_volume_xattr() because it keeps seeing ESTALE - Don't send new requests when an object is dropped by raising CACHEFILES_ONDEMAND_OJBSTATE_DROPPING - Cancel all requests for an object that is about to be dropped - Wait for the ondemand_boject_worker to finish before dropping a cachefiles object to prevent use-after-free - Use cyclic allocation for message ids to better handle id recycling - Add missing lock protection when iterating through the xarray when polling netfs: - Use standard logging helpers for debug logging VFS: - Fix potential use-after-free in file locks during trace_posix_lock_inode(). The tracepoint could fire while another task raced it and freed the lock that was requested to be traced - Only increment the nr_dentry_negative counter for dentries that are present on the superblock LRU. Currently, DCACHE_LRU_LIST list is used to detect this case. However, the flag is also raised in combination with DCACHE_SHRINK_LIST to indicate that dentry->d_lru is used. So checking only DCACHE_LRU_LIST will lead to wrong nr_dentry_negative count. Fix the check to not count dentries that are on a shrink related list Misc: - hfsplus: fix an uninitialized value issue in copy_name - minix: fix minixfs_rename with HIGHMEM. It still uses kunmap() even though we switched it to kmap_local_page() a while ago" * tag 'vfs-6.10-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: minixfs: Fix minixfs_rename with HIGHMEM hfsplus: fix uninit-value in copy_name vfs: don't mod negative dentry count when on shrinker list filelock: fix potential use-after-free in posix_lock_inode cachefiles: add missing lock protection when polling cachefiles: cyclic allocation of msg_id to avoid reuse cachefiles: wait for ondemand_object_worker to finish when dropping object cachefiles: cancel all requests for the object that is being dropped cachefiles: stop sending new request when dropping object cachefiles: propagate errors from vfs_getxattr() to avoid infinite loop cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie() cachefiles: fix slab-use-after-free in fscache_withdraw_volume() netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume() netfs: Switch debug logging to pr_debug()
2024-07-11tick/broadcast: Make takeover of broadcast hrtimer reliableYu Liao
Running the LTP hotplug stress test on a aarch64 machine results in rcu_sched stall warnings when the broadcast hrtimer was owned by the un-plugged CPU. The issue is the following: CPU1 (owns the broadcast hrtimer) CPU2 tick_broadcast_enter() // shutdown local timer device broadcast_shutdown_local() ... tick_broadcast_exit() clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT) // timer device is not programmed cpumask_set_cpu(cpu, tick_broadcast_force_mask) initiates offlining of CPU1 take_cpu_down() /* * CPU1 shuts down and does not * send broadcast IPI anymore */ takedown_cpu() hotplug_cpu__broadcast_tick_pull() // move broadcast hrtimer to this CPU clockevents_program_event() bc_set_next() hrtimer_start() /* * timer device is not programmed * because only the first expiring * timer will trigger clockevent * device reprogramming */ What happens is that CPU2 exits broadcast mode with force bit set, then the local timer device is not reprogrammed and CPU2 expects to receive the expired event by the broadcast IPI. But this does not happen because CPU1 is offlined by CPU2. CPU switches the clockevent device to ONESHOT state, but does not reprogram the device. The subsequent reprogramming of the hrtimer broadcast device does not program the clockevent device of CPU2 either because the pending expiry time is already in the past and the CPU expects the event to be delivered. As a consequence all CPUs which wait for a broadcast event to be delivered are stuck forever. Fix this issue by reprogramming the local timer device if the broadcast force bit of the CPU is set so that the broadcast hrtimer is delivered. [ tglx: Massage comment and change log. Add Fixes tag ] Fixes: 989dcb645ca7 ("tick: Handle broadcast wakeup of multiple cpus") Signed-off-by: Yu Liao <liaoyu15@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240711124843.64167-1-liaoyu15@huawei.com
2024-07-11dt-bindings: mmc: sdhci-sprd: convert to YAMLStanislav Jakubek
Covert the Spreadtrum SDHCI controller bindings to DT schema. Rename the file to match compatible. Drop assigned-* properties as these should not be needed. Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Stanislav Jakubek <stano.jakubek@gmail.com> Link: https://lore.kernel.org/r/ZozY+tOkzK9yfjbo@standask-GA-A55M-S2HP Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-07-11mmc: davinci_mmc: report all possible bus widthsBastien Curutchet
A dev_info() at probe's end() report the supported bus width. It never reports 8-bits width while the driver can handle it. Update the info message at then end of the probe to report the use of 8-bits data when needed. Signed-off-by: Bastien Curutchet <bastien.curutchet@bootlin.com> Link: https://lore.kernel.org/r/20240711081838.47256-3-bastien.curutchet@bootlin.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-07-11mmc: Merge branch fixes into nextUlf Hansson
Merge the mmc fixes for v6.10-rc[n] into the next branch, to allow them to get tested together with the new mmc changes that are targeted for v6.11. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-07-11ice: remove unused recipe bookkeeping dataMichal Swiatkowski
Remove root_buf from recipe struct. Its only usage was in ice_find_recp(), where if recipe had an inverse action, it was skipped, but actually the driver never adds inverse actions, so effectively it was pointless. Without root_buf, the recipe data element in ice_add_sw_recipe() does not need to be persistent and can also be automatically deallocated with __free, which nicely simplifies unroll. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-07-11bpf, arm64: Fix trampoline for BPF_TRAMP_F_CALL_ORIGPuranjay Mohan
When BPF_TRAMP_F_CALL_ORIG is set, the trampoline calls __bpf_tramp_enter() and __bpf_tramp_exit() functions, passing them the struct bpf_tramp_image *im pointer as an argument in R0. The trampoline generation code uses emit_addr_mov_i64() to emit instructions for moving the bpf_tramp_image address into R0, but emit_addr_mov_i64() assumes the address to be in the vmalloc() space and uses only 48 bits. Because bpf_tramp_image is allocated using kzalloc(), its address can use more than 48-bits, in this case the trampoline will pass an invalid address to __bpf_tramp_enter/exit() causing a kernel crash. Fix this by using emit_a64_mov_i64() in place of emit_addr_mov_i64() as it can work with addresses that are greater than 48-bits. Fixes: efc9909fdce0 ("bpf, arm64: Add bpf trampoline for arm64") Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Closes: https://lore.kernel.org/all/SJ0PR15MB461564D3F7E7A763498CA6A8CBDB2@SJ0PR15MB4615.namprd15.prod.outlook.com/ Link: https://lore.kernel.org/bpf/20240711151838.43469-1-puranjay@kernel.org
2024-07-11ice: Simplify bitmap setting in adding recipeMichal Swiatkowski
Remove unnecessary size checks when copying bitmaps in ice_add_sw_recipe() and replace them with compile time assert. Check if the bitmaps are equal size, as they are copied both ways. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Co-developed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-07-11ice: Remove reading all recipes before adding a new oneMichal Swiatkowski
The content of the first read recipe is used as a template when adding a recipe. It isn't needed - only prune index is directly set from there. Set it in the code instead. Also, now there's no need to set rid and lookup indexes to 0, as the whole recipe buffer is initialized to 0. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-07-11mmc: davinci_mmc: Prevent transmitted data size from exceeding sgm's lengthBastien Curutchet
No check is done on the size of the data to be transmiited. This causes a kernel panic when this size exceeds the sg_miter's length. Limit the number of transmitted bytes to sgm->length. Cc: stable@vger.kernel.org Fixes: ed01d210fd91 ("mmc: davinci_mmc: Use sg_miter for PIO") Signed-off-by: Bastien Curutchet <bastien.curutchet@bootlin.com> Link: https://lore.kernel.org/r/20240711081838.47256-2-bastien.curutchet@bootlin.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-07-11mmc: sdhci: Fix max_seg_size for 64KiB PAGE_SIZEAdrian Hunter
blk_queue_max_segment_size() ensured: if (max_size < PAGE_SIZE) max_size = PAGE_SIZE; whereas: blk_validate_limits() makes it an error: if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE)) return -EINVAL; The change from one to the other, exposed sdhci which was setting maximum segment size too low in some circumstances. Fix the maximum segment size when it is too low. Fixes: 616f87661792 ("mmc: pass queue_limits to blk_mq_alloc_disk") Cc: stable@vger.kernel.org Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Link: https://lore.kernel.org/r/20240710180737.142504-1-adrian.hunter@intel.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-07-11ice: Remove unused struct ice_prot_lkup_ext membersMarcin Szycik
Remove field_off as it's never used. Remove done bitmap, as its value is only checked and never assigned. Reusing sub-recipes while creating new root recipes is currently not supported in the driver. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-07-11fixmap: Remove unused set_fixmap_offset_io()Steven Price
The macro set_fixmap_offset_io() was added in commit f774b7d10e21 ("arm64: fixmap: fix missing sub-page offset for earlyprintk") but then commit 8ef0ed95ee04 ("arm64: remove arch specific earlyprintk") removed the file causing the only user to be removed when the two commits were merged. Since this has never been used again since the v3.15 release remove it. Signed-off-by: Steven Price <steven.price@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-07-11drm/xe/display/xe_hdcp_gsc: Free arbiter on driver removalNirmoy Das
Free arbiter allocated in intel_hdcp_gsc_init(). Fixes: 152f2df954d8 ("drm/xe/hdcp: Enable HDCP for XE") Cc: Suraj Kandpal <suraj.kandpal@intel.com> Cc: Arun R Murthy <arun.r.murthy@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240708125918.23573-1-nirmoy.das@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> (cherry picked from commit 33891539f9d6f245e93a76e3fb5791338180374f) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-07-11drm/xe: Use write-back caching mode for system memory on DGFXThomas Hellström
The caching mode for buffer objects with VRAM as a possible placement was forced to write-combined, regardless of placement. However, write-combined system memory is expensive to allocate and even though it is pooled, the pool is expensive to shrink, since it involves global CPU TLB flushes. Moreover write-combined system memory from TTM is only reliably available on x86 and DGFX doesn't have an x86 restriction. So regardless of the cpu caching mode selected for a bo, internally use write-back caching mode for system memory on DGFX. Coherency is maintained, but user-space clients may perceive a difference in cpu access speeds. v2: - Update RB- and Ack tags. - Rephrase wording in xe_drm.h (Matt Roper) v3: - Really rephrase wording. Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Fixes: 622f709ca629 ("drm/xe/uapi: Add support for CPU caching mode") Cc: Pallavi Mishra <pallavi.mishra@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: dri-devel@lists.freedesktop.org Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Effie Yu <effie.yu@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Jose Souza <jose.souza@intel.com> Cc: Michal Mrozek <michal.mrozek@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Acked-by: Matthew Auld <matthew.auld@intel.com> Acked-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Fixes: 622f709ca629 ("drm/xe/uapi: Add support for CPU caching mode") Acked-by: Michal Mrozek <michal.mrozek@intel.com> Acked-by: Effie Yu <effie.yu@intel.com> #On chat Link: https://patchwork.freedesktop.org/patch/msgid/20240705132828.27714-1-thomas.hellstrom@linux.intel.com (cherry picked from commit 01e0cfc994be484ddcb9e121e353e51d8bb837c0) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-07-11Merge tag 'asoc-fix-v6.10-rc7' of ↵Takashi Iwai
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus ASoC: Fixes for v6.10 A few fairly small fixes for ASoC, there's a relatively large set of hardening changes for the cs_dsp firmware file parsing and a couple of other small device specific fixes.
2024-07-11btrfs: avoid races when tracking progress for extent map shrinkingFilipe Manana
We store the progress (root and inode numbers) of the extent map shrinker in fs_info without any synchronization but we can have multiple tasks calling into the shrinker during memory allocations when there's enough memory pressure for example. This can result in a task A reading fs_info->extent_map_shrinker_last_ino after another task B updates it, and task A reading fs_info->extent_map_shrinker_last_root before task B updates it, making task A see an odd state that isn't necessarily harmful but may make it skip certain inode ranges or do more work than necessary by going over the same inodes again. These unprotected accesses would also trigger warnings from tools like KCSAN. So add a lock to protect access to these progress fields. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: stop extent map shrinker if reschedule is neededFilipe Manana
The extent map shrinker can be called in a variety of contexts where we are under memory pressure, and of them is when a task is trying to allocate memory. For this reason the shrinker is typically called with a value of struct shrink_control::nr_to_scan that is much smaller than what we return in the nr_cached_objects callback of struct super_operations (fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does not take a long time and cause high latencies. However we can still take a lot of time in the shrinker even for a limited amount of nr_to_scan: 1) When traversing the red black tree that tracks open inodes in a root, as for example with millions of open inodes we get a deep tree which takes time searching for an inode; 2) Iterating over the extent map tree, which is a red black tree, of an inode when doing the rb_next() calls and when removing an extent map from the tree, since often that requires rebalancing the red black tree; 3) When trying to write lock an inode's extent map tree we may wait for a significant amount of time, because there's either another task about to do IO and searching for an extent map in the tree or inserting an extent map in the tree, and we can have thousands or even millions of extent maps for an inode. Furthermore, there can be concurrent calls to the shrinker so the lock might be busy simply because there is already another task shrinking extent maps for the same inode; 4) We often reschedule if we need to, which further increases latency. So improve on this by stopping the extent map shrinking code whenever we need to reschedule and make it skip an inode if we can't immediately lock its extent map tree. Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: use delayed iput during extent map shrinkingFilipe Manana
When putting an inode during extent map shrinking we're doing a standard iput() but that may take a long time in case the inode is dirty and we are doing the final iput that triggers eviction - the VFS will have to wait for writeback before calling the btrfs evict callback (see fs/inode.c:evict()). This slows down the task running the shrinker which may have been triggered while updating some tree for example, meaning locks are held as well as an open transaction handle. Also if the iput() ends up triggering eviction and the inode has no links anymore, then we trigger item truncation which requires flushing delayed items, space reservation to start a transaction and that may trigger the space reclaim task and wait for it, resulting in deadlocks in case the reclaim task needs for example to commit a transaction and the shrinker is being triggered from a path holding a transaction handle. Syzbot reported such a case with the following stack traces: ====================================================== WARNING: possible circular locking dependency detected 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Not tainted ------------------------------------------------------ kswapd0/111 is trying to acquire lock: ffff88801eae4610 (sb_internal#3){.+.+}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 but task is already holding lock: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (fs_reclaim){+.+.}-{0:0}: __fs_reclaim_acquire mm/page_alloc.c:3783 [inline] fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3797 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3890 [inline] slab_alloc_node mm/slub.c:3980 [inline] kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4019 btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411 alloc_inode+0x5d/0x230 fs/inode.c:261 iget5_locked fs/inode.c:1235 [inline] iget5_locked+0x1c9/0x2c0 fs/inode.c:1228 btrfs_iget_locked fs/btrfs/inode.c:5590 [inline] btrfs_iget_path fs/btrfs/inode.c:5607 [inline] btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636 create_reloc_inode+0x403/0x820 fs/btrfs/relocation.c:3911 btrfs_relocate_block_group+0x471/0xe60 fs/btrfs/relocation.c:4114 btrfs_relocate_chunk+0x143/0x450 fs/btrfs/volumes.c:3373 __btrfs_balance fs/btrfs/volumes.c:4157 [inline] btrfs_balance+0x211a/0x3f00 fs/btrfs/volumes.c:4534 btrfs_ioctl_balance fs/btrfs/ioctl.c:3675 [inline] btrfs_ioctl+0x12ed/0x8290 fs/btrfs/ioctl.c:4742 __do_compat_sys_ioctl+0x2c3/0x330 fs/ioctl.c:1007 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}: join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #1 (btrfs_trans_num_writers){++++}-{0:0}: join_transaction+0x148/0xf40 fs/btrfs/transaction.c:314 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #0 (sb_internal#3){.+.+}-{0:0}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 other info that might help us debug this: Chain exists of: sb_internal#3 --> btrfs_trans_num_extwriters --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(btrfs_trans_num_extwriters); lock(fs_reclaim); rlock(sb_internal#3); *** DEADLOCK *** 2 locks held by kswapd0/111: #0: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_trylock_shared fs/super.c:562 [inline] #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_cache_scan+0x96/0x550 fs/super.c:196 stack backtrace: CPU: 0 PID: 111 Comm: kswapd0 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114 check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> So fix this by using btrfs_add_delayed_iput() so that the final iput is delegated to the cleaner kthread. Link: https://lore.kernel.org/linux-btrfs/000000000000892280061a344581@google.com/ Reported-by: syzbot+3dad89b3993a4b275e72@syzkaller.appspotmail.com Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>