git.armlinux.org.uk/linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2023-01-31	riscv: Fix build with CONFIG_CC_OPTIMIZE_FOR_SIZE=y	Samuel Holland
	commit 8eb060e10185 ("arch/riscv: add Zihintpause support") broke building with CONFIG_CC_OPTIMIZE_FOR_SIZE enabled (gcc 11.1.0): CC arch/riscv/kernel/vdso/vgettimeofday.o In file included from <command-line>: ./arch/riscv/include/asm/jump_label.h: In function 'cpu_relax': ././include/linux/compiler_types.h:285:33: warning: 'asm' operand 0 probably does not match constraints 285 \| #define asm_volatile_goto(x...) asm goto(x) \| ^~~ ./arch/riscv/include/asm/jump_label.h:41:9: note: in expansion of macro 'asm_volatile_goto' 41 \| asm_volatile_goto( \| ^~~~~~~~~~~~~~~~~ ././include/linux/compiler_types.h:285:33: error: impossible constraint in 'asm' 285 \| #define asm_volatile_goto(x...) asm goto(x) \| ^~~ ./arch/riscv/include/asm/jump_label.h:41:9: note: in expansion of macro 'asm_volatile_goto' 41 \| asm_volatile_goto( \| ^~~~~~~~~~~~~~~~~ make[1]: * [scripts/Makefile.build:249: arch/riscv/kernel/vdso/vgettimeofday.o] Error 1 make: * [arch/riscv/Makefile:128: vdso_prepare] Error 2 Having a static branch in cpu_relax() is problematic because that function is widely inlined, including in some quite complex functions like in the VDSO. A quick measurement shows this static branch is responsible by itself for around 40% of the jump table. Drop the static branch, which ends up being the same number of instructions anyway. If Zihintpause is supported, we trade the nop from the static branch for a div. If Zihintpause is unsupported, we trade the jump from the static branch for (what gets interpreted as) a nop. Fixes: 8eb060e10185 ("arch/riscv: add Zihintpause support") Signed-off-by: Samuel Holland <samuel@sholland.org> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-01-31	Merge branch 'net-ipa-remaining-ipa-v5-0-support'	Jakub Kicinski
	Alex Elder says: ==================== net: ipa: remaining IPA v5.0 support This series includes almost all remaining IPA code changes required to support IPA v5.0. IPA register definitions and configuration data for IPA v5.0 will be sent later (soon). Note that the GSI register definitions still require work. GSI for IPA v5.0 supports up to 256 (rather than 32) channels, and this changes the way GSI register offsets are calculated. A few GSI register fields also change. The first patch in this series increases the number of IPA endpoints supported by the driver, from 32 to 36. The next updates the width of the destination field for the IP_PACKET_INIT immediate command so it can represent up to 256 endpoints rather than just 32. The next adds a few definitions of some IPA registers and fields that are first available in IPA v5.0. The next two patches update the code that handles router and filter table caches. Previously these were referred to as "hashed" tables, and the IPv4 and IPv6 tables are now combined into one "unified" table. The sixth and seventh patches add support for a new pulse generator, which allows time periods to be specified with a wider range of clock resolution. And the last patch just defines two new memory regions that were not previously used. ==================== Link: https://lore.kernel.org/r/20230130210158.4126129-1-elder@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	net: ipa: define two new memory regions	Alex Elder
	IPA v5.0 uses two memory regions not previously used. Define them and treat them as valid only for IPA v5.0. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	net: ipa: support a third pulse register	Alex Elder
	The AP has third pulse generator available starting with IPA v5.0. Redefine ipa_qtime_val() to support that possibility. Pass the IPA pointer as an argument so the version can be determined. And stop using the sign of the returned tick count to indicate which of two pulse generators to use. Instead, have the caller provide the address of a variable that will hold the selected pulse generator for the Qtime value. And for version 5.0, check whether the third pulse generator best represents the time period. Add code in ipa_qtime_config() to configure the fourth pulse generator for IPA v5.0+; in that case configure both the third and fourth pulse generators to use 10 msec granularity. Consistently use "ticks" for local variables that represent a tick count. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	net: ipa: greater timer granularity options	Alex Elder
	Starting with IPA v5.0, the head-of-line blocking timer has more than two pulse generators available to define timer granularity. To prepare for that, change the way the field value is encoded to use ipa_reg_encode() rather than ipa_reg_bit(). The aggregation granularity selection could (in principle) also use an additional pulse generator starting with IPA v5.0. Encode the AGGR_GRAN_SEL field differently to allow that as well. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	net: ipa: support zeroing new cache tables	Alex Elder
	IPA v5.0+ separates the configuration of entries in the cached (previously "hashed") routing and filtering tables into distinct registers. Previously a single "filter and router" register updated entries in both tables at once; now the routing and filter table caches have separate registers that define their content. This patch updates the code that zeroes entries in the cached filter and router tables to support IPA versions including v5.0+. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	net: ipa: update table cache flushing	Alex Elder
	Update the code that causes filter and router table caches to be flushed so that it supports IPA versions 5.0+. It adds a comment in ipa_hardware_config_hashing() that explains that cacheing does not need to be enabled, just as before, because it's enabled by default. (For the record, the FILT_ROUT_CACHE_CFG register would have been used if we wanted to explicitly enable these.) Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	net: ipa: define IPA v5.0+ registers	Alex Elder
	Define some new registers that appear starting with IPA v5.0, along with enumerated types identifying their fields. Code that uses these will be added by upcoming patches. Most of the new registers are related to filter and routing tables, and in particular, their "hashed" variant. These tables are better described as "cached", where a hash value determines which entries are cached. From now on, naming related to this functionality will use "cache" instead of "hash", and that is reflected in these new register names. Some registers for managing these caches and their contents have changed as well. A few other new field definitions for registers (unrelated to table caches) are also defined. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	net: ipa: extend endpoints in packet init command	Alex Elder
	The IP_PACKET_INIT immediate command defines the destination endpoint to which a packet should be sent. Prior to IPA v5.0, a 5 bit field in that command represents the endpoint, but starting with IPA v5.0, the field is extended to 8 bits to support more than 32 endpoints. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	net: ipa: support more endpoints	Alex Elder
	Increase the number of endpoints supported by the driver to 36, which IPA v5.0 supports. This makes it impossible to check at build time whether the supported number is too big to fit within the (5-bit) PACKET_INIT destination endpoint field. Instead, convert the build time check to compare against what fits in 8 bits. Add a check in ipa_endpoint_config() to also ensure the hardware reports an endpoint count that's in the expected range. Just open-code 32 as the limit (the PACKET_INIT field mask is not available where we'd want to use it). Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	Merge tag 'mlx5-updates-2023-01-30' of ↵	Jakub Kicinski
	git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2023-01-30 Add fast update encryption key Jianbo Liu Says: ================ Data encryption keys (DEKs) are the keys used for data encryption and decryption operations. Starting from version 22.33.0783, firmware is optimized to accelerate the update of user keys into DEK object in hardware. The support for bulk allocation and destruction of DEK objects is added, and the bulk allocated DEKs are uninitialized, as the bulk creation requires no input key. When offload encryption/decryption, user gets one object from a bulk, and updates key by a new "modify DEK" command. This command is the same as create DEK object, but requires no heavy context memory allocation in firmware, which consumes most cpu cycles of the create DEK command. DEKs are cached internally by the NIC, so invalidating internal NIC caches is required before reusing DEKs. The SYNC_CRYPTO command is added to support it. DEK object can be reused, the keys in it can be updated after this command is executed. This patchset enhances the key creation and destruction flow, to get use of this new feature. Any user, for example, ktls, ipsec and macsec, can use it to offload keys. But, only ktls uses it, as others don't need many keys, and caching two many DEKs in pool is wasteful. There are two new data struts added: a. DEK pool. One pool is created for each key type. The bulks by the type, are placed in the pool's different bulk lists, according to the number of available and in_used DEKs in the bulk. b. DEK bulk. All DEKs in one bulk allocation are store here. There are two bitmaps to indicate the state of each DEK. New APIs are then added. When user need a DEK object, a. Fetch one bulk with avail DEKs, from the partial_list or avail_list, otherwise create new one. b. Pick one DEK, and set its need_sync and in_used bits to 1. Move the bulk to full_list if no more available keys, or put it to partial_list if the bulk is newly created. c. Update DEK object's key with user key, by the "modify DEK" command. d. Return DEK struct to user, then it gets the object id and fills it into the offload commands. When user free a DEK, a. Set in_use bit to 0. If all need_sync bits are 1 and all in_use bits of this bulk are 0, move it to sync_list. b. If the number of DEKs, which are freed by users, is over the threshold (128), schedule a workqueue to do the sync process. For the sync process, the SYNC_CRYPTO command is executed first. Then, for each bulks in partial_list, full_list and sync_list, reset need_sync bits of the freed DEK objects. If all need_sync bits in one bulk are zero, move it to avail_list. We already supported TIS pool to recycle the TISes. With this series and TIS pool, TLS CPS performance is improved greatly. And we tested https on the system: CPU: dual AMD EPYC 7763 64-Core processors RAM: 512G DEV: ConnectX-6 DX, with FW ver 22.33.0838 and TLS_OPTIMISE=true TLS CPS performance numbers are: Before: 11k connections/sec After: 101 connections/sec ================ * tag 'mlx5-updates-2023-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5e: kTLS, Improve connection rate by using fast update encryption key net/mlx5: Keep only one bulk of full available DEKs net/mlx5: Add async garbage collector for DEK bulk net/mlx5: Reuse DEKs after executing SYNC_CRYPTO command net/mlx5: Use bulk allocation for fast update encryption key net/mlx5: Add bulk allocation and modify_dek operation net/mlx5: Add support SYNC_CRYPTO command net/mlx5: Add new APIs for fast update encryption key net/mlx5: Refactor the encryption key creation net/mlx5: Add const to the key pointer of encryption key creation net/mlx5: Prepare for fast crypto key update if hardware supports it net/mlx5: Change key type to key purpose net/mlx5: Add IFC bits and enums for crypto key net/mlx5: Add IFC bits for general obj create param net/mlx5: Header file for crypto ==================== Link: https://lore.kernel.org/r/20230131031201.35336-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf	Jakub Kicinski
	Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Release bridge info once packet escapes the br_netfilter path, from Florian Westphal. 2) Revert incorrect fix for the SCTP connection tracking chunk iterator, also from Florian. First path fixes a long standing issue, the second path addresses a mistake in the previous pull request for net. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: Revert "netfilter: conntrack: fix bug in for_each_sctp_chunk" netfilter: br_netfilter: disable sabotage_in hook after first suppression ==================== Link: https://lore.kernel.org/r/20230131133158.4052-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	net: phy: meson-gxl: Add generic dummy stubs for MMD register access	Chris Healy
	The Meson G12A Internal PHY does not support standard IEEE MMD extended register access, therefore add generic dummy stubs to fail the read and write MMD calls. This is necessary to prevent the core PHY code from erroneously believing that EEE is supported by this PHY even though this PHY does not support EEE, as MMD register access returns all FFFFs. Fixes: 5c3407abb338 ("net: phy: meson-gxl: add g12a support") Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Chris Healy <healych@amazon.com> Reviewed-by: Jerome Brunet <jbrunet@baylibre.com> Link: https://lore.kernel.org/r/20230130231402.471493-1-cphealy@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	net: fix NULL pointer in skb_segment_list	Yan Zhai
	Commit 3a1296a38d0c ("net: Support GRO/GSO fraglist chaining.") introduced UDP listifyed GRO. The segmentation relies on frag_list being untouched when passing through the network stack. This assumption can be broken sometimes, where frag_list itself gets pulled into linear area, leaving frag_list being NULL. When this happens it can trigger following NULL pointer dereference, and panic the kernel. Reverse the test condition should fix it. [19185.577801][ C1] BUG: kernel NULL pointer dereference, address: ... [19185.663775][ C1] RIP: 0010:skb_segment_list+0x1cc/0x390 ... [19185.834644][ C1] Call Trace: [19185.841730][ C1] <TASK> [19185.848563][ C1] __udp_gso_segment+0x33e/0x510 [19185.857370][ C1] inet_gso_segment+0x15b/0x3e0 [19185.866059][ C1] skb_mac_gso_segment+0x97/0x110 [19185.874939][ C1] __skb_gso_segment+0xb2/0x160 [19185.883646][ C1] udp_queue_rcv_skb+0xc3/0x1d0 [19185.892319][ C1] udp_unicast_rcv_skb+0x75/0x90 [19185.900979][ C1] ip_protocol_deliver_rcu+0xd2/0x200 [19185.910003][ C1] ip_local_deliver_finish+0x44/0x60 [19185.918757][ C1] __netif_receive_skb_one_core+0x8b/0xa0 [19185.927834][ C1] process_backlog+0x88/0x130 [19185.935840][ C1] __napi_poll+0x27/0x150 [19185.943447][ C1] net_rx_action+0x27e/0x5f0 [19185.951331][ C1] ? mlx5_cq_tasklet_cb+0x70/0x160 [mlx5_core] [19185.960848][ C1] __do_softirq+0xbc/0x25d [19185.968607][ C1] irq_exit_rcu+0x83/0xb0 [19185.976247][ C1] common_interrupt+0x43/0xa0 [19185.984235][ C1] asm_common_interrupt+0x22/0x40 ... [19186.094106][ C1] </TASK> Fixes: 3a1296a38d0c ("net: Support GRO/GSO fraglist chaining.") Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/Y9gt5EUizK1UImEP@debian Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	net: fman: memac: free mdio device if lynx_pcs_create() fails	Vladimir Oltean
	When memory allocation fails in lynx_pcs_create() and it returns NULL, there remains a dangling reference to the mdiodev returned by of_mdio_find_device() which is leaked as soon as memac_pcs_create() returns empty-handed. Fixes: a7c2a32e7f22 ("net: fman: memac: Use lynx pcs driver") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Sean Anderson <sean.anderson@seco.com> Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com> Link: https://lore.kernel.org/r/20230130193051.563315-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	Merge branch '10GbE' of ↵	Jakub Kicinski
	git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN: Remove redundant Device Control Error Reporting Enable Bjorn Helgaas says: Since f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is native"), the PCI core sets the Device Control bits that enable error reporting for PCIe devices. This series removes redundant calls to pci_enable_pcie_error_reporting() that do the same thing from several NIC drivers. There are several more drivers where this should be removed; I started with just the Intel drivers here. * '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ixgbe: Remove redundant pci_enable_pcie_error_reporting() igc: Remove redundant pci_enable_pcie_error_reporting() igb: Remove redundant pci_enable_pcie_error_reporting() ice: Remove redundant pci_enable_pcie_error_reporting() iavf: Remove redundant pci_enable_pcie_error_reporting() i40e: Remove redundant pci_enable_pcie_error_reporting() fm10k: Remove redundant pci_enable_pcie_error_reporting() e1000e: Remove redundant pci_enable_pcie_error_reporting() ==================== Link: https://lore.kernel.org/r/20230130192519.686446-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	Merge branch 'selftests-mlxsw-convert-to-iproute2-dcb'	Jakub Kicinski
	Petr Machata says: ==================== selftests: mlxsw: Convert to iproute2 dcb There is a dedicated tool for configuration of DCB in iproute2. Use it in the selftests instead of lldpad. Patches #1-#3 convert three tests. Patch #4 drops the now-unnecessary lldpad helpers. ==================== Link: https://lore.kernel.org/r/cover.1675096231.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	selftests: net: forwarding: lib: Drop lldpad_app_wait_set(), _del()	Petr Machata
	The existing users of these helpers have been converted to iproute2 dcb. Drop the helpers. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	selftests: mlxsw: qos_defprio: Convert from lldptool to dcb	Petr Machata
	Set up default port priority through the iproute2 dcb tool, which is easier to understand and manage. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	selftests: mlxsw: qos_dscp_router: Convert from lldptool to dcb	Petr Machata
	Set up DSCP prioritization through the iproute2 dcb tool, which is easier to understand and manage. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	selftests: mlxsw: qos_dscp_bridge: Convert from lldptool to dcb	Petr Machata
	Set up DSCP prioritization through the iproute2 dcb tool, which is easier to understand and manage. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	sctp: do not check hb_timer.expires when resetting hb_timer	Xin Long
	It tries to avoid the frequently hb_timer refresh in commit ba6f5e33bdbb ("sctp: avoid refreshing heartbeat timer too often"), and it only allows mod_timer when the new expires is after hb_timer.expires. It means even a much shorter interval for hb timer gets applied, it will have to wait until the current hb timer to time out. In sctp_do_8_2_transport_strike(), when a transport enters PF state, it expects to update the hb timer to resend a heartbeat every rto after calling sctp_transport_reset_hb_timer(), which will not work as the change mentioned above. The frequently hb_timer refresh was caused by sctp_transport_reset_timers() called in sctp_outq_flush() and it was already removed in the commit above. So we don't have to check hb_timer.expires when resetting hb_timer as it is now not called very often. Fixes: ba6f5e33bdbb ("sctp: avoid refreshing heartbeat timer too often") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Link: https://lore.kernel.org/r/d958c06985713ec84049a2d5664879802710179a.1675095933.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	Merge branch 'net-mdio-add-amlogic-gxl-mdio-mux-support'	Jakub Kicinski
	Jerome Brunet says: ==================== net: mdio: add amlogic gxl mdio mux support Add support for the MDIO multiplexer found in the Amlogic GXL SoC family. This multiplexer allows to choose between the external (SoC pins) MDIO bus, or the internal one leading to the integrated 10/100M PHY. This multiplexer has been handled with the mdio-mux-mmioreg generic driver so far. When it was added, it was thought the logic was handled by a single register. It turns out more than a single register need to be properly set. As long as the device is using the Amlogic vendor bootloader, or upstream u-boot with net support, it is working fine since the kernel is inheriting the bootloader settings. Without net support in the bootloader, this glue comes unset in the kernel and only the external path may operate properly. With this driver (and the associated change in arch/arm64/boot/dts/amlogic/meson-gxl.dtsi), the kernel no longer relies on the bootloader to set things up, fixing the problem. ==================== Link: https://lore.kernel.org/r/20230130151616.375168-1-jbrunet@baylibre.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	net: mdio: add amlogic gxl mdio mux support	Jerome Brunet
	Add support for the mdio mux and internal phy glue of the GXL SoC family Reported-by: Da Xue <da@lessconfused.com> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	dt-bindings: net: add amlogic gxl mdio multiplexer	Jerome Brunet
	Add documentation for the MDIO bus multiplexer found on the Amlogic GXL SoC family Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	Merge branch 'tools-ynl-more-docs-and-basic-ethtool-support'	Jakub Kicinski
	Jakub Kicinski says: ==================== tools: ynl: more docs and basic ethtool support I got discouraged from supporting ethtool in specs, because generating the user space C code seems a little tricky. The messages are ID'ed in a "directional" way (to and from kernel are separate ID "spaces"). There is value, however, in having the spec and being able to for example use it in Python. After paying off some technical debt - add a partial ethtool spec. Partial because the header for ethtool is almost a 1000 LoC, so converting in one sitting is tough. But adding new commands should be trivial now. Last but not least I add more docs, I realized that I've been sending a similar "instructions" email to people working on new families. It's now intro-specs.rst. ==================== Link: https://lore.kernel.org/r/20230131023354.1732677-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	tools: net: use python3 explicitly	Jakub Kicinski
	The scripts require Python 3 and some distros are dropping Python 2 support. Reported-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	docs: netlink: add a starting guide for working with specs	Jakub Kicinski
	We have a bit of documentation about the internals of Netlink and the specs, but really the goal is for most people to not worry about those. Add a practical guide for beginners who want to poke at the specs. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	netlink: specs: add partial specification for ethtool	Jakub Kicinski
	Ethtool is one of the most actively developed families. With the changes to the CLI it should be possible to use the YNL based code for easy prototyping and development. Add a partial family definition. I've tested the string set and rings. I don't have any MAC Merge implementation to test with, but I added the definition for it, anyway, because it's last. New commands can simply be added at the end without having to worry about manually providing IDs / values. Set (with notification support - None is the response, the data is from the notification): $ sudo ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/ethtool.yaml \ --do rings-set \ --json '{"header":{"dev-name":"enp0s31f6"}, "rx":129}' \ --subscribe monitor None [{'msg': {'header': {'dev-index': 2, 'dev-name': 'enp0s31f6'}, 'rx': 136, 'rx-max': 4096, 'tx': 256, 'tx-max': 4096, 'tx-push': 0}, 'name': 'rings-ntf'}] Do / dump (yes, the kernel requires that even for dump and even if empty - the "header" nest must be there): $ ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/ethtool.yaml \ --do rings-get \ --json '{"header":{"dev-index": 2}}' {'header': {'dev-index': 2, 'dev-name': 'enp0s31f6'}, 'rx': 136, 'rx-max': 4096, 'tx': 256, 'tx-max': 4096, 'tx-push': 0} $ ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/ethtool.yaml \ --dump rings-get \ --json '{"header":{}}' [{'header': {'dev-index': 2, 'dev-name': 'enp0s31f6'}, 'rx': 136, 'rx-max': 4096, 'tx': 256, 'tx-max': 4096, 'tx-push': 0}, {'header': {'dev-index': 3, 'dev-name': 'wlp0s20f3'}, 'tx-push': 0}, {'header': {'dev-index': 19, 'dev-name': 'enp58s0u1u1'}, 'rx': 100, 'rx-max': 4096, 'tx-push': 0}] And error reporting: $ ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/ethtool.yaml \ --dump rings-get \ --json '{"header":{"flags":5}}' Netlink error: Invalid argument nl_len = 68 (52) nl_flags = 0x300 nl_type = 2 error: -22 extack: {'msg': 'reserved bit set', 'bad-attr-offs': 24, 'bad-attr': '.header.flags'} None Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	netlink: specs: finish up operation enum-models	Jakub Kicinski
	I had a (bright?) idea of introducing the concept of enum-models to account for all the weird ways families enumerate their messages. I've never finished it because generating C code for each of them is pretty daunting. But for languages which can use ID values directly the support is simple enough, so clean this up a bit. "unified" model is what I recommend going forward. "directional" model is what ethtool uses. "notify-split" is used by the proposed DPLL code, but we can just make them use "unified", it hasn't been merged :) Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	tools: ynl: load jsonschema on demand	Jakub Kicinski
	The CLI script tries to validate jsonschema by default. It's seems better to validate too many times than too few. However, when copying the scripts to random servers having to install jsonschema is tedious. Load jsonschema via importlib, and let the user opt out. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	tools: ynl: use operation names from spec on the CLI	Jakub Kicinski
	When I wrote the first version of the Python code I was quite excited that we can generate class methods directly from the spec. Unfortunately we need to use valid identifiers for method names (specifically no dashes are allowed). Don't reuse those names on the CLI, it's much more natural to use the operation names exactly as listed in the spec. Instead of: ./cli --do rings_get use: ./cli --do rings-get Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	tools: ynl: support pretty printing bad attribute names	Jakub Kicinski
	One of my favorite features of the Netlink specs is that they make decoding structured extack a ton easier. Implement pretty printing bad attribute names in YNL. For example it will now say: 'bad-attr': '.header.flags' rather than the useless: 'bad-attr-offs': 32 Proof: $ ./cli.py --spec ethtool.yaml --do rings_get \ --json '{"header":{"dev-index":1, "flags":4}}' Netlink error: Invalid argument nl_len = 68 (52) nl_flags = 0x300 nl_type = 2 error: -22 extack: {'msg': 'reserved bit set', 'bad-attr': '.header.flags'} Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	tools: ynl: support multi-attr	Jakub Kicinski
	Ethtool uses mutli-attr, add the support to YNL. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	tools: ynl: support directional enum-model in CLI	Jakub Kicinski
	Support families which use different IDs for messages to and from the kernel. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	tools: ynl: add support for types needed by ethtool	Jakub Kicinski
	Ethtool needs support for handful of extra types. It doesn't have the definitions section yet. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	tools: ynl: use the common YAML loading and validation code	Jakub Kicinski
	Adapt the common object hierarchy in code gen and CLI. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	tools: ynl: add an object hierarchy to represent parsed spec	Jakub Kicinski
	There's a lot of copy and pasting going on between the "cli" and code gen when it comes to representing the parsed spec. Create a library which both can use. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	tools: ynl: move the cli and netlink code around	Jakub Kicinski
	Move the CLI code out of samples/ and the library part of it into tools/net/ynl/lib/. This way we can start sharing some code with the code gen. Initially I thought that code gen is too C-specific to share anything but basic stuff like calculating values for enums can easily be shared. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31	tools: ynl-gen: prevent do / dump reordering	Jakub Kicinski
	An earlier fix tried to address generated code jumping around one code-gen run to another. Turns out dict()s are already ordered since Python 3.7, the problem is that we iterate over operation modes using a set(). Sets are unordered in Python. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-01	powerpc/kexec_file: Count hot-pluggable memory in FDT estimate	Sourabh Jain
	On Systems where online memory is lesser compared to max memory, the kexec_file_load system call may fail to load the kdump kernel with the below errors: "Failed to update fdt with linux,drconf-usable-memory property" "Error setting up usable-memory property for kdump kernel" This happens because the size estimation for usable memory properties for the kdump kernel's FDT is based on the online memory whereas the usable memory properties include max memory. In short, the hot-pluggable memory is not accounted for while estimating the size of the usable memory properties. The issue is addressed by calculating usable memory property size using max hotplug address instead of the last online memory address. Fixes: 2377c92e37fe ("powerpc/kexec_file: fix FDT size estimation for kdump kernel") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20230131030615.729894-1-sourabhjain@linux.ibm.com
2023-01-31	mm: memcg: fix NULL pointer in mem_cgroup_track_foreign_dirty_slowpath()	Kefeng Wang
	As commit 18365225f044 ("hwpoison, memcg: forcibly uncharge LRU pages"), hwpoison will forcibly uncharg a LRU hwpoisoned page, the folio_memcg could be NULl, then, mem_cgroup_track_foreign_dirty_slowpath() could occurs a NULL pointer dereference, let's do not record the foreign writebacks for folio memcg is null in mem_cgroup_track_foreign_dirty() to fix it. Link: https://lkml.kernel.org/r/20230129040945.180629-1-wangkefeng.wang@huawei.com Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reported-by: Ma Wupeng <mawupeng1@huawei.com> Tested-by: Miko Larsson <mikoxyzzz@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Ma Wupeng <mawupeng1@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31	Kconfig.debug: fix the help description in SCHED_DEBUG	ye xingchen
	The correct file path for SCHED_DEBUG is /sys/kernel/debug/sched. Link: https://lkml.kernel.org/r/202301291013573466558@zte.com.cn Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31	mm/swapfile: add cond_resched() in get_swap_pages()	Longlong Xia
	The softlockup still occurs in get_swap_pages() under memory pressure. 64 CPU cores, 64GB memory, and 28 zram devices, the disksize of each zram device is 50MB with same priority as si. Use the stress-ng tool to increase memory pressure, causing the system to oom frequently. The plist_for_each_entry_safe() loops in get_swap_pages() could reach tens of thousands of times to find available space (extreme case: cond_resched() is not called in scan_swap_map_slots()). Let's add cond_resched() into get_swap_pages() when failed to find available space to avoid softlockup. Link: https://lkml.kernel.org/r/20230128094757.1060525-1-xialonglong1@huawei.com Signed-off-by: Longlong Xia <xialonglong1@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Chen Wandun <chenwandun@huawei.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31	mm: use stack_depot_early_init for kmemleak	Zhaoyang Huang
	Mirsad report the below error which is caused by stack_depot_init() failure in kvcalloc. Solve this by having stackdepot use stack_depot_early_init(). On 1/4/23 17:08, Mirsad Goran Todorovac wrote: I hate to bring bad news again, but there seems to be a problem with the output of /sys/kernel/debug/kmemleak: [root@pc-mtodorov ~]# cat /sys/kernel/debug/kmemleak unreferenced object 0xffff951c118568b0 (size 16): comm "kworker/u12:2", pid 56, jiffies 4294893952 (age 4356.548s) hex dump (first 16 bytes): 6d 65 6d 73 74 69 63 6b 30 00 00 00 00 00 00 00 memstick0....... backtrace: [root@pc-mtodorov ~]# Apparently, backtrace of called functions on the stack is no longer printed with the list of memory leaks. This appeared on Lenovo desktop 10TX000VCR, with AlmaLinux 8.7 and BIOS version M22KT49A (11/10/2022) and 6.2-rc1 and 6.2-rc2 builds. This worked on 6.1 with the same CONFIG_KMEMLEAK=y and MGLRU enabled on a vanilla mainstream kernel from Mr. Torvalds' tree. I don't know if this is deliberate feature for some reason or a bug. Please find attached the config, lshw and kmemleak output. [vbabka@suse.cz: remove stack_depot_init() call] Link: https://lore.kernel.org/all/5272a819-ef74-65ff-be61-4d2d567337de@alu.unizg.hr/ Link: https://lkml.kernel.org/r/1674091345-14799-2-git-send-email-zhaoyang.huang@unisoc.com Fixes: 56a61617dd22 ("mm: use stack_depot for recording kmemleak's backtrace") Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: ke.wang <ke.wang@unisoc.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31	Squashfs: fix handling and sanity checking of xattr_ids count	Phillip Lougher
	A Sysbot [1] corrupted filesystem exposes two flaws in the handling and sanity checking of the xattr_ids count in the filesystem. Both of these flaws cause computation overflow due to incorrect typing. In the corrupted filesystem the xattr_ids value is 4294967071, which stored in a signed variable becomes the negative number -225. Flaw 1 (64-bit systems only): The signed integer xattr_ids variable causes sign extension. This causes variable overflow in the SQUASHFS_XATTR_(A) macros. The variable is first multiplied by sizeof(struct squashfs_xattr_id) where the type of the sizeof operator is "unsigned long". On a 64-bit system this is 64-bits in size, and causes the negative number to be sign extended and widened to 64-bits and then become unsigned. This produces the very large number 18446744073709548016 or 2^64 - 3600. This number when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and divided by SQUASHFS_METADATA_SIZE overflows and produces a length of 0 (stored in len). Flaw 2 (32-bit systems only): On a 32-bit system the integer variable is not widened by the unsigned long type of the sizeof operator (32-bits), and the signedness of the variable has no effect due it always being treated as unsigned. The above corrupted xattr_ids value of 4294967071, when multiplied overflows and produces the number 4294963696 or 2^32 - 3400. This number when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and divided by SQUASHFS_METADATA_SIZE overflows again and produces a length of 0. The effect of the 0 length computation: In conjunction with the corrupted xattr_ids field, the filesystem also has a corrupted xattr_table_start value, where it matches the end of filesystem value of 850. This causes the following sanity check code to fail because the incorrectly computed len of 0 matches the incorrect size of the table reported by the superblock (0 bytes). len = SQUASHFS_XATTR_BLOCK_BYTES(xattr_ids); indexes = SQUASHFS_XATTR_BLOCKS(xattr_ids); / * The computed size of the index table (len bytes) should exactly * match the table start and end points / start = table_start + sizeof(id_table); end = msblk->bytes_used; if (len != (end - start)) return ERR_PTR(-EINVAL); Changing the xattr_ids variable to be "usigned int" fixes the flaw on a 64-bit system. This relies on the fact the computation is widened by the unsigned long type of the sizeof operator. Casting the variable to u64 in the above macro fixes this flaw on a 32-bit system. It also means 64-bit systems do not implicitly rely on the type of the sizeof operator to widen the computation. [1] https://lore.kernel.org/lkml/000000000000cd44f005f1a0f17f@google.com/ Link: https://lkml.kernel.org/r/20230127061842.10965-1-phillip@squashfs.org.uk Fixes: 506220d2ba21 ("squashfs: add more sanity checks in xattr id lookup") Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reported-by: <syzbot+082fa4af80a5bb1a9843@syzkaller.appspotmail.com> Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> Cc: Fedor Pchelkin <pchelkin@ispras.ru> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31	sh: define RUNTIME_DISCARD_EXIT	Tom Saeger
	sh vmlinux fails to link with GNU ld < 2.40 (likely < 2.36) since commit 99cb0d917ffa ("arch: fix broken BuildID for arm64 and riscv"). This is similar to fixes for powerpc and s390: commit 4b9880dbf3bd ("powerpc/vmlinux.lds: Define RUNTIME_DISCARD_EXIT"). commit a494398bde27 ("s390: define RUNTIME_DISCARD_EXIT to fix link error with GNU ld < 2.36"). $ sh4-linux-gnu-ld --version \| head -n1 GNU ld (GNU Binutils for Debian) 2.35.2 $ make ARCH=sh CROSS_COMPILE=sh4-linux-gnu- microdev_defconfig $ make ARCH=sh CROSS_COMPILE=sh4-linux-gnu- `.exit.text' referenced in section `__bug_table' of crypto/algboss.o: defined in discarded section `.exit.text' of crypto/algboss.o `.exit.text' referenced in section `__bug_table' of drivers/char/hw_random/core.o: defined in discarded section `.exit.text' of drivers/char/hw_random/core.o make[2]: * [scripts/Makefile.vmlinux:34: vmlinux] Error 1 make[1]: * [Makefile:1252: vmlinux] Error 2 arch/sh/kernel/vmlinux.lds.S keeps EXIT_TEXT: /* * .exit.text is discarded at runtime, not link time, to deal with * references from __bug_table */ .exit.text : AT(ADDR(.exit.text)) { EXIT_TEXT } However, EXIT_TEXT is thrown away by DISCARD(include/asm-generic/vmlinux.lds.h) because sh does not define RUNTIME_DISCARD_EXIT. GNU ld 2.40 does not have this issue and builds fine. This corresponds with Masahiro's comments in a494398bde27: "Nathan [Chancellor] also found that binutils commit 21401fc7bf67 ("Duplicate output sections in scripts") cured this issue, so we cannot reproduce it with binutils 2.36+, but it is better to not rely on it." Link: https://lkml.kernel.org/r/9166a8abdc0f979e50377e61780a4bba1dfa2f52.1674518464.git.tom.saeger@oracle.com Fixes: 99cb0d917ffa ("arch: fix broken BuildID for arm64 and riscv") Link: https://lore.kernel.org/all/Y7Jal56f6UBh1abE@dev-arch.thelio-3990X/ Link: https://lore.kernel.org/all/20230123194218.47ssfzhrpnv3xfez@oracle.com/ Signed-off-by: Tom Saeger <tom.saeger@oracle.com> Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Dennis Gilmore <dennis@ausil.us> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Rich Felker <dalias@libc.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31	highmem: round down the address passed to kunmap_flush_on_unmap()	Matthew Wilcox (Oracle)
	We already round down the address in kunmap_local_indexed() which is the other implementation of __kunmap_local(). The only implementation of kunmap_flush_on_unmap() is PA-RISC which is expecting a page-aligned address. This may be causing PA-RISC to be flushing the wrong addresses currently. Link: https://lkml.kernel.org/r/20230126200727.1680362-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Fixes: 298fa1ad5571 ("highmem: Provide generic variant of kmap_atomic*") Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Helge Deller <deller@gmx.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: David Sterba <dsterba@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31	migrate: hugetlb: check for hugetlb shared PMD in node migration	Mike Kravetz
	migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to move pages shared with another process to a different node. page_mapcount > 1 is being used to determine if a hugetlb page is shared. However, a hugetlb page will have a mapcount of 1 if mapped by multiple processes via a shared PMD. As a result, hugetlb pages shared by multiple processes and mapped with a shared PMD can be moved by a process without CAP_SYS_NICE. To fix, check for a shared PMD if mapcount is 1. If a shared PMD is found consider the page shared. Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com Fixes: e2d8cf405525 ("migrate: add hugepage migration code to migrate_pages()") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31	mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps	Mike Kravetz
	Patch series "Fixes for hugetlb mapcount at most 1 for shared PMDs". This issue of mapcount in hugetlb pages referenced by shared PMDs was discussed in [1]. The following two patches address user visible behavior caused by this issue. [1] https://lore.kernel.org/linux-mm/Y9BF+OCdWnCSilEu@monkey/ This patch (of 2): A hugetlb page will have a mapcount of 1 if mapped by multiple processes via a shared PMD. This is because only the first process increases the map count, and subsequent processes just add the shared PMD page to their page table. page_mapcount is being used to decide if a hugetlb page is shared or private in /proc/PID/smaps. Pages referenced via a shared PMD were incorrectly being counted as private. To fix, check for a shared PMD if mapcount is 1. If a shared PMD is found count the hugetlb page as shared. A new helper to check for a shared PMD is added. [akpm@linux-foundation.org: simplification, per David] [akpm@linux-foundation.org: hugetlb.h: include page_ref.h for page_count()] Link: https://lkml.kernel.org/r/20230126222721.222195-2-mike.kravetz@oracle.com Fixes: 25ee01a2fca0 ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>