summaryrefslogtreecommitdiff
path: root/drivers/base
AgeCommit message (Collapse)Author
2021-04-05driver core: Fix locking bug in deferred_probe_timeout_work_func()Saravana Kannan
list_for_each_entry_safe() is only useful if we are deleting nodes in a linked list within the loop. It doesn't protect against other threads adding/deleting nodes to the list in parallel. We need to grab deferred_probe_mutex when traversing the deferred_probe_pending_list. Cc: stable@vger.kernel.org Fixes: 25b4e70dcce9 ("driver core: allow stopping deferred probe after init") Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20210402040342.2944858-2-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-05Merge 5.12-rc6 into driver-core-nextGreg Kroah-Hartman
We need the driver core fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-03Merge tag 'driver-core-5.12-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fix from Greg KH: "Here is a single driver core fix for a reported problem with differed probing. It has been in linux-next for a while with no reported problems" * tag 'driver-core-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: driver core: clear deferred probe reason on probe retry
2021-04-02driver core: platform: Make platform_get_irq_optional() optionalAndy Shevchenko
Currently the platform_get_irq_optional() returns an error code even if IRQ resource sumply has not been found. It prevents caller to be error code agnostic in their error handling. Now: ret = platform_get_irq_optional(...); if (ret != -ENXIO) return ret; // respect deferred probe if (ret > 0) ...we get an IRQ... After proposed change: ret = platform_get_irq_optional(...); if (ret < 0) return ret; if (ret > 0) ...we get an IRQ... Reported-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20210331144526.19439-1-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-02driver core: Replace printf() specifier and drop unneeded castingAndy Shevchenko
The size_t type has very well established specifier, i.e. "%zu", use it directly instead of casting to unsigned long with "%lu". Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20210401171042.60612-1-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-02driver core: Cast to (void *) with __force for __percpu pointerAndy Shevchenko
Sparse is not happy: drivers/base/devres.c:1230:9: warning: cast removes address space '__percpu' of expression Use __force attribute to make it happy. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20210401171030.60527-1-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-02driver core: platform: Make clear error code used for missed IRQAndy Shevchenko
We have few code paths where same error code is assigned and returned for missed IRQ. Unify that under single error path. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20210331145937.35980-1-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-02devcoredump: fix kernel-doc warningPierre-Louis Bossart
remove make W=1 warnings drivers/base/devcoredump.c:208: warning: Function parameter or member 'data' not described in 'devcd_free_sgtable' drivers/base/devcoredump.c:208: warning: Excess function parameter 'table' description in 'devcd_free_sgtable' drivers/base/devcoredump.c:225: warning: expecting prototype for devcd_read_from_table(). Prototype was for devcd_read_from_sgtable() instead Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Link: https://lore.kernel.org/r/20210331232614.304591-8-pierre-louis.bossart@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-02platform-msi: fix kernel-doc warningsPierre-Louis Bossart
remove make W=1 warnings drivers/base/platform-msi.c:336: warning: Function parameter or member 'is_tree' not described in '__platform_msi_create_device_domain' drivers/base/platform-msi.c:336: warning: expecting prototype for platform_msi_create_device_domain(). Prototype was for __platform_msi_create_device_domain() instead Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Link: https://lore.kernel.org/r/20210331232614.304591-7-pierre-louis.bossart@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-02driver core: attribute_container: remove kernel-doc warningsPierre-Louis Bossart
Remove make W=1 warnings drivers/base/attribute_container.c:471: warning: Function parameter or member 'cont' not described in 'attribute_container_add_class_device_adapter' drivers/base/attribute_container.c:471: warning: Function parameter or member 'dev' not described in 'attribute_container_add_class_device_adapter' drivers/base/attribute_container.c:471: warning: Function parameter or member 'classdev' not described in 'attribute_container_add_class_device_adapter' Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Link: https://lore.kernel.org/r/20210331232614.304591-3-pierre-louis.bossart@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-02driver core: remove kernel-doc warningsPierre-Louis Bossart
remove make W=1 warning: drivers/base/core.c:1670: warning: Function parameter or member 'flags' not described in 'fw_devlink_create_devlink' drivers/base/core.c:1670: warning: Function parameter or member 'con' not described in 'fw_devlink_create_devlink' drivers/base/core.c:1670: warning: Function parameter or member 'sup_handle' not described in 'fw_devlink_create_devlink' drivers/base/core.c:1670: warning: Function parameter or member 'flags' not described in 'fw_devlink_create_devlink' drivers/base/core.c:1763: warning: Function parameter or member 'dev' not described in '__fw_devlink_link_to_consumers' drivers/base/core.c:1844: warning: Function parameter or member 'dev' not described in '__fw_devlink_link_to_suppliers' drivers/base/core.c:1844: warning: Function parameter or member 'fwnode' not described in '__fw_devlink_link_to_suppliers' Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Link: https://lore.kernel.org/r/20210331232614.304591-2-pierre-louis.bossart@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-31regmap-irq: Add driver callback to configure virtual regsGuru Das Srinagesh
Enable drivers to configure and modify "virtual" registers, which are non-standard registers that further configure irq type on some devices. Since they are non-standard, enable drivers to configure them according to their particular idiosyncrasies by specifying an optional callback function while registering with the framework. Signed-off-by: Guru Das Srinagesh <gurus@codeaurora.org> Link: https://lore.kernel.org/r/07e058cdec2297d15c95c825aa0263064d962d5a.1616613838.git.gurus@codeaurora.org Signed-off-by: Mark Brown <broonie@kernel.org>
2021-03-31regmap-irq: Introduce virtual regs to handle more config regsGuru Das Srinagesh
Add "virtual" registers support to handle any irq configuration registers in addition to the ones the framework currently supports (status, mask, unmask, wake, type and ack). These are non-standard registers that further configure irq type on some devices, so enable the framework to add a variable number of them. Signed-off-by: Guru Das Srinagesh <gurus@codeaurora.org> Link: https://lore.kernel.org/r/a1787067004b0e11cb960319082764397469215a.1616613838.git.gurus@codeaurora.org Signed-off-by: Mark Brown <broonie@kernel.org>
2021-03-29PM: runtime: Fix race getting/putting suppliers at probeAdrian Hunter
pm_runtime_put_suppliers() must not decrement rpm_active unless the consumer is suspended. That is because, otherwise, it could suspend suppliers for an active consumer. That can happen as follows: static int driver_probe_device(struct device_driver *drv, struct device *dev) { int ret = 0; if (!device_is_registered(dev)) return -ENODEV; dev->can_match = true; pr_debug("bus: '%s': %s: matched device %s with driver %s\n", drv->bus->name, __func__, dev_name(dev), drv->name); pm_runtime_get_suppliers(dev); if (dev->parent) pm_runtime_get_sync(dev->parent); At this point, dev can runtime suspend so rpm_put_suppliers() can run, rpm_active becomes 1 (the lowest value). pm_runtime_barrier(dev); if (initcall_debug) ret = really_probe_debug(dev, drv); else ret = really_probe(dev, drv); Probe callback can have runtime resumed dev, and then runtime put so dev is awaiting autosuspend, but rpm_active is 2. pm_request_idle(dev); if (dev->parent) pm_runtime_put(dev->parent); pm_runtime_put_suppliers(dev); Now pm_runtime_put_suppliers() will put the supplier i.e. rpm_active 2 -> 1, but consumer can still be active. return ret; } Fix by checking the runtime status. For any status other than RPM_SUSPENDED, rpm_active can be considered to be "owned" by rpm_[get/put]_suppliers() and pm_runtime_put_suppliers() need do nothing. Reported-by: Asutosh Das <asutoshd@codeaurora.org> Fixes: 4c06c4e6cf63 ("driver core: Fix possible supplier PM-usage counter imbalance") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Cc: 5.1+ <stable@vger.kernel.org> # 5.1+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-03-29PM: runtime: Fix ordering in pm_runtime_get_suppliers()Adrian Hunter
rpm_active indicates how many times the supplier usage_count has been incremented. Consequently it must be updated after pm_runtime_get_sync() of the supplier, not before. Fixes: 4c06c4e6cf63 ("driver core: Fix possible supplier PM-usage counter imbalance") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Cc: 5.1+ <stable@vger.kernel.org> # 5.1+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-03-28base: dd: fix error return code of driver_sysfs_add()Jia-Ju Bai
When device_create_file() fails and returns a non-zero value, no error return code of driver_sysfs_add() is assigned. To fix this bug, ret is assigned with the return value of device_create_file(), and then ret is checked. Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Link: https://lore.kernel.org/r/20210324023405.12465-1-baijiaju1990@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-28driver core: Use unbound workqueue for deferred probesYogesh Lal
Deferred probe usually runs only on pinned kworkers, which might take longer time if a device contains multiple sub-devices. One such case is of sound card on mobile devices, where we have good number of mixers and controls per mixer. We observed boot up improvement - deferred probes take ~600ms when bound to little core kworker and ~200ms when deferred probe is queued on unbound wq. This is due to scheduler moving the worker running deferred probe work to big CPUs. Without this change, we see the worker is running on LITTLE CPU due to affinity. Since kworker runs deferred probe of several devices, the locality may not be important. Also, init thread executing driver initcalls, can potentially migrate as it has cpu affinity set to all cpus.In addition to this, async probes use unbounded workqueue. So, using unbounded wq for deferred probes looks to be similar to these w.r.t. scheduling behavior. Signed-off-by: Yogesh Lal <ylal@codeaurora.org> Link: https://lore.kernel.org/r/1616583698-6398-1-git-send-email-ylal@codeaurora.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-23driver core: clear deferred probe reason on probe retryAhmad Fatoum
When retrying a deferred probe, any old defer reason string should be discarded. Otherwise, if the probe is deferred again at a different spot, but without setting a message, the now incorrect probe reason will remain. This was observed with the i.MX I2C driver, which ultimately failed to probe due to lack of the GPIO driver. The probe defer for GPIO doesn't record a message, but a previous probe defer to clock_get did. This had the effect that /sys/kernel/debug/devices_deferred listed a misleading probe deferral reason. Cc: stable <stable@vger.kernel.org> Fixes: d090b70ede02 ("driver core: add deferring probe reason to devices_deferred property") Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reviewed-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Link: https://lore.kernel.org/r/20210319110459.19966-1-a.fatoum@pengutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-23devcoredump: avoid -Wempty-body warningsArnd Bergmann
Cleaning out the last -Wempty-body warnings found some interesting cases with empty macros, along with harmless warnings like this one: drivers/base/devcoredump.c: In function 'dev_coredumpm': drivers/base/devcoredump.c:297:56: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 297 | /* nothing - symlink will be missing */; | ^ drivers/base/devcoredump.c:301:56: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 301 | /* nothing - symlink will be missing */; | ^ Randy tried addressing this one before, and there were multiple other ideas in that thread. Add a runtime warning and code comment here. Link: https://lore.kernel.org/lkml/20200418184111.13401-8-rdunlap@infradead.org/ Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20210322114258.3420937-1-arnd@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-23device property: Add test cases for fwnode_property_count_*() APIsAndy Shevchenko
Add test cases for fwnode_property_count_*() APIs. While at it, modify the arrays of integers to be size of non-power-of-2 for better test coverage and decreasing stack usage. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20210212162539.86850-1-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-23device property: Sync descriptions of swnode array and group APIsAndy Shevchenko
After a few updates against swnode APIs the kernel documentation, i.e. for swnode group registration and unregistration deviates from the one for swnode array. In general, the same rules are applied to both. Hence, synchronize descriptions of swnode array and group APIs Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20210308103644.81960-1-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-23Revert "Revert "driver core: Set fw_devlink=on by default""Saravana Kannan
This reverts commit 3e4c982f1ce75faf5314477b8da296d2d00919df. Since all reported issues due to fw_devlink=on should be addressed by this series, revert the revert. fw_devlink=on Take II. Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20210302211133.2244281-4-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-23driver core: Update device link status properly for device_bind_driver()Saravana Kannan
Device link status was not getting updated correctly when device_bind_driver() is called on a device. This causes a warning[1]. Fix this by updating device links that can be updated and dropping device links that can't be updated to a sensible state. [1] - https://lore.kernel.org/lkml/56f7d032-ba5a-a8c7-23de-2969d98c527e@nvidia.com/ Tested-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20210302211133.2244281-3-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-23driver core: Avoid pointless deferred probe attemptsSaravana Kannan
There's no point in adding a device to the deferred probe list if we know for sure that it doesn't have a matching driver. So, check if a device can match with a driver before adding it to the deferred probe list. Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20210302211133.2244281-2-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-23devtmpfs: actually reclaim some init memoryRasmus Villemoes
Currently gcc seems to inline devtmpfs_setup() into devtmpfsd(), so its memory footprint isn't reclaimed as intended. Mark it noinline to make sure it gets put in .init.text. While here, setup_done can also be put in .init.data: After complete() releases the internal spinlock, the completion object is never touched again by that thread, and the waiting thread doesn't proceed until it observes ->done while holding that spinlock. This is now the same pattern as for kthreadd_done in init/main.c: complete() is done in a __ref function, while the corresponding wait_for_completion() is in an __init function. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Link: https://lore.kernel.org/r/20210312103027.2701413-2-linux@rasmusvillemoes.dk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-23devtmpfs: fix placement of complete() callRasmus Villemoes
Calling complete() from within the __init function is wrong - theoretically, the init process could proceed all the way to freeing the init mem before the devtmpfsd thread gets to execute the return instruction in devtmpfs_setup(). In practice, it seems to be harmless as gcc inlines devtmpfs_setup() into devtmpfsd(). So the calls of the __init functions init_chdir() etc. actually happen from devtmpfs_setup(), but the __ref on that one silences modpost (it's all right, because those calls happen before the complete()). But it does make the __init annotation of the setup function moot, which we'll fix in a subsequent patch. Fixes: bcbacc4909f1 ("devtmpfs: refactor devtmpfsd()") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Link: https://lore.kernel.org/r/20210312103027.2701413-1-linux@rasmusvillemoes.dk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-23drivers/base/cpu: remove redundant assignment of variable retvalColin Ian King
The variable retval is being initialized with a value that is never read and it is being updated later with a new value. Clean this up by initializing retval to -ENOMEM and remove the assignment to retval on the !dev failure path. Kudos to Rafael for the improved fix suggestion. Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20210218202837.516231-1-colin.king@canonical.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-23driver core: dd: remove deferred_devices variableGreg Kroah-Hartman
No need to save the debugfs dentry for the "devices_deferred" debugfs file (gotta love the juxtaposition), if we need to remove it we can look it up from debugfs itself. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: linux-kernel@vger.kernel.org Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20210216142400.3759099-2-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-23driver core: component: remove dentry pointer in "struct master"Greg Kroah-Hartman
There is no need to keep around a pointer to a dentry when all it is used for is to remove the debugfs file when tearing things down. As the name is simple, have debugfs look up the dentry when removing things, keeping the logic much simpler. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: linux-kernel@vger.kernel.org Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20210216142400.3759099-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-23driver core: auxiliary bus: Remove unneeded module bitsDave Jiang
Remove module bits in the auxiliary bus code since the auxiliary bus cannot be built as a module and the relevant code is not needed. Cc: Dave Ertman <david.m.ertman@intel.com> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/161307488980.1896017.15627190714413338196.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-22PM: runtime: Defer suspending suppliersRafael J. Wysocki
Because the PM-runtime status of the device is not updated in __rpm_callback(), attempts to suspend the suppliers of the given device triggered by the rpm_put_suppliers() call in there may cause a supplier to be suspended completely before the status of the consumer is updated to RPM_SUSPENDED, which is confusing. To avoid that (1) modify __rpm_callback() to only decrease the PM-runtime usage counter of each supplier and (2) make rpm_suspend() try to suspend the suppliers after changing the consumer's status to RPM_SUSPENDED, in analogy with the device's parent. Link: https://lore.kernel.org/linux-pm/CAPDyKFqm06KDw_p8WXsM4dijDbho4bb6T4k50UqqvR1_COsp8g@mail.gmail.com/ Fixes: 21d5c57b3726 ("PM / runtime: Use device links") Reported-by: elaine.zhang <zhangqing@rock-chips.com> Diagnosed-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-03-19Revert "PM: runtime: Update device status before letting suppliers suspend"Rafael J. Wysocki
Revert commit 44cc89f76464 ("PM: runtime: Update device status before letting suppliers suspend") that introduced a race condition into __rpm_callback() which allowed a concurrent rpm_resume() to run and resume the device prematurely after its status had been changed to RPM_SUSPENDED by __rpm_callback(). Fixes: 44cc89f76464 ("PM: runtime: Update device status before letting suppliers suspend") Link: https://lore.kernel.org/linux-pm/24dfb6fc-5d54-6ee2-9195-26428b7ecf8a@intel.com/ Reported-by: Adrian Hunter <adrian.hunter@intel.com> Cc: 4.10+ <stable@vger.kernel.org> # 4.10+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2021-03-18PM: domains: Don't runtime resume devices at genpd_prepare()Ulf Hansson
Runtime resuming a device upfront in the genpd_prepare() callback, to check if there is a wakeup pending for it, seems like an unnecessary thing to do. The PM core already manages these kind of things in a common way in __device_suspend(), via calling pm_runtime_barrier() and pm_wakeup_pending(). Therefore, let's simply drop this behaviour from genpd_prepare(). Note that, this change is applicable only for devices that are attached to a genpd that has the GENPD_FLAG_ACTIVE_WAKEUP set (Renesas, Mediatek, and Rockchip platforms). Moreover, a driver that needs to restore power for its device to re-configure it for a system wakeup, may still call pm_runtime_get_sync(), for example, to do this. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-03-18regmap-irq: Extend sub-irq to support non-fixed reg stridesGuru Das Srinagesh
Qualcomm's MFD chips have a top level interrupt status register and sub-irqs (peripherals). When a bit in the main status register goes high, it means that the peripheral corresponding to that bit has an unserviced interrupt. If the bit is not set, this means that the corresponding peripheral does not. Commit a2d21848d9211d ("regmap: regmap-irq: Add main status register support") introduced the sub-irq logic that is currently applied only when reading status registers, but not for any other functions like acking or masking. Extend the use of sub-irq to all other functions, with two caveats regarding the specification of offsets: - Each member of the sub_reg_offsets array should be of length 1 - The specified offsets should be the unequal strides for each sub-irq device. In QCOM's case, all the *_base registers are to be configured to the base addresses of the first sub-irq group, with offsets of each subsequent group calculated as a difference from these addresses. Continuing from the example mentioned in the cover letter: /* * Address of MISC_INT_MASK = 0x1011 * Address of TEMP_ALARM_INT_MASK = 0x2011 * Address of GPIO01_INT_MASK = 0x3011 * * Calculate offsets as: * offset_0 = 0x1011 - 0x1011 = 0 (to access MISC's * registers) * offset_1 = 0x2011 - 0x1011 = 0x1000 * offset_2 = 0x3011 - 0x1011 = 0x2000 */ static unsigned int sub_unit0_offsets[] = {0}; static unsigned int sub_unit1_offsets[] = {0x1000}; static unsigned int sub_unit2_offsets[] = {0x2000}; static struct regmap_irq_sub_irq_map chip_sub_irq_offsets[] = { REGMAP_IRQ_MAIN_REG_OFFSET(sub_unit0_offsets), REGMAP_IRQ_MAIN_REG_OFFSET(sub_unit0_offsets), REGMAP_IRQ_MAIN_REG_OFFSET(sub_unit0_offsets), }; static struct regmap_irq_chip chip_irq_chip = { --------8<-------- .not_fixed_stride = true, .mask_base = MISC_INT_MASK, .type_base = MISC_INT_TYPE, .ack_base = MISC_INT_ACK, .sub_reg_offsets = chip_sub_irq_offsets, --------8<-------- }; Signed-off-by: Guru Das Srinagesh <gurus@codeaurora.org> Link: https://lore.kernel.org/r/526562423eaa58b4075362083f561841f1d6956c.1615423027.git.gurus@codeaurora.org Signed-off-by: Mark Brown <broonie@kernel.org>
2021-03-12arch_topology: Export arch_freq_scale and helpersViresh Kumar
It is possible now for other parts of the kernel to provide their own implementation of sched_freq_tick() and they can very well be modules themselves (like CPPC cpufreq driver, which is going to use these in a later commit). Export arch_freq_scale and topology_{set|clear}_scale_freq_source(). Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com> Tested-by: Ionela Voinescu <ionela.voinescu@arm.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2021-03-10software node: Fix device_add_software_node()Heikki Krogerus
The function device_add_software_node() was meant to register the node supplied to it, but only if that node wasn't already registered. Right now the function attempts to always register the node. That will cause a failure with nodes that are already registered. Fixing that by incrementing the reference count of the nodes that have already been registered, and only registering the new nodes. Also, clarifying the behaviour in the function documentation. Fixes: e68d0119e328 ("software node: Introduce device_add_software_node()") Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-03-10software node: Fix node registrationHeikki Krogerus
Software node can not be registered before its parent. Fixes: 80488a6b1d3c ("software node: Add support for static node descriptors") Cc: 5.10+ <stable@vger.kernel.org> # 5.10+ Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-03-10regmap: set debugfs_name to NULL after it is freedMeng Li
There is a upstream commit cffa4b2122f5("regmap:debugfs: Fix a memory leak when calling regmap_attach_dev") that adds a if condition when create name for debugfs_name. With below function invoking logical, debugfs_name is freed in regmap_debugfs_exit(), but it is not created again because of the if condition introduced by above commit. regmap_reinit_cache() regmap_debugfs_exit() ... regmap_debugfs_init() So, set debugfs_name to NULL after it is freed. Fixes: cffa4b2122f5 ("regmap: debugfs: Fix a memory leak when calling regmap_attach_dev") Signed-off-by: Meng Li <Meng.Li@windriver.com> Link: https://lore.kernel.org/r/20210226021737.7690-1-Meng.Li@windriver.com Signed-off-by: Mark Brown <broonie@kernel.org>
2021-03-10arch_topology: Allow multiple entities to provide sched_freq_tick() callbackViresh Kumar
This patch attempts to make it generic enough so other parts of the kernel can also provide their own implementation of scale_freq_tick() callback, which is called by the scheduler periodically to update the per-cpu arch_freq_scale variable. The implementations now need to provide 'struct scale_freq_data' for the CPUs for which they have hardware counters available, and a callback gets registered for each possible CPU in a per-cpu variable. The arch specific (or ARM AMU) counters are updated to adapt to this and they take the highest priority if they are available, i.e. they will be used instead of CPPC based counters for example. The special code to rebuild the sched domains, in case invariance status change for the system, is moved out of arm64 specific code and is added to arch_topology.c. Note that this also defines SCALE_FREQ_SOURCE_CPUFREQ but doesn't use it and it is added to show that cpufreq is also acts as source of information for FIE and will be used by default if no other counters are supported for a platform. Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com> Tested-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Will Deacon <will@kernel.org> # for arm64 Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2021-03-10arch_topology: Rename freq_scale as arch_freq_scaleViresh Kumar
Rename freq_scale to a less generic name, as it will get exported soon for modules. Since x86 already names its own implementation of this as arch_freq_scale, lets stick to that. Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2021-03-01PM: runtime: Update device status before letting suppliers suspendRafael J. Wysocki
Because the PM-runtime status of the device is not updated in __rpm_callback(), attempts to suspend the suppliers of the given device triggered by rpm_put_suppliers() called by it may fail. Fix this by making __rpm_callback() update the device's status to RPM_SUSPENDED before calling rpm_put_suppliers() if the current status of the device is RPM_SUSPENDING and the callback just invoked by it has returned 0 (success). While at it, modify the code in __rpm_callback() to always check the device's PM-runtime status under its PM lock. Link: https://lore.kernel.org/linux-pm/CAPDyKFqm06KDw_p8WXsM4dijDbho4bb6T4k50UqqvR1_COsp8g@mail.gmail.com/ Fixes: 21d5c57b3726 ("PM / runtime: Use device links") Reported-by: Elaine Zhang <zhangqing@rock-chips.com> Diagnosed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Elaine Zhang <zhangiqng@rock-chips.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Cc: 4.10+ <stable@vger.kernel.org> # 4.10+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-02-26Merge tag 'riscv-for-linus-5.12-mw0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: "A handful of new RISC-V related patches for this merge window: - A check to ensure drivers are properly using uaccess. This isn't manifesting with any of the drivers I'm currently using, but may catch errors in new drivers. - Some preliminary support for the FU740, along with the HiFive Unleashed it will appear on. - NUMA support for RISC-V, which involves making the arm64 code generic. - Support for kasan on the vmalloc region. - A handful of new drivers for the Kendryte K210, along with the DT plumbing required to boot on a handful of K210-based boards. - Support for allocating ASIDs. - Preliminary support for kernels larger than 128MiB. - Various other improvements to our KASAN support, including the utilization of huge pages when allocating the KASAN regions. We may have already found a bug with the KASAN_VMALLOC code, but it's passing my tests. There's a fix in the works, but that will probably miss the merge window. * tag 'riscv-for-linus-5.12-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (75 commits) riscv: Improve kasan population by using hugepages when possible riscv: Improve kasan population function riscv: Use KASAN_SHADOW_INIT define for kasan memory initialization riscv: Improve kasan definitions riscv: Get rid of MAX_EARLY_MAPPING_SIZE soc: canaan: Sort the Makefile alphabetically riscv: Disable KSAN_SANITIZE for vDSO riscv: Remove unnecessary declaration riscv: Add Canaan Kendryte K210 SD card defconfig riscv: Update Canaan Kendryte K210 defconfig riscv: Add Kendryte KD233 board device tree riscv: Add SiPeed MAIXDUINO board device tree riscv: Add SiPeed MAIX GO board device tree riscv: Add SiPeed MAIX DOCK board device tree riscv: Add SiPeed MAIX BiT board device tree riscv: Update Canaan Kendryte K210 device tree dt-bindings: add resets property to dw-apb-timer dt-bindings: fix sifive gpio properties dt-bindings: update sifive uart compatible string dt-bindings: update sifive clint compatible string ...
2021-02-26drivers/base/memory: don't store phys_device in memory blocksDavid Hildenbrand
No need to store the value for each and every memory block, as we can easily query the value at runtime. Reshuffle the members to optimize the memory layout. Also, let's clarify what the interface once was used for and why it's legacy nowadays. "phys_device" was used on s390x in older versions of lsmem[2]/chmem[3], back when they were still part of s390x-tools. They were later replaced by the variants in linux-utils. For example, RHEL6 and RHEL7 contain lsmem/chmem from s390-utils. RHEL8 switched to versions from util-linux on s390x [4]. "phys_device" was added with sysfs support for memory hotplug in commit 3947be1969a9 ("[PATCH] memory hotplug: sysfs and add/remove functions") in 2005. It always returned 0. s390x started returning something != 0 on some setups (if sclp.rzm is set by HW) in 2010 via commit 57b552ba0b2f ("memory hotplug/s390: set phys_device"). For s390x, it allowed for identifying which memory block devices belong to the same storage increment (RZM). Only if all memory block devices comprising a single storage increment were offline, the memory could actually be removed in the hypervisor. Since commit e5d709bb5fb7 ("s390/memory hotplug: provide memory_block_size_bytes() function") in 2013 a memory block device spans at least one storage increment - which is why the interface isn't really helpful/used anymore (except by old lsmem/chmem tools). There were once RFC patches to make use of "phys_device" in ACPI context; however, the underlying problem could be solved using different interfaces [1]. [1] https://patchwork.kernel.org/patch/2163871/ [2] https://github.com/ibm-s390-tools/s390-tools/blob/v2.1.0/zconf/lsmem [3] https://github.com/ibm-s390-tools/s390-tools/blob/v2.1.0/zconf/chmem [4] https://bugzilla.redhat.com/show_bug.cgi?id=1504134 Link: https://lkml.kernel.org/r/20210201181347.13262-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Tom Rix <trix@redhat.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26mm/memory_hotplug: rename all existing 'memhp' into 'mhp'Anshuman Khandual
This renames all 'memhp' instances to 'mhp' except for memhp_default_state for being a kernel command line option. This is just a clean up and should not cause a functional change. Let's make it consistent rater than mixing the two prefixes. In preparation for more users of the 'mhp' terminology. Link: https://lkml.kernel.org/r/1611554093-27316-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: "A few small subsystems and some of MM. 172 patches. Subsystems affected by this patch series: hexagon, scripts, ntfs, ocfs2, vfs, and mm (slab-generic, slab, slub, debug, pagecache, swap, memcg, pagemap, mprotect, mremap, page-reporting, vmalloc, kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction, mempolicy, oom-kill, hugetlbfs, and migration)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (172 commits) mm/migrate: remove unneeded semicolons hugetlbfs: remove unneeded return value of hugetlb_vmtruncate() hugetlbfs: fix some comment typos hugetlbfs: correct some obsolete comments about inode i_mutex hugetlbfs: make hugepage size conversion more readable hugetlbfs: remove meaningless variable avoid_reserve hugetlbfs: correct obsolete function name in hugetlbfs_read_iter() hugetlbfs: use helper macro default_hstate in init_hugetlbfs_fs hugetlbfs: remove useless BUG_ON(!inode) in hugetlbfs_setattr() hugetlbfs: remove special hugetlbfs_set_page_dirty() mm/hugetlb: change hugetlb_reserve_pages() to type bool mm, oom: fix a comment in dump_task() mm/mempolicy: use helper range_in_vma() in queue_pages_test_walk() numa balancing: migrate on fault among multiple bound nodes mm, compaction: make fast_isolate_freepages() stay within zone mm/compaction: fix misbehaviors of fast_find_migrateblock() mm/compaction: correct deferral logic for proactive compaction mm/compaction: remove duplicated VM_BUG_ON_PAGE !PageLocked mm/compaction: remove rcu_read_lock during page compaction z3fold: simplify the zhdr initialization code in init_z3fold_page() ...
2021-02-24mm: memcg: add swapcache stat for memcg v2Shakeel Butt
This patch adds swapcache stat for the cgroup v2. The swapcache represents the memory that is accounted against both the memory and the swap limit of the cgroup. The main motivation behind exposing the swapcache stat is for enabling users to gracefully migrate from cgroup v1's memsw counter to cgroup v2's memory and swap counters. Cgroup v1's memsw limit allows users to limit the memory+swap usage of a workload but without control on the exact proportion of memory and swap. Cgroup v2 provides separate limits for memory and swap which enables more control on the exact usage of memory and swap individually for the workload. With some little subtleties, the v1's memsw limit can be switched with the sum of the v2's memory and swap limits. However the alternative for memsw usage is not yet available in cgroup v2. Exposing per-cgroup swapcache stat enables that alternative. Adding the memory usage and swap usage and subtracting the swapcache will approximate the memsw usage. This will help in the transparent migration of the workloads depending on memsw usage and limit to v2' memory and swap counters. The reasons these applications are still interested in this approximate memsw usage are: (1) these applications are not really interested in two separate memory and swap usage metrics. A single usage metric is more simple to use and reason about for them. (2) The memsw usage metric hides the underlying system's swap setup from the applications. Applications with multiple instances running in a datacenter with heterogeneous systems (some have swap and some don't) will keep seeing a consistent view of their usage. [akpm@linux-foundation.org: fix CONFIG_SWAP=n build] Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24mm: memcontrol: convert NR_FILE_PMDMAPPED account to pagesMuchun Song
Currently we use struct per_cpu_nodestat to cache the vmstat counters, which leads to inaccurate statistics especially THP vmstat counters. In the systems with hundreds of processors it can be GBs of memory. For example, for a 96 CPUs system, the threshold is the maximum number of 125. And the per cpu counters can cache 23.4375 GB in total. The THP page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like sensible. Although every THP stats update overflows the per-cpu counter, resorting to atomic global updates. But it can make the statistics more accuracy for the THP vmstat counters. So we convert the NR_FILE_PMDMAPPED account to pages. This patch is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival"). Doing this also can make the unit of vmstat counters more unified. Finally, the unit of the vmstat counters are pages, kB and bytes. The B/KB suffix can tell us that the unit is bytes or kB. The rest which is without suffix are pages. Link: https://lkml.kernel.org/r/20201228164110.2838-7-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: NeilBrown <neilb@suse.de> Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Cc: Rafael. J. Wysocki <rafael@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pagesMuchun Song
Currently we use struct per_cpu_nodestat to cache the vmstat counters, which leads to inaccurate statistics especially THP vmstat counters. In the systems with hundreds of processors it can be GBs of memory. For example, for a 96 CPUs system, the threshold is the maximum number of 125. And the per cpu counters can cache 23.4375 GB in total. The THP page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like sensible. Although every THP stats update overflows the per-cpu counter, resorting to atomic global updates. But it can make the statistics more accuracy for the THP vmstat counters. So we convert the NR_SHMEM_PMDMAPPED account to pages. This patch is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival"). Doing this also can make the unit of vmstat counters more unified. Finally, the unit of the vmstat counters are pages, kB and bytes. The B/KB suffix can tell us that the unit is bytes or kB. The rest which is without suffix are pages. Link: https://lkml.kernel.org/r/20201228164110.2838-6-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: NeilBrown <neilb@suse.de> Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Cc: Rafael. J. Wysocki <rafael@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24mm: memcontrol: convert NR_SHMEM_THPS account to pagesMuchun Song
Currently we use struct per_cpu_nodestat to cache the vmstat counters, which leads to inaccurate statistics especially THP vmstat counters. In the systems with hundreds of processors it can be GBs of memory. For example, for a 96 CPUs system, the threshold is the maximum number of 125. And the per cpu counters can cache 23.4375 GB in total. The THP page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like sensible. Although every THP stats update overflows the per-cpu counter, resorting to atomic global updates. But it can make the statistics more accuracy for the THP vmstat counters. So we convert the NR_SHMEM_THPS account to pages. This patch is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival"). Doing this also can make the unit of vmstat counters more unified. Finally, the unit of the vmstat counters are pages, kB and bytes. The B/KB suffix can tell us that the unit is bytes or kB. The rest which is without suffix are pages. Link: https://lkml.kernel.org/r/20201228164110.2838-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: NeilBrown <neilb@suse.de> Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Cc: Rafael. J. Wysocki <rafael@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24mm: memcontrol: convert NR_FILE_THPS account to pagesMuchun Song
Currently we use struct per_cpu_nodestat to cache the vmstat counters, which leads to inaccurate statistics especially THP vmstat counters. In the systems with if hundreds of processors it can be GBs of memory. For example, for a 96 CPUs system, the threshold is the maximum number of 125. And the per cpu counters can cache 23.4375 GB in total. The THP page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like sensible. Although every THP stats update overflows the per-cpu counter, resorting to atomic global updates. But it can make the statistics more accuracy for the THP vmstat counters. So we convert the NR_FILE_THPS account to pages. This patch is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival"). Doing this also can make the unit of vmstat counters more unified. Finally, the unit of the vmstat counters are pages, kB and bytes. The B/KB suffix can tell us that the unit is bytes or kB. The rest which is without suffix are pages. Link: https://lkml.kernel.org/r/20201228164110.2838-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: NeilBrown <neilb@suse.de> Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Cc: Rafael. J. Wysocki <rafael@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>