From eddb1c228f7951d399240a0cc57455dccc7f8777 Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Thu, 30 Jan 2020 22:12:54 -0800 Subject: mm/gup: introduce pin_user_pages*() and FOLL_PIN MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce pin_user_pages*() variations of get_user_pages*() calls, and also pin_longterm_pages*() variations. For now, these are placeholder calls, until the various call sites are converted to use the correct get_user_pages*() or pin_user_pages*() API. These variants will eventually all set FOLL_PIN, which is also introduced, and thoroughly documented. pin_user_pages() pin_user_pages_remote() pin_user_pages_fast() All pages that are pinned via the above calls, must be unpinned via put_user_page(). The underlying rules are: * FOLL_PIN is a gup-internal flag, so the call sites should not directly set it. That behavior is enforced with assertions. * Call sites that want to indicate that they are going to do DirectIO ("DIO") or something with similar characteristics, should call a get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers will: * Start with "pin_user_pages" instead of "get_user_pages". That makes it easy to find and audit the call sites. * Set FOLL_PIN * For pages that are received via FOLL_PIN, those pages must be returned via put_user_page(). Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in this documentation. (I've reworded it and expanded upon it.) Link: http://lkml.kernel.org/r/20200107224558.2362728-12-jhubbard@nvidia.com Signed-off-by: John Hubbard Reviewed-by: Jan Kara Reviewed-by: Mike Rapoport [Documentation] Reviewed-by: Jérôme Glisse Cc: Jonathan Corbet Cc: Ira Weiny Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Björn Töpel Cc: Christoph Hellwig Cc: Daniel Vetter Cc: Dan Williams Cc: Hans Verkuil Cc: Jason Gunthorpe Cc: Jason Gunthorpe Cc: Jens Axboe Cc: Kirill A. Shutemov Cc: Leon Romanovsky Cc: Mauro Carvalho Chehab Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/core-api/index.rst | 1 + Documentation/core-api/pin_user_pages.rst | 232 ++++++++++++++++++++++++++++++ 2 files changed, 233 insertions(+) create mode 100644 Documentation/core-api/pin_user_pages.rst (limited to 'Documentation') diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index bc0c727d7fd8..a501dc1c90d0 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -31,6 +31,7 @@ Core utilities generic-radix-tree memory-allocation mm-api + pin_user_pages gfp_mask-from-fs-io timekeeping boot-time-mm diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst new file mode 100644 index 000000000000..71849830cd48 --- /dev/null +++ b/Documentation/core-api/pin_user_pages.rst @@ -0,0 +1,232 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================================================== +pin_user_pages() and related calls +==================================================== + +.. contents:: :local: + +Overview +======== + +This document describes the following functions:: + + pin_user_pages() + pin_user_pages_fast() + pin_user_pages_remote() + +Basic description of FOLL_PIN +============================= + +FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*() +("gup") family of functions. FOLL_PIN has significant interactions and +interdependencies with FOLL_LONGTERM, so both are covered here. + +FOLL_PIN is internal to gup, meaning that it should not appear at the gup call +sites. This allows the associated wrapper functions (pin_user_pages*() and +others) to set the correct combination of these flags, and to check for problems +as well. + +FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites. +This is in order to avoid creating a large number of wrapper functions to cover +all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the +pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so +that's a natural dividing line, and a good point to make separate wrapper calls. +In other words, use pin_user_pages*() for DMA-pinned pages, and +get_user_pages*() for other cases. There are four cases described later on in +this document, to further clarify that concept. + +FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However, +multiple threads and call sites are free to pin the same struct pages, via both +FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the +other, not the struct page(s). + +The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN +uses a different reference counting technique. + +FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is, +FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN. + +Which flags are set by each wrapper +=================================== + +For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup +flags the caller provides. The caller is required to pass in a non-null struct +pages* array, and the function then pin pages by incrementing each by a special +value. For now, that value is +1, just like get_user_pages*().:: + + Function + -------- + pin_user_pages FOLL_PIN is always set internally by this function. + pin_user_pages_fast FOLL_PIN is always set internally by this function. + pin_user_pages_remote FOLL_PIN is always set internally by this function. + +For these get_user_pages*() functions, FOLL_GET might not even be specified. +Behavior is a little more complex than above. If FOLL_GET was *not* specified, +but the caller passed in a non-null struct pages* array, then the function +sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount +of each page by +1.:: + + Function + -------- + get_user_pages FOLL_GET is sometimes set internally by this function. + get_user_pages_fast FOLL_GET is sometimes set internally by this function. + get_user_pages_remote FOLL_GET is sometimes set internally by this function. + +Tracking dma-pinned pages +========================= + +Some of the key design constraints, and solutions, for tracking dma-pinned +pages: + +* An actual reference count, per struct page, is required. This is because + multiple processes may pin and unpin a page. + +* False positives (reporting that a page is dma-pinned, when in fact it is not) + are acceptable, but false negatives are not. + +* struct page may not be increased in size for this, and all fields are already + used. + +* Given the above, we can overload the page->_refcount field by using, sort of, + the upper bits in that field for a dma-pinned count. "Sort of", means that, + rather than dividing page->_refcount into bit fields, we simple add a medium- + large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to + page->_refcount. This provides fuzzy behavior: if a page has get_page() called + on it 1024 times, then it will appear to have a single dma-pinned count. + And again, that's acceptable. + +This also leads to limitations: there are only 31-10==21 bits available for a +counter that increments 10 bits at a time. + +TODO: for 1GB and larger huge pages, this is cutting it close. That's because +when pin_user_pages() follows such pages, it increments the head page by "1" +(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for +pin_user_pages()) for each tail page. So if you have a 1GB huge page: + +* There are 256K (18 bits) worth of 4 KB tail pages. +* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is, + 10 bits at a time) +* There are 21 - 18 == 3 bits available to count. Except that there aren't, + because you need to allow for a few normal get_page() calls on the head page, + as well. Fortunately, the approach of using addition, rather than "hard" + bitfields, within page->_refcount, allows for sharing these bits gracefully. + But we're still looking at about 8 references. + +This, however, is a missing feature more than anything else, because it's easily +solved by addressing an obvious inefficiency in the original get_user_pages() +approach of retrieving pages: stop treating all the pages as if they were +PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of +this, so some work is required. Once that's in place, this limitation mostly +disappears from view, because there will be ample refcounting range available. + +* Callers must specifically request "dma-pinned tracking of pages". In other + words, just calling get_user_pages() will not suffice; a new set of functions, + pin_user_page() and related, must be used. + +FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags +========================================================== + +Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing +these categories: + +CASE 1: Direct IO (DIO) +----------------------- +There are GUP references to pages that are serving +as DIO buffers. These buffers are needed for a relatively short time (so they +are not "long term"). No special synchronization with page_mkclean() or +munmap() is provided. Therefore, flags to set at the call site are: :: + + FOLL_PIN + +...but rather than setting FOLL_PIN directly, call sites should use one of +the pin_user_pages*() routines that set FOLL_PIN. + +CASE 2: RDMA +------------ +There are GUP references to pages that are serving as DMA +buffers. These buffers are needed for a long time ("long term"). No special +synchronization with page_mkclean() or munmap() is provided. Therefore, flags +to set at the call site are: :: + + FOLL_PIN | FOLL_LONGTERM + +NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's +because DAX pages do not have a separate page cache, and so "pinning" implies +locking down file system blocks, which is not (yet) supported in that way. + +CASE 3: Hardware with page faulting support +------------------------------------------- +Here, a well-written driver doesn't normally need to pin pages at all. However, +if the driver does choose to do so, it can register MMU notifiers for the range, +and will be called back upon invalidation. Either way (avoiding page pinning, or +using MMU notifiers to unpin upon request), there is proper synchronization with +both filesystem and mm (page_mkclean(), munmap(), etc). + +Therefore, neither flag needs to be set. + +In this case, ideally, neither get_user_pages() nor pin_user_pages() should be +called. Instead, the software should be written so that it does not pin pages. +This allows mm and filesystems to operate more efficiently and reliably. + +CASE 4: Pinning for struct page manipulation only +------------------------------------------------- +Here, normal GUP calls are sufficient, so neither flag needs to be set. + +page_dma_pinned(): the whole point of pinning +============================================= + +The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able +to query, "is this page DMA-pinned?" That allows code such as page_mkclean() +(and file system writeback code in general) to make informed decisions about +what to do when a page cannot be unmapped due to such pins. + +What to do in those cases is the subject of a years-long series of discussions +and debates (see the References at the end of this document). It's a TODO item +here: fill in the details once that's worked out. Meanwhile, it's safe to say +that having this available: :: + + static inline bool page_dma_pinned(struct page *page) + +...is a prerequisite to solving the long-running gup+DMA problem. + +Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM +=================================================================== + +Another way of thinking about these flags is as a progression of restrictions: +FOLL_GET is for struct page manipulation, without affecting the data that the +struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for +short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is +a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more +restrictive case that has FOLL_PIN as a prerequisite: this is for pages that +will be pinned longterm, and whose data will be accessed. + +Unit testing +============ +This file:: + + tools/testing/selftests/vm/gup_benchmark.c + +has the following new calls to exercise the new pin*() wrapper functions: + +* PIN_FAST_BENCHMARK (./gup_benchmark -a) +* PIN_BENCHMARK (./gup_benchmark -b) + +You can monitor how many total dma-pinned pages have been acquired and released +since the system was booted, via two new /proc/vmstat entries: :: + + /proc/vmstat/nr_foll_pin_requested + /proc/vmstat/nr_foll_pin_requested + +Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is +because there is a noticeable performance drop in put_user_page(), when they +are activated. + +References +========== + +* `Some slow progress on get_user_pages() (Apr 2, 2019) `_ +* `DMA and get_user_pages() (LPC: Dec 12, 2018) `_ +* `The trouble with get_user_pages() (Apr 30, 2018) `_ + +John Hubbard, October, 2019 -- cgit From f1f6a7dd9b53aafd81b696b9017036e7b08e57ea Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Thu, 30 Jan 2020 22:13:35 -0800 Subject: mm, tree-wide: rename put_user_page*() to unpin_user_page*() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In order to provide a clearer, more symmetric API for pinning and unpinning DMA pages. This way, pin_user_pages*() calls match up with unpin_user_pages*() calls, and the API is a lot closer to being self-explanatory. Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com Signed-off-by: John Hubbard Reviewed-by: Jan Kara Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Björn Töpel Cc: Christoph Hellwig Cc: Daniel Vetter Cc: Dan Williams Cc: Hans Verkuil Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jason Gunthorpe Cc: Jens Axboe Cc: Jerome Glisse Cc: Jonathan Corbet Cc: Kirill A. Shutemov Cc: Leon Romanovsky Cc: Mauro Carvalho Chehab Cc: Mike Rapoport Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/core-api/pin_user_pages.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 71849830cd48..1d490155ecd7 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -219,7 +219,7 @@ since the system was booted, via two new /proc/vmstat entries: :: /proc/vmstat/nr_foll_pin_requested Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is -because there is a noticeable performance drop in put_user_page(), when they +because there is a noticeable performance drop in unpin_user_page(), when they are activated. References -- cgit From 45190f01dd402112d3d22c0ddc4152994f9e1e55 Mon Sep 17 00:00:00 2001 From: Vitaly Wool Date: Thu, 30 Jan 2020 22:15:04 -0800 Subject: mm/zswap.c: add allocation hysteresis if pool limit is hit zswap will always try to shrink pool when zswap is full. If there is a high pressure on zswap it will result in flipping pages in and out zswap pool without any real benefit, and the overall system performance will drop. The previous discussion on this subject [1] ended up with a suggestion to implement a sort of hysteresis to refuse taking pages into zswap pool until it has sufficient space if the limit has been hit. This is my take on this. Hysteresis is controlled with a sysfs-configurable parameter (namely, /sys/kernel/debug/zswap/accept_threhsold_percent). It specifies the threshold at which zswap would start accepting pages again after it became full. Setting this parameter to 100 disables the hysteresis and sets the zswap behavior to pre-hysteresis state. [1] https://lkml.org/lkml/2019/11/8/949 Link: http://lkml.kernel.org/r/20200108200118.15563-1-vitaly.wool@konsulko.com Signed-off-by: Vitaly Wool Cc: Dan Streetman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/zswap.rst | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'Documentation') diff --git a/Documentation/vm/zswap.rst b/Documentation/vm/zswap.rst index 1444ecd40911..61f6185188cd 100644 --- a/Documentation/vm/zswap.rst +++ b/Documentation/vm/zswap.rst @@ -130,6 +130,19 @@ checking for the same-value filled pages during store operation. However, the existing pages which are marked as same-value filled pages remain stored unchanged in zswap until they are either loaded or invalidated. +To prevent zswap from shrinking pool when zswap is full and there's a high +pressure on swap (this will result in flipping pages in and out zswap pool +without any real benefit but with a performance drop for the system), a +special parameter has been introduced to implement a sort of hysteresis to +refuse taking pages into zswap pool until it has sufficient space if the limit +has been hit. To set the threshold at which zswap would start accepting pages +again after it became full, use the sysfs ``accept_threhsold_percent`` +attribute, e. g.:: + + echo 80 > /sys/module/zswap/parameters/accept_threhsold_percent + +Setting this parameter to 100 will disable the hysteresis. + A debugfs interface is provided for various statistic about pool size, number of pages stored, same-value filled pages and various counters for the reasons pages are rejected. -- cgit From c65e6815db1c2e28d5554bd99d3a6e522ab599d1 Mon Sep 17 00:00:00 2001 From: Mikhail Zaslonko Date: Thu, 30 Jan 2020 22:16:27 -0800 Subject: s390/boot: add dfltcc= kernel command line parameter Add the new kernel command line parameter 'dfltcc=' to configure s390 zlib hardware support. Format: { on | off | def_only | inf_only | always } on: s390 zlib hardware support for compression on level 1 and decompression (default) off: No s390 zlib hardware support def_only: s390 zlib hardware support for deflate only (compression on level 1) inf_only: s390 zlib hardware support for inflate only (decompression) always: Same as 'on' but ignores the selected compression level always using hardware support (used for debugging) Link: http://lkml.kernel.org/r/20200103223334.20669-5-zaslonko@linux.ibm.com Signed-off-by: Mikhail Zaslonko Cc: Chris Mason Cc: Christian Borntraeger Cc: David Sterba Cc: Eduard Shishkin Cc: Heiko Carstens Cc: Ilya Leoshkevich Cc: Josef Bacik Cc: Richard Purdie Cc: Vasily Gorbik Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/kernel-parameters.txt | 12 ++++++++++++ 1 file changed, 12 insertions(+) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ec92120a7952..ddc5ccdd4cd1 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -834,6 +834,18 @@ dump out devices still on the deferred probe list after retrying. + dfltcc= [HW,S390] + Format: { on | off | def_only | inf_only | always } + on: s390 zlib hardware support for compression on + level 1 and decompression (default) + off: No s390 zlib hardware support + def_only: s390 zlib hardware support for deflate + only (compression on level 1) + inf_only: s390 zlib hardware support for inflate + only (decompression) + always: Same as 'on' but ignores the selected compression + level always using hardware support (used for debugging) + dhash_entries= [KNL] Set number of hash buckets for dentry cache. -- cgit