summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-06-14cfg80211: use BIT_ULL in cfg80211_parse_mbssid_data()Luca Coelho
The seen_indices variable is u64 and in other parts of the code we assume mbssid_index_ie[2] can be up to 45, so we should use the 64-bit versions of BIT, namely, BIT_ULL(). Reported-by: Dan Carpented <dan.carpenter@oracle.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14mac80211: only warn once on chanctx_conf being NULLYibo Zhao
In multiple SSID cases, it takes time to prepare every AP interface to be ready in initializing phase. If a sta already knows everything it needs to join one of the APs and sends authentication to the AP which is not fully prepared at this point of time, AP's channel context could be NULL. As a result, warning message occurs. Even worse, if the AP is under attack via tools such as MDK3 and massive authentication requests are received in a very short time, console will be hung due to kernel warning messages. WARN_ON_ONCE() could be a better way for indicating warning messages without duplicate messages to flood the console. Johannes: We still need to address the underlying problem, but we don't really have a good handle on it yet. Suppress the worst side-effects for now. Signed-off-by: Zhi Chen <zhichen@codeaurora.org> Signed-off-by: Yibo Zhao <yiboz@codeaurora.org> [johannes: add note, change subject] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14mac80211: drop robust management frames from unknown TAJohannes Berg
When receiving a robust management frame, drop it if we don't have rx->sta since then we don't have a security association and thus couldn't possibly validate the frame. Cc: stable@vger.kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14ima: Use designated initializers for struct ima_event_dataThiago Jung Bauermann
Designated initializers allow specifying only the members of the struct that need initialization. Non-mentioned members are initialized to zero. This makes the code a bit clearer (particularly in ima_add_boot_aggregate) and also allows adding a new member to the struct without having to update all struct initializations. Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-06-14ima: use the lsm policy update notifierJanne Karhunen
Don't do lazy policy updates while running the rule matching, run the updates as they happen. Depends on commit f242064c5df3 ("LSM: switch to blocking policy update notifiers") Signed-off-by: Janne Karhunen <janne.karhunen@gmail.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-06-14LSM: switch to blocking policy update notifiersJanne Karhunen
Atomic policy updaters are not very useful as they cannot usually perform the policy updates on their own. Since it seems that there is no strict need for the atomicity, switch to the blocking variant. While doing so, rename the functions accordingly. Signed-off-by: Janne Karhunen <janne.karhunen@gmail.com> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-06-14x86/ima: fix the Kconfig dependency for IMA_ARCH_POLICYNayna Jain
If enabled, ima arch specific policies always adds the measurements rules, this makes it dependent on CONFIG_IMA. CONFIG_IMA_APPRAISE implicitly takes care of this, however it is needed explicitly for CONFIG_KEXEC_VERIFY_SIG. This patch adds the CONFIG_IMA dependency in combination with CONFIG_KEXEC_VERIFY_SIG for CONFIG_IMA_ARCH_POLICY Fixes: d958083a8f640 (x86/ima: define arch_get_ima_policy() for x86) Signed-off-by: Nayna Jain <nayna@linux.ibm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Dave Young <dyoung@redhat.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-06-14ima: Make arch_policy_entry staticYueHaibing
Fix sparse warning: security/integrity/ima/ima_policy.c:202:23: warning: symbol 'arch_policy_entry' was not declared. Should it be static? Fixes: 6191706246de ("ima: add support for arch specific policies") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Cc: stable@vger.kernel.org (linux-5.0) Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-06-14debugobjects: Move printk out of db->lock critical sectionsWaiman Long
The db->lock is a raw spinlock and so the lock hold time is supposed to be short. This will not be the case when printk() is being involved in some of the critical sections. In order to avoid the long hold time, in case some messages need to be printed, the debug_object_is_on_stack() and debug_print_object() calls are now moved out of those critical sections. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: Qian Cai <cai@gmx.us> Cc: Zhong Jiang <zhongjiang@huawei.com> Link: https://lkml.kernel.org/r/20190520141450.7575-6-longman@redhat.com
2019-06-14debugobjects: Less aggressive freeing of excess debug objectsWaiman Long
After a system bootup and 3 parallel kernel builds, a partial output of the debug objects stats file was: pool_free :5101 pool_pcp_free :4181 pool_min_free :220 pool_used :104172 pool_max_used :171920 on_free_list :0 objs_allocated:39268280 objs_freed :39160031 More than 39 millions debug objects had since been allocated and then freed. The pool_max_used, however, was only about 172k. So this is a lot of extra overhead in freeing and allocating objects from slabs. It may also causes the slabs to be more fragmented and harder to reclaim. Make the freeing of excess debug objects less aggressive by freeing them at a maximum frequency of 10Hz and about 1k objects at each round of freeing. With that change applied, the partial output of the debug objects stats file after similar actions became: pool_free :5901 pool_pcp_free :3742 pool_min_free :1022 pool_used :104805 pool_max_used :168081 on_free_list :0 objs_allocated:5796864 objs_freed :5687182 Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: Qian Cai <cai@gmx.us> Cc: Zhong Jiang <zhongjiang@huawei.com> Link: https://lkml.kernel.org/r/20190520141450.7575-5-longman@redhat.com
2019-06-14debugobjects: Reduce number of pool_lock acquisitions in fill_pool()Waiman Long
In fill_pool(), the pool_lock is acquired and then released once per debug object. If many objects are to be filled, the constant lock and unlock operations are extra overhead. To reduce the overhead, batch them up and do an allocation of 4 objects per lock/unlock sequence. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: Qian Cai <cai@gmx.us> Cc: Zhong Jiang <zhongjiang@huawei.com> Link: https://lkml.kernel.org/r/20190520141450.7575-4-longman@redhat.com
2019-06-14debugobjects: Percpu pool lookahead freeing/allocationWaiman Long
Most workloads will allocate a bunch of memory objects, work on them and then freeing all or most of them. So just having a percpu free pool may not reduce the pool_lock contention significantly if large number of objects are being used. To help those situations, we are now doing lookahead allocation and freeing of the debug objects into and out of the percpu free pool. This will hopefully reduce the number of times the pool_lock needs to be taken and hence its contention level. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: Qian Cai <cai@gmx.us> Cc: Zhong Jiang <zhongjiang@huawei.com> Link: https://lkml.kernel.org/r/20190520141450.7575-3-longman@redhat.com
2019-06-14debugobjects: Add percpu free poolsWaiman Long
When a multi-threaded workload does a lot of small memory object allocations and deallocations, it may cause the allocation and freeing of many debug objects. This will make the global pool_lock a bottleneck in the performance of the workload. Since interrupts are disabled when acquiring the pool_lock, it may even cause hard lockups to happen. To reduce contention of the global pool_lock, add a percpu debug object free pool that can be used to buffer some of the debug object allocation and freeing requests without acquiring the pool_lock. Each CPU will now have a percpu free pool that can hold up to a maximum of 64 debug objects. Allocation and freeing requests will go to the percpu free pool first. If that fails, the pool_lock will be taken and the global free pool will be used. The presence or absence of obj_cache is used as a marker to see if the percpu cache should be used. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: Qian Cai <cai@gmx.us> Cc: Zhong Jiang <zhongjiang@huawei.com> Link: https://lkml.kernel.org/r/20190520141450.7575-2-longman@redhat.com
2019-06-14debugobjects: No need to check return value of debugfs_create()Greg Kroah-Hartman
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Qian Cai <cai@gmx.us> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Waiman Long <longman@redhat.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: Zhong Jiang <zhongjiang@huawei.com> Link: https://lkml.kernel.org/r/20190612153513.GA21082@kroah.com
2019-06-14gpu: ipu-v3: image-convert: Fix image downsize coefficientsSteve Longerbeam
The output of the IC downsizer unit in both dimensions must be <= 1024 before being passed to the IC resizer unit. This was causing corrupted images when: input_dim > 1024, and input_dim / 2 < output_dim < input_dim Some broken examples were 1920x1080 -> 1024x768 and 1920x1080 -> 1280x1080. Fixes: 70b9b6b3bcb21 ("gpu: ipu-v3: image-convert: calculate per-tile resize coefficients") Signed-off-by: Steve Longerbeam <slongerbeam@gmail.com> Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
2019-06-14gpu: ipu-v3: image-convert: Fix input bytesperline for packed formatsSteve Longerbeam
The input bytesperline calculation for packed pixel formats was incorrect. The min/max clamping values must be multiplied by the packed bits-per-pixel. This was causing corrupted converted images when the input format was RGB4 (probably also other input packed formats). Fixes: d966e23d61a2c ("gpu: ipu-v3: image-convert: fix bytesperline adjustment") Reported-by: Harsha Manjula Mallikarjun <Harsha.ManjulaMallikarjun@in.bosch.com> Suggested-by: Harsha Manjula Mallikarjun <Harsha.ManjulaMallikarjun@in.bosch.com> Signed-off-by: Steve Longerbeam <slongerbeam@gmail.com> Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
2019-06-14gpu: ipu-v3: image-convert: Fix input bytesperline width/height alignSteve Longerbeam
The output width and height alignment values were being used in the input bytesperline calculation. Fix by separating local vars w_align and h_align into w_align_in, h_align_in, w_align_out, and h_align_out. Fixes: d966e23d61a2c ("gpu: ipu-v3: image-convert: fix bytesperline adjustment") Signed-off-by: Steve Longerbeam <slongerbeam@gmail.com> Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
2019-06-14thunderbolt: Implement CIO reset correctly for Titan RidgeMika Westerberg
When starting ICM firmware on Apple systems we need to perform CIO reset as part of the flow. However, it turns out that the reset register has changed to another location in Titan Ridge. Fix this by introducing ->cio_reset() callback with corresponding implementations for Alpine and Titan Ridge. Fixes: c4630d6ae6e3 ("thunderbolt: Start firmware on Titan Ridge Apple systems") Reported-by: Peter Bowen <pzb@amazon.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-06-14arm64: remove redundant 'default n' from KconfigBartlomiej Zolnierkiewicz
'default n' is the default value for any bool or tristate Kconfig setting so there is no need to write it explicitly. Also since commit f467c5640c29 ("kconfig: only write '# CONFIG_FOO is not set' for visible symbols") the Kconfig behavior is the same regardless of 'default n' being present or not: ... One side effect of (and the main motivation for) this change is making the following two definitions behave exactly the same: config FOO bool config FOO bool default n With this change, neither of these will generate a '# CONFIG_FOO is not set' line (assuming FOO isn't selected/implied). That might make it clearer to people that a bare 'default n' is redundant. ... Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2019-06-14ARM: davinci: da8xx: specify dma_coherent_mask for lcdcBartosz Golaszewski
The lcdc device is missing the dma_coherent_mask definition causing the following warning on da850-evm: da8xx_lcdc da8xx_lcdc.0: found Sharp_LK043T1DG01 panel ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1 at kernel/dma/mapping.c:247 dma_alloc_attrs+0xc8/0x110 Modules linked in: CPU: 0 PID: 1 Comm: swapper Not tainted 5.2.0-rc3-00077-g16d72dd4891f #18 Hardware name: DaVinci DA850/OMAP-L138/AM18x EVM [<c000fce8>] (unwind_backtrace) from [<c000d900>] (show_stack+0x10/0x14) [<c000d900>] (show_stack) from [<c001a4f8>] (__warn+0xec/0x114) [<c001a4f8>] (__warn) from [<c001a634>] (warn_slowpath_null+0x3c/0x48) [<c001a634>] (warn_slowpath_null) from [<c0065860>] (dma_alloc_attrs+0xc8/0x110) [<c0065860>] (dma_alloc_attrs) from [<c02820f8>] (fb_probe+0x228/0x5a8) [<c02820f8>] (fb_probe) from [<c02d3e9c>] (platform_drv_probe+0x48/0x9c) [<c02d3e9c>] (platform_drv_probe) from [<c02d221c>] (really_probe+0x1d8/0x2d4) [<c02d221c>] (really_probe) from [<c02d2474>] (driver_probe_device+0x5c/0x168) [<c02d2474>] (driver_probe_device) from [<c02d2728>] (device_driver_attach+0x58/0x60) [<c02d2728>] (device_driver_attach) from [<c02d27b0>] (__driver_attach+0x80/0xbc) [<c02d27b0>] (__driver_attach) from [<c02d047c>] (bus_for_each_dev+0x64/0xb4) [<c02d047c>] (bus_for_each_dev) from [<c02d1590>] (bus_add_driver+0xe4/0x1d8) [<c02d1590>] (bus_add_driver) from [<c02d301c>] (driver_register+0x78/0x10c) [<c02d301c>] (driver_register) from [<c000a5c0>] (do_one_initcall+0x48/0x1bc) [<c000a5c0>] (do_one_initcall) from [<c05cae6c>] (kernel_init_freeable+0x10c/0x1d8) [<c05cae6c>] (kernel_init_freeable) from [<c048a000>] (kernel_init+0x8/0xf4) [<c048a000>] (kernel_init) from [<c00090e0>] (ret_from_fork+0x14/0x34) Exception stack(0xc6837fb0 to 0xc6837ff8) 7fa0: 00000000 00000000 00000000 00000000 7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 7fe0: 00000000 00000000 00000000 00000000 00000013 00000000 ---[ end trace 8a8073511be81dd2 ]--- Add a 32-bit mask to the platform device's definition. Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: Sekhar Nori <nsekhar@ti.com>
2019-06-14ARM: davinci: da850-evm: call regulator_has_full_constraints()Bartosz Golaszewski
The BB expander at 0x21 i2c bus 1 fails to probe on da850-evm because the board doesn't set has_full_constraints to true in the regulator API. Call regulator_has_full_constraints() at the end of board registration just like we do in da850-lcdk and da830-evm. Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: Sekhar Nori <nsekhar@ti.com>
2019-06-14x86/cpufeatures: Add FDP_EXCPTN_ONLY and ZERO_FCS_FDSAaron Lewis
Add the CPUID enumeration for Intel's de-feature bits to accommodate passing these de-features through to kvm guests. These de-features are (from SDM vol 1, section 8.1.8): - X86_FEATURE_FDP_EXCPTN_ONLY: If CPUID.(EAX=07H,ECX=0H):EBX[bit 6] = 1, the data pointer (FDP) is updated only for the x87 non-control instructions that incur unmasked x87 exceptions. - X86_FEATURE_ZERO_FCS_FDS: If CPUID.(EAX=07H,ECX=0H):EBX[bit 13] = 1, the processor deprecates FCS and FDS; it saves each as 0000H. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jim Mattson <jmattson@google.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: marcorr@google.com Cc: Peter Feiner <pfeiner@google.com> Cc: pshier@google.com Cc: Robert Hoo <robert.hu@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Lendacky <Thomas.Lendacky@amd.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190605220252.103406-1-aaronlewis@google.com
2019-06-14timekeeping: Repair ktime_get_coarse*() granularityThomas Gleixner
Jason reported that the coarse ktime based time getters advance only once per second and not once per tick as advertised. The code reads only the monotonic base time, which advances once per second. The nanoseconds are accumulated on every tick in xtime_nsec up to a second and the regular time getters take this nanoseconds offset into account, but the ktime_get_coarse*() implementation fails to do so. Add the accumulated xtime_nsec value to the monotonic base time to get the proper per tick advancing coarse tinme. Fixes: b9ff604cff11 ("timekeeping: Add ktime_get_coarse_with_offset") Reported-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clemens Ladisch <clemens@ladisch.de> Cc: Sultan Alsawaf <sultan@kerneltoast.com> Cc: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906132136280.1791@nanos.tec.linutronix.de
2019-06-14Merge tag 'drm-misc-fixes-2019-06-13' of ↵Daniel Vetter
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes Sean writes: meson: A few G12A fixes across the driver (Neil) quirks: A couple quirks for GPD devices (Hans) gem_shmem: Use writecombine when vmapping non-dmabuf BOs (Boris) panfrost: A couple tweaks to requiring devfreq (Neil & Ezequiel) edid: Ensure we return the override mode when ddc probe fails (Jani) Cc: Hans de Goede <hdegoede@redhat.com> Cc: Neil Armstrong <narmstrong@baylibre.com> Cc: Boris Brezillon <boris.brezillon@collabora.com> Cc: Ezequiel Garcia <ezequiel@collabora.com> Cc: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> From: Sean Paul <sean@poorly.run> Link: https://patchwork.freedesktop.org/patch/msgid/20190613143946.GA24233@art_vandelay
2019-06-14perf/x86: Add Intel Ice Lake NNPI uncore supportRajneesh Bhardwaj
Intel Ice Lake uncore support already included IMC PCI ID but ICL-NNPI CPUID is missing so add it to fix the probe function. Fixes: e39875d15ad6 ("perf/x86: add Intel Icelake uncore support") Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: alexander.shishkin@linux.intel.com Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Linux PM <linux-pm@vger.kernel.org> Link: https://lkml.kernel.org/r/20190614081701.13828-1-rajneesh.bhardwaj@linux.intel.com
2019-06-14Revert "ALSA: hda/realtek - Improve the headset mic for Acer Aspire laptops"Hui Wang
This reverts commit 9cb40eb184c4220d244a532bd940c6345ad9dbd9. This patch introduces noise and headphone playback issue after rebooting or suspending/resuming. Let us revert it. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=203831 Fixes: 9cb40eb184c4 ("ALSA: hda/realtek - Improve the headset mic for Acer Aspire laptops") Cc: <stable@vger.kernel.org> Signed-off-by: Hui Wang <hui.wang@canonical.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2019-06-13mm/devm_memremap_pages: fix final page put raceDan Williams
Logan noticed that devm_memremap_pages_release() kills the percpu_ref drops all the page references that were acquired at init and then immediately proceeds to unplug, arch_remove_memory(), the backing pages for the pagemap. If for some reason device shutdown actually collides with a busy / elevated-ref-count page then arch_remove_memory() should be deferred until after that reference is dropped. As it stands the "wait for last page ref drop" happens *after* devm_memremap_pages_release() returns, which is obviously too late and can lead to crashes. Fix this situation by assigning the responsibility to wait for the percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup() callback. Implement the new cleanup callback for all devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma. Link: http://lkml.kernel.org/r/155727339156.292046.5432007428235387859.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 41e94a851304 ("add devm_memremap_pages") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13PCI/P2PDMA: track pgmap references per resource, not globallyDan Williams
In preparation for fixing a race between devm_memremap_pages_release() and the final put of a page from the device-page-map, allocate a percpu-ref per p2pdma resource mapping. Link: http://lkml.kernel.org/r/155727338646.292046.9922678317501435597.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13lib/genalloc: introduce chunk ownersDan Williams
The p2pdma facility enables a provider to publish a pool of dma addresses for a consumer to allocate. A genpool is used internally by p2pdma to collect dma resources, 'chunks', to be handed out to consumers. Whenever a consumer allocates a resource it needs to pin the 'struct dev_pagemap' instance that backs the chunk selected by pci_alloc_p2pmem(). Currently that reference is taken globally on the entire provider device. That sets up a lifetime mismatch whereby the p2pdma core needs to maintain hacks to make sure the percpu_ref is not released twice. This lifetime mismatch also stands in the way of a fix to devm_memremap_pages() whereby devm_memremap_pages_release() must wait for the percpu_ref ->release() callback to complete before it can proceed to teardown pages. So, towards fixing this situation, introduce the ability to store a 'chunk owner' at gen_pool_add() time, and a facility to retrieve the owner at gen_pool_{alloc,free}() time. For p2pdma this will be used to store and recall individual dev_pagemap reference counter instances per-chunk. Link: http://lkml.kernel.org/r/155727338118.292046.13407378933221579644.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13PCI/P2PDMA: fix the gen_pool_add_virt() failure pathDan Williams
The pci_p2pdma_add_resource() implementation immediately frees the pgmap if gen_pool_add_virt() fails. However, that means that when @dev triggers a devres release devm_memremap_pages_release() will crash trying to access the freed @pgmap. Use the new devm_memunmap_pages() to manually free the mapping in the error path. Link: http://lkml.kernel.org/r/155727337603.292046.13101332703665246702.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Fixes: 52916982af48 ("PCI/P2PDMA: Support peer-to-peer memory") Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13mm/devm_memremap_pages: introduce devm_memunmap_pagesDan Williams
Use the new devm_release_action() facility to allow devm_memremap_pages_release() to be manually triggered. Link: http://lkml.kernel.org/r/155727337088.292046.5774214552136776763.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13drivers/base/devres: introduce devm_release_action()Dan Williams
Patch series "mm/devm_memremap_pages: Fix page release race", v2. Logan audited the devm_memremap_pages() shutdown path and noticed that it was possible to proceed to arch_remove_memory() before all potential page references have been reaped. Introduce a new ->cleanup() callback to do the work of waiting for any straggling page references and then perform the percpu_ref_exit() in devm_memremap_pages_release() context. For p2pdma this involves some deeper reworks to reference count resources on a per-instance basis rather than a per pci-device basis. A modified genalloc api is introduced to convey a driver-private pointer through gen_pool_{alloc,free}() interfaces. Also, a devm_memunmap_pages() api is introduced since p2pdma does not auto-release resources on a setup failure. The dax and pmem changes pass the nvdimm unit tests, and the p2pdma changes should now pass testing with the pci_p2pdma_release() fix. Jrme, how does this look for HMM? This patch (of 6): The devm_add_action() facility allows a resource allocation routine to add custom devm semantics. One such user is devm_memremap_pages(). There is now a need to manually trigger devm_memremap_pages_release(). Introduce devm_release_action() so the release action can be triggered via a new devm_memunmap_pages() api in a follow-on change. Link: http://lkml.kernel.org/r/155727336530.292046.2926860263201336366.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: "Jérôme Glisse" <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13mm/vmscan.c: fix trying to reclaim unevictable LRU pageMinchan Kim
There was the below bug report from Wu Fangsuo. On the CMA allocation path, isolate_migratepages_range() could isolate unevictable LRU pages and reclaim_clean_page_from_list() can try to reclaim them if they are clean file-backed pages. page:ffffffbf02f33b40 count:86 mapcount:84 mapping:ffffffc08fa7a810 index:0x24 flags: 0x19040c(referenced|uptodate|arch_1|mappedtodisk|unevictable|mlocked) raw: 000000000019040c ffffffc08fa7a810 0000000000000024 0000005600000053 raw: ffffffc009b05b20 ffffffc009b05b20 0000000000000000 ffffffc09bf3ee80 page dumped because: VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page)) page->mem_cgroup:ffffffc09bf3ee80 ------------[ cut here ]------------ kernel BUG at /home/build/farmland/adroid9.0/kernel/linux/mm/vmscan.c:1350! Internal error: Oops - BUG: 0 [#1] PREEMPT SMP Modules linked in: CPU: 0 PID: 7125 Comm: syz-executor Tainted: G S 4.14.81 #3 Hardware name: ASR AQUILAC EVB (DT) task: ffffffc00a54cd00 task.stack: ffffffc009b00000 PC is at shrink_page_list+0x1998/0x3240 LR is at shrink_page_list+0x1998/0x3240 pc : [<ffffff90083a2158>] lr : [<ffffff90083a2158>] pstate: 60400045 sp : ffffffc009b05940 .. shrink_page_list+0x1998/0x3240 reclaim_clean_pages_from_list+0x3c0/0x4f0 alloc_contig_range+0x3bc/0x650 cma_alloc+0x214/0x668 ion_cma_allocate+0x98/0x1d8 ion_alloc+0x200/0x7e0 ion_ioctl+0x18c/0x378 do_vfs_ioctl+0x17c/0x1780 SyS_ioctl+0xac/0xc0 Wu found it's due to commit ad6b67041a45 ("mm: remove SWAP_MLOCK in ttu"). Before that, unevictable pages go to cull_mlocked so that we can't reach the VM_BUG_ON_PAGE line. To fix the issue, this patch filters out unevictable LRU pages from the reclaim_clean_pages_from_list in CMA. Link: http://lkml.kernel.org/r/20190524071114.74202-1-minchan@kernel.org Fixes: ad6b67041a45 ("mm: remove SWAP_MLOCK in ttu") Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Wu Fangsuo <fangsuowu@asrmicro.com> Debugged-by: Wu Fangsuo <fangsuowu@asrmicro.com> Tested-by: Wu Fangsuo <fangsuowu@asrmicro.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Pankaj Suryawanshi <pankaj.suryawanshi@einfochips.com> Cc: <stable@vger.kernel.org> [4.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13coredump: fix race condition between collapse_huge_page() and core dumpingAndrea Arcangeli
When fixing the race conditions between the coredump and the mmap_sem holders outside the context of the process, we focused on mmget_not_zero()/get_task_mm() callers in 04f5866e41fb70 ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping"), but those aren't the only cases where the mmap_sem can be taken outside of the context of the process as Michal Hocko noticed while backporting that commit to older -stable kernels. If mmgrab() is called in the context of the process, but then the mm_count reference is transferred outside the context of the process, that can also be a problem if the mmap_sem has to be taken for writing through that mm_count reference. khugepaged registration calls mmgrab() in the context of the process, but the mmap_sem for writing is taken later in the context of the khugepaged kernel thread. collapse_huge_page() after taking the mmap_sem for writing doesn't modify any vma, so it's not obvious that it could cause a problem to the coredump, but it happens to modify the pmd in a way that breaks an invariant that pmd_trans_huge_lock() relies upon. collapse_huge_page() needs the mmap_sem for writing just to block concurrent page faults that call pmd_trans_huge_lock(). Specifically the invariant that "!pmd_trans_huge()" cannot become a "pmd_trans_huge()" doesn't hold while collapse_huge_page() runs. The coredump will call __get_user_pages() without mmap_sem for reading, which eventually can invoke a lockless page fault which will need a functional pmd_trans_huge_lock(). So collapse_huge_page() needs to use mmget_still_valid() to check it's not running concurrently with the coredump... as long as the coredump can invoke page faults without holding the mmap_sem for reading. This has "Fixes: khugepaged" to facilitate backporting, but in my view it's more a bug in the coredump code that will eventually have to be rewritten to stop invoking page faults without the mmap_sem for reading. So the long term plan is still to drop all mmget_still_valid(). Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com Fixes: ba76149f47d8 ("thp: khugepaged") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13mm/mlock.c: change count_mm_mlocked_page_nr return typeswkhack
On a 64-bit machine the value of "vma->vm_end - vma->vm_start" may be negative when using 32 bit ints and the "count >> PAGE_SHIFT"'s result will be wrong. So change the local variable and return value to unsigned long to fix the problem. Link: http://lkml.kernel.org/r/20190513023701.83056-1-swkhack@gmail.com Fixes: 0cf2f6f6dc60 ("mm: mlock: check against vma for actual mlock() size") Signed-off-by: swkhack <swkhack@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13mm: mmu_gather: remove __tlb_reset_range() for force flushYang Shi
A few new fields were added to mmu_gather to make TLB flush smarter for huge page by telling what level of page table is changed. __tlb_reset_range() is used to reset all these page table state to unchanged, which is called by TLB flush for parallel mapping changes for the same range under non-exclusive lock (i.e. read mmap_sem). Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"), the syscalls (e.g. MADV_DONTNEED, MADV_FREE) which may update PTEs in parallel don't remove page tables. But, the forementioned commit may do munmap() under read mmap_sem and free page tables. This may result in program hang on aarch64 reported by Jan Stancek. The problem could be reproduced by his test program with slightly modified below. ---8<--- static int map_size = 4096; static int num_iter = 500; static long threads_total; static void *distant_area; void *map_write_unmap(void *ptr) { int *fd = ptr; unsigned char *map_address; int i, j = 0; for (i = 0; i < num_iter; i++) { map_address = mmap(distant_area, (size_t) map_size, PROT_WRITE | PROT_READ, MAP_SHARED | MAP_ANONYMOUS, -1, 0); if (map_address == MAP_FAILED) { perror("mmap"); exit(1); } for (j = 0; j < map_size; j++) map_address[j] = 'b'; if (munmap(map_address, map_size) == -1) { perror("munmap"); exit(1); } } return NULL; } void *dummy(void *ptr) { return NULL; } int main(void) { pthread_t thid[2]; /* hint for mmap in map_write_unmap() */ distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); munmap(distant_area, (size_t)DISTANT_MMAP_SIZE); distant_area += DISTANT_MMAP_SIZE / 2; while (1) { pthread_create(&thid[0], NULL, map_write_unmap, NULL); pthread_create(&thid[1], NULL, dummy, NULL); pthread_join(thid[0], NULL); pthread_join(thid[1], NULL); } } ---8<--- The program may bring in parallel execution like below: t1 t2 munmap(map_address) downgrade_write(&mm->mmap_sem); unmap_region() tlb_gather_mmu() inc_tlb_flush_pending(tlb->mm); free_pgtables() tlb->freed_tables = 1 tlb->cleared_pmds = 1 pthread_exit() madvise(thread_stack, 8M, MADV_DONTNEED) zap_page_range() tlb_gather_mmu() inc_tlb_flush_pending(tlb->mm); tlb_finish_mmu() if (mm_tlb_flush_nested(tlb->mm)) __tlb_reset_range() __tlb_reset_range() would reset freed_tables and cleared_* bits, but this may cause inconsistency for munmap() which do free page tables. Then it may result in some architectures, e.g. aarch64, may not flush TLB completely as expected to have stale TLB entries remained. Use fullmm flush since it yields much better performance on aarch64 and non-fullmm doesn't yields significant difference on x86. The original proposed fix came from Jan Stancek who mainly debugged this issue, I just wrapped up everything together. Jan's testing results: v5.2-rc2-24-gbec7550cca10 -------------------------- mean stddev real 37.382 2.780 user 1.420 0.078 sys 54.658 1.855 v5.2-rc2-24-gbec7550cca10 + "mm: mmu_gather: remove __tlb_reset_range() for force flush" ---------------------------------------------------------------------------------------_ mean stddev real 37.119 2.105 user 1.548 0.087 sys 55.698 1.357 [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.alibaba.com Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Jan Stancek <jstancek@redhat.com> Reported-by: Jan Stancek <jstancek@redhat.com> Tested-by: Jan Stancek <jstancek@redhat.com> Suggested-by: Will Deacon <will.deacon@arm.com> Tested-by: Will Deacon <will.deacon@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Piggin <npiggin@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Nadav Amit <namit@vmware.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> [4.20+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13fs/ocfs2: fix race in ocfs2_dentry_attach_lock()Wengang Wang
ocfs2_dentry_attach_lock() can be executed in parallel threads against the same dentry. Make that race safe. The race is like this: thread A thread B (A1) enter ocfs2_dentry_attach_lock, seeing dentry->d_fsdata is NULL, and no alias found by ocfs2_find_local_alias, so kmalloc a new ocfs2_dentry_lock structure to local variable "dl", dl1 ..... (B1) enter ocfs2_dentry_attach_lock, seeing dentry->d_fsdata is NULL, and no alias found by ocfs2_find_local_alias so kmalloc a new ocfs2_dentry_lock structure to local variable "dl", dl2. ...... (A2) set dentry->d_fsdata with dl1, call ocfs2_dentry_lock() and increase dl1->dl_lockres.l_ro_holders to 1 on success. ...... (B2) set dentry->d_fsdata with dl2 call ocfs2_dentry_lock() and increase dl2->dl_lockres.l_ro_holders to 1 on success. ...... (A3) call ocfs2_dentry_unlock() and decrease dl2->dl_lockres.l_ro_holders to 0 on success. .... (B3) call ocfs2_dentry_unlock(), decreasing dl2->dl_lockres.l_ro_holders, but see it's zero now, panic Link: http://lkml.kernel.org/r/20190529174636.22364-1-wen.gang.wang@oracle.com Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reported-by: Daniel Sobe <daniel.sobe@nxp.com> Tested-by: Daniel Sobe <daniel.sobe@nxp.com> Reviewed-by: Changwei Ge <gechangwei@live.cn> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13mm/vmscan.c: fix recent_rotated historyKirill Tkhai
Johannes pointed out that after commit 886cf1901db9 ("mm: move recent_rotated pages calculation to shrink_inactive_list()") we lost all zone_reclaim_stat::recent_rotated history. This fixes it. Link: http://lkml.kernel.org/r/155905972210.26456.11178359431724024112.stgit@localhost.localdomain Fixes: 886cf1901db9 ("mm: move recent_rotated pages calculation to shrink_inactive_list()") Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13mm/mlock.c: mlockall error for flag MCL_ONFAULTPotyra, Stefan
If mlockall() is called with only MCL_ONFAULT as flag, it removes any previously applied lockings and does nothing else. This behavior is counter-intuitive and doesn't match the Linux man page. For mlockall(): EINVAL Unknown flags were specified or MCL_ONFAULT was specified without either MCL_FUTURE or MCL_CURRENT. Consequently, return the error EINVAL, if only MCL_ONFAULT is passed. That way, applications will at least detect that they are calling mlockall() incorrectly. Link: http://lkml.kernel.org/r/20190527075333.GA6339@er01809n.ebgroup.elektrobit.com Fixes: b0f205c2a308 ("mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage") Signed-off-by: Stefan Potyra <Stefan.Potyra@elektrobit.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13scripts/decode_stacktrace.sh: prefix addr2line with $CROSS_COMPILEManuel Traut
At least for ARM64 kernels compiled with the crosstoolchain from Debian/stretch or with the toolchain from kernel.org the line number is not decoded correctly by 'decode_stacktrace.sh': $ echo "[ 136.513051] f1+0x0/0xc [kcrash]" | \ CROSS_COMPILE=/opt/gcc-8.1.0-nolibc/aarch64-linux/bin/aarch64-linux- \ ./scripts/decode_stacktrace.sh /scratch/linux-arm64/vmlinux \ /scratch/linux-arm64 \ /nfs/debian/lib/modules/4.20.0-devel [ 136.513051] f1 (/linux/drivers/staging/kcrash/kcrash.c:68) kcrash If addr2line from the toolchain is used the decoded line number is correct: [ 136.513051] f1 (/linux/drivers/staging/kcrash/kcrash.c:57) kcrash Link: http://lkml.kernel.org/r/20190527083425.3763-1-manut@linutronix.de Signed-off-by: Manuel Traut <manut@linutronix.de> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13mm/list_lru.c: fix memory leak in __memcg_init_list_lru_nodeShakeel Butt
Syzbot reported following memory leak: ffffffffda RBX: 0000000000000003 RCX: 0000000000441f79 BUG: memory leak unreferenced object 0xffff888114f26040 (size 32): comm "syz-executor626", pid 7056, jiffies 4294948701 (age 39.410s) hex dump (first 32 bytes): 40 60 f2 14 81 88 ff ff 40 60 f2 14 81 88 ff ff @`......@`...... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: slab_post_alloc_hook mm/slab.h:439 [inline] slab_alloc mm/slab.c:3326 [inline] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553 kmalloc include/linux/slab.h:547 [inline] __memcg_init_list_lru_node+0x58/0xf0 mm/list_lru.c:352 memcg_init_list_lru_node mm/list_lru.c:375 [inline] memcg_init_list_lru mm/list_lru.c:459 [inline] __list_lru_init+0x193/0x2a0 mm/list_lru.c:626 alloc_super+0x2e0/0x310 fs/super.c:269 sget_userns+0x94/0x2a0 fs/super.c:609 sget+0x8d/0xb0 fs/super.c:660 mount_nodev+0x31/0xb0 fs/super.c:1387 fuse_mount+0x2d/0x40 fs/fuse/inode.c:1236 legacy_get_tree+0x27/0x80 fs/fs_context.c:661 vfs_get_tree+0x2e/0x120 fs/super.c:1476 do_new_mount fs/namespace.c:2790 [inline] do_mount+0x932/0xc50 fs/namespace.c:3110 ksys_mount+0xab/0x120 fs/namespace.c:3319 __do_sys_mount fs/namespace.c:3333 [inline] __se_sys_mount fs/namespace.c:3330 [inline] __x64_sys_mount+0x26/0x30 fs/namespace.c:3330 do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This is a simple off by one bug on the error path. Link: http://lkml.kernel.org/r/20190528043202.99980-1-shakeelb@google.com Fixes: 60d3fd32a7a9 ("list_lru: introduce per-memcg lists") Reported-by: syzbot+f90a420dfe2b1b03cb2c@syzkaller.appspotmail.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13mm: memcontrol: don't batch updates of local VM stats and eventsJohannes Weiner
The kernel test robot noticed a 26% will-it-scale pagefault regression from commit 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty"). This appears to be caused by bouncing the additional cachelines from the new hierarchical statistics counters. We can fix this by getting rid of the batched local counters instead. Originally, there were *only* group-local counters, and they were fully maintained per cpu. A reader of a stats file high up in the cgroup tree would have to walk the entire subtree and collect each level's per-cpu counters to get the recursive view. This was prohibitively expensive, and so we switched to per-cpu batched updates of the local counters during a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting"), reducing the complexity from nr_subgroups * nr_cpus to nr_subgroups. With growing machines and cgroup trees, the tree walk itself became too expensive for monitoring top-level groups, and this is when the culprit patch added hierarchy counters on each cgroup level. When the per-cpu batch size would be reached, both the local and the hierarchy counters would get batch-updated from the per-cpu delta simultaneously. This makes local and hierarchical counter reads blazingly fast, but it unfortunately makes the write-side too cache line intense. Since local counter reads were never a problem - we only centralized them to accelerate the hierarchy walk - and use of the local counters are becoming rarer due to replacement with hierarchical views (ongoing rework in the page reclaim and workingset code), we can make those local counters unbatched per-cpu counters again. The scheme will then be as such: when a memcg statistic changes, the writer will: - update the local counter (per-cpu) - update the batch counter (per-cpu). If the batch is full: - spill the batch into the group's atomic_t - spill the batch into all ancestors' atomic_ts - empty out the batch counter (per-cpu) when a local memcg counter is read, the reader will: - collect the local counter from all cpus when a hiearchy memcg counter is read, the reader will: - read the atomic_t We might be able to simplify this further and make the recursive counters unbatched per-cpu counters as well (batch upward propagation, but leave per-cpu collection to the readers), but that will require a more in-depth analysis and testing of all the callsites. Deal with the immediate regression for now. Link: http://lkml.kernel.org/r/20190521151647.GB2870@cmpxchg.org Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: kernel test robot <rong.a.chen@intel.com> Tested-by: kernel test robot <rong.a.chen@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13rcu: Don't return a value from rcu_assign_pointer()Andrea Parri
Quoting Paul [1]: "Given that a quick (and perhaps error-prone) search of the uses of rcu_assign_pointer() in v5.1 didn't find a single use of the return value, let's please instead change the documentation and implementation to eliminate the return value." [1] https://lkml.kernel.org/r/20190523135013.GL28207@linux.ibm.com Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: rcu@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sasha Levin <sashal@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-06-13rcu: Force inlining of rcu_read_lock()Waiman Long
When debugging options are turned on, the rcu_read_lock() function might not be inlined. This results in lockdep's print_lock() function printing "rcu_read_lock+0x0/0x70" instead of rcu_read_lock()'s caller. For example: [ 10.579995] ============================= [ 10.584033] WARNING: suspicious RCU usage [ 10.588074] 4.18.0.memcg_v2+ #1 Not tainted [ 10.593162] ----------------------------- [ 10.597203] include/linux/rcupdate.h:281 Illegal context switch in RCU read-side critical section! [ 10.606220] [ 10.606220] other info that might help us debug this: [ 10.606220] [ 10.614280] [ 10.614280] rcu_scheduler_active = 2, debug_locks = 1 [ 10.620853] 3 locks held by systemd/1: [ 10.624632] #0: (____ptrval____) (&type->i_mutex_dir_key#5){.+.+}, at: lookup_slow+0x42/0x70 [ 10.633232] #1: (____ptrval____) (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x70 [ 10.640954] #2: (____ptrval____) (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x70 These "rcu_read_lock+0x0/0x70" strings are not providing any useful information. This commit therefore forces inlining of the rcu_read_lock() function so that rcu_read_lock()'s caller is instead shown. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-06-13rcu: Fix irritating whitespace error in rcu_assign_pointer()Paul E. McKenney
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-06-13rcu: Upgrade sync_exp_work_done() to smp_mb()Paul E. McKenney
The sync_exp_work_done() function uses smp_mb__before_atomic(), but there is no obvious atomic in the ensuing code. The ordering is absolutely required for grace periods to work correctly, so this commit upgrades the smp_mb__before_atomic() to smp_mb(). Fixes: 6fba2b3767ea ("rcu: Remove deprecated RCU debugfs tracing code") Reported-by: Andrea Parri <andrea.parri@amarulasolutions.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-06-14PCI: PM: Skip devices in D0 for suspend-to-idleRafael J. Wysocki
Commit d491f2b75237 ("PCI: PM: Avoid possible suspend-to-idle issue") attempted to avoid a problem with devices whose drivers want them to stay in D0 over suspend-to-idle and resume, but it did not go as far as it should with that. Namely, first of all, the power state of a PCI bridge with a downstream device in D0 must be D0 (based on the PCI PM spec r1.2, sec 6, table 6-1, if the bridge is not in D0, there can be no PCI transactions on its secondary bus), but that is not actively enforced during system-wide PM transitions, so use the skip_bus_pm flag introduced by commit d491f2b75237 for that. Second, the configuration of devices left in D0 (whatever the reason) during suspend-to-idle need not be changed and attempting to put them into D0 again by force is pointless, so explicitly avoid doing that. Fixes: d491f2b75237 ("PCI: PM: Avoid possible suspend-to-idle issue") Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com> Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
2019-06-13Merge branch 'bpf-ppc-div-fix'Daniel Borkmann
Naveen N. Rao says: ==================== The first patch updates DIV64 overflow tests to properly detect error conditions. The second patch fixes powerpc64 JIT to generate the proper unsigned division instruction for BPF_ALU64. ==================== Acked-by: Sandipan Das <sandipan@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-13powerpc/bpf: use unsigned division instruction for 64-bit operationsNaveen N. Rao
BPF_ALU64 div/mod operations are currently using signed division, unlike BPF_ALU32 operations. Fix the same. DIV64 and MOD64 overflow tests pass with this fix. Fixes: 156d0e290e969c ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF") Cc: stable@vger.kernel.org # v4.8+ Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-13bpf: fix div64 overflow tests to properly detect errorsNaveen N. Rao
If the result of the division is LLONG_MIN, current tests do not detect the error since the return value is truncated to a 32-bit value and ends up being 0. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>