summaryrefslogtreecommitdiff
path: root/lib
AgeCommit message (Collapse)Author
2017-09-04Merge branch 'x86-asm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 asm updates from Ingo Molnar: - Introduce the ORC unwinder, which can be enabled via CONFIG_ORC_UNWINDER=y. The ORC unwinder is a lightweight, Linux kernel specific debuginfo implementation, which aims to be DWARF done right for unwinding. Objtool is used to generate the ORC unwinder tables during build, so the data format is flexible and kernel internal: there's no dependency on debuginfo created by an external toolchain. The ORC unwinder is almost two orders of magnitude faster than the (out of tree) DWARF unwinder - which is important for perf call graph profiling. It is also significantly simpler and is coded defensively: there has not been a single ORC related kernel crash so far, even with early versions. (knock on wood!) But the main advantage is that enabling the ORC unwinder allows CONFIG_FRAME_POINTERS to be turned off - which speeds up the kernel measurably: With frame pointers disabled, GCC does not have to add frame pointer instrumentation code to every function in the kernel. The kernel's .text size decreases by about 3.2%, resulting in better cache utilization and fewer instructions executed, resulting in a broad kernel-wide speedup. Average speedup of system calls should be roughly in the 1-3% range - measurements by Mel Gorman [1] have shown a speedup of 5-10% for some function execution intense workloads. The main cost of the unwinder is that the unwinder data has to be stored in RAM: the memory cost is 2-4MB of RAM, depending on kernel config - which is a modest cost on modern x86 systems. Given how young the ORC unwinder code is it's not enabled by default - but given the performance advantages the plan is to eventually make it the default unwinder on x86. See Documentation/x86/orc-unwinder.txt for more details. - Remove lguest support: its intended role was that of a temporary proof of concept for virtualization, plus its removal will enable the reduction (removal) of the paravirt API as well, so Rusty agreed to its removal. (Juergen Gross) - Clean up and fix FSGS related functionality (Andy Lutomirski) - Clean up IO access APIs (Andy Shevchenko) - Enhance the symbol namespace (Jiri Slaby) * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits) objtool: Handle GCC stack pointer adjustment bug x86/entry/64: Use ENTRY() instead of ALIGN+GLOBAL for stub32_clone() x86/fpu/math-emu: Add ENDPROC to functions x86/boot/64: Extract efi_pe_entry() from startup_64() x86/boot/32: Extract efi_pe_entry() from startup_32() x86/lguest: Remove lguest support x86/paravirt/xen: Remove xen_patch() objtool: Fix objtool fallthrough detection with function padding x86/xen/64: Fix the reported SS and CS in SYSCALL objtool: Track DRAP separately from callee-saved registers objtool: Fix validate_branch() return codes x86: Clarify/fix no-op barriers for text_poke_bp() x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs selftests/x86/fsgsbase: Test selectors 1, 2, and 3 x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common x86/asm: Fix UNWIND_HINT_REGS macro for older binutils x86/asm/32: Fix regs_get_register() on segment registers x86/xen/64: Rearrange the SYSCALL entries x86/asm/32: Remove a bunch of '& 0xffff' from pt_regs segment reads ...
2017-09-04Merge branch 'core-debugobjects-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull debugobjects fix from Ingo Molnar: "A single commit making debugobjects interact better with kmemleak" * 'core-debugobjects-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: debugobjects: Make kmemleak ignore debug objects
2017-09-03Merge branch 'docs-next' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "After a fair amount of churn in the last couple of cycles, docs are taking it easier this time around. Lots of fixes and some new documentation, but nothing all that radical. Perhaps the most interesting change for many is the scripts/sphinx-pre-install tool from Mauro; it will tell you exactly which packages you need to install to get a working docs toolchain on your system. There are two little patches reaching outside of Documentation/; both just tweak kerneldoc comments to eliminate warnings and fix some dangling doc pointers" * 'docs-next' of git://git.lwn.net/linux: (52 commits) Documentation/sphinx: fix kernel-doc decode for non-utf-8 locale genalloc: Fix an incorrect kerneldoc comment doc: Add documentation for the genalloc subsystem assoc_array: fix path to assoc_array documentation kernel-doc parser mishandles declarations split into lines docs: ReSTify table of contents in core.rst docs: process: drop git snapshots from applying-patches.rst Documentation:input: fix typo swap: Remove obsolete sentence sphinx.rst: Allow Sphinx version 1.6 at the docs docs-rst: fix verbatim font size on tables Documentation: stable-kernel-rules: fix broken git urls rtmutex: update rt-mutex rtmutex: update rt-mutex-design docs: fix minimal sphinx version in conf.py docs: fix nested numbering in the TOC NVMEM documentation fix: A minor typo docs-rst: pdf: use same vertical margin on all Sphinx versions doc: Makefile: if sphinx is not found, run a check script docs: Fix paths in security/keys ...
2017-09-03Merge tag 'drm-for-v4.14' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds
Pull drm updates from Dave Airlie: "This is the main drm pull request for 4.14 merge window. I'm sending this early, as my continuing journey into fatherhood is occurring really soon now, I'm going to be mostly useless for the next couple of weeks, though I may be able to read email, I doubt I'll be doing much patch applications or git sending. If anything urgent pops up I've asked Daniel/Jani/Alex/Sean to try and direct stuff towards you. Outside drm changes: Some rcar-du updates that touch the V4L tree, all acks should be in place. It adds one export to the radix tree code for new i915 use case. There are some minor AGP cleanups (don't see that too often). Changes to the vbox driver in staging to avoid breaking compilation. Summary: core: - Atomic helper fixes - Atomic UAPI fixes - Add YCBCR 4:2:0 support - Drop set_busid hook - Refactor fb_helper locking - Remove a bunch of internal APIs - Add a bunch of better default handlers - Format modifier/blob plane property added - More internal header refactoring - Make more internal API names consistent - Enhanced syncobj APIs (wait/signal/reset/create signalled) bridge: - Add Synopsys Designware MIPI DSI host bridge driver tiny: - Add Pervasive Displays RePaper displays - Add support for LEGO MINDSTORMS EV3 LCD i915: - Lots of GEN10/CNL support patches - drm syncobj support - Skylake+ watermark refactoring - GVT vGPU 48-bit ppgtt support - GVT performance improvements - NOA change ioctl - CCS (color compression) scanout support - GPU reset improvements amdgpu: - Initial hugepage support - BO migration logic rework - Vega10 improvements - Powerplay fixes - Stop reprogramming the MC - Fixes for ACP audio on stoney - SR-IOV fixes/improvements - Command submission overhead improvements amdkfd: - Non-dGPU upstreaming patches - Scratch VA ioctl - Image tiling modes - Update PM4 headers for new firmware - Drop all BUG_ONs. nouveau: - GP108 modesetting support. - Disable MSI on big endian. vmwgfx: - Add fence fd support. msm: - Runtime PM improvements exynos: - NV12MT support - Refactor KMS drivers imx-drm: - Lock scanout channel to improve memory bw - Cleanups etnaviv: - GEM object population fixes tegra: - Prep work for Tegra186 support - PRIME mmap support sunxi: - HDMI support improvements - HDMI CEC support omapdrm: - HDMI hotplug IRQ support - Big driver cleanup - OMAP5 DSI support rcar-du: - vblank fixes - VSP1 updates arcgpu: - Minor fixes stm: - Add STM32 DSI controller driver dw_hdmi: - Add support for Rockchip RK3399 - HDMI CEC support atmel-hlcdc: - Add 8-bit color support vc4: - Atomic fixes - New ioctl to attach a label to a buffer object - HDMI CEC support - Allow userspace to dictate rendering order on submit ioctl" * tag 'drm-for-v4.14' of git://people.freedesktop.org/~airlied/linux: (1074 commits) drm/syncobj: Add a signal ioctl (v3) drm/syncobj: Add a reset ioctl (v3) drm/syncobj: Add a syncobj_array_find helper drm/syncobj: Allow wait for submit and signal behavior (v5) drm/syncobj: Add a CREATE_SIGNALED flag drm/syncobj: Add a callback mechanism for replace_fence (v3) drm/syncobj: add sync obj wait interface. (v8) i915: Use drm_syncobj_fence_get drm/syncobj: Add a race-free drm_syncobj_fence_get helper (v2) drm/syncobj: Rename fence_get to find_fence drm: kirin: Add mode_valid logic to avoid mode clocks we can't generate drm/vmwgfx: Bump the version for fence FD support drm/vmwgfx: Add export fence to file descriptor support drm/vmwgfx: Add support for imported Fence File Descriptor drm/vmwgfx: Prepare to support fence fd drm/vmwgfx: Fix incorrect command header offset at restart drm/vmwgfx: Support the NOP_ERROR command drm/vmwgfx: Restart command buffers after errors drm/vmwgfx: Move irq bottom half processing to threads drm/vmwgfx: Don't use drm_irq_[un]install ...
2017-09-01Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "This fixes the following issues: - Regression in chacha20 handling of chunked input - Crash in algif_skcipher when used with async io - Potential bogus pointer dereference in lib/mpi" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: algif_skcipher - only call put_page on referenced and used pages crypto: testmgr - add chunked test cases for chacha20 crypto: chacha20 - fix handling of chunked input lib/mpi: kunmap after finishing accessing buffer
2017-08-30assoc_array: fix path to assoc_array documentationAlexander Kuleshov
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2017-08-22lib/mpi: kunmap after finishing accessing bufferStephan Mueller
Using sg_miter_start and sg_miter_next, the buffer of an SG is kmap'ed to *buff. The current code calls sg_miter_stop (and thus kunmap) on the SG entry before the last access of *buff. The patch moves the sg_miter_stop call after the last access to *buff to ensure that the memory pointed to by *buff is still mapped. Fixes: 4816c9406430 ("lib/mpi: Fix SG miter leak") Cc: <stable@vger.kernel.org> Signed-off-by: Stephan Mueller <smueller@chronox.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-08-22Merge tag 'drm-intel-next-2017-08-18' of ↵Dave Airlie
git://anongit.freedesktop.org/git/drm-intel into drm-next Final pile of features for 4.14 - New ioctl to change NOA configurations, plus prep (Lionel) - CCS (color compression) scanout support, based on the fancy new modifier additions (Ville&Ben) - Document i915 register macro style (Jani) - Many more gen10/cnl patches (Rodrigo, Pualo, ...) - More gpu reset vs. modeset duct-tape to restore the old way. - prep work for cnl: hpd_pin reorg (Rodrigo), support for more power wells (Imre), i2c pin reorg (Anusha) - drm_syncobj support (Jason Ekstrand) - forcewake vs gpu reset fix (Chris) - execbuf speedup for the no-relocs fastpath, anv/vk low-overhead ftw (Chris) - switch to idr/radixtree instead of the resizing ht for execbuf id->vma lookups (Chris) gvt: - MMIO save/restore optimization (Changbin) - Split workload scan vs. dispatch for more parallel exec (Ping) - vGPU full 48bit ppgtt support (Joonas, Tina) - vGPU hw id expose for perf (Zhenyu) Bunch of work all over to make the igt CI runs more complete/stable. Watch https://intel-gfx-ci.01.org/tree/drm-tip/shards-all.html for progress in getting this ready. Next week we're going into production mode (i.e. will send results to intel-gfx) on hsw, more platforms to come. Also, a new maintainer tram, I'm stepping out. Huge thanks to Jani for being an awesome co-maintainer the past few years, and all the best for Jani, Joonas&Rodrigo as the new maintainers! * tag 'drm-intel-next-2017-08-18' of git://anongit.freedesktop.org/git/drm-intel: (179 commits) drm/i915: Update DRIVER_DATE to 20170818 drm/i915/bxt: use NULL for GPIO connection ID drm/i915: Mark the GT as busy before idling the previous request drm/i915: Trivial grammar fix s/opt of/opt out of/ in comment drm/i915: Replace execbuf vma ht with an idr drm/i915: Simplify eb_lookup_vmas() drm/i915: Convert execbuf to use struct-of-array packing for critical fields drm/i915: Check context status before looking up our obj/vma drm/i915: Don't use MI_STORE_DWORD_IMM on Sandybridge/vcs drm/i915: Stop touching forcewake following a gen6+ engine reset MAINTAINERS: drm/i915 has a new maintainer team drm/i915: Split pin mapping into per platform functions drm/i915/opregion: let user specify override VBT via firmware load drm/i915/cnl: Reuse skl_wm_get_hw_state on Cannonlake. drm/i915/gen10: implement gen 10 watermarks calculations drm/i915/cnl: Fix LSPCON support. drm/i915/vbt: ignore extraneous child devices for a port drm/i915/cnl: Setup PAT Index. drm/i915/edp: Allow alternate fixed mode for eDP if available. drm/i915: Add support for drm syncobjs ...
2017-08-18drm/i915: Replace execbuf vma ht with an idrChris Wilson
This was the competing idea long ago, but it was only with the rewrite of the idr as an radixtree and using the radixtree directly ourselves, along with the realisation that we can store the vma directly in the radixtree and only need a list for the reverse mapping, that made the patch performant enough to displace using a hashtable. Though the vma ht is fast and doesn't require any extra allocation (as we can embed the node inside the vma), it does require a thread for resizing and serialization and will have the occasional slow lookup. That is hairy enough to investigate alternatives and favour them if equivalent in peak performance. One advantage of allocating an indirection entry is that we can support a single shared bo between many clients, something that was done on a first-come first-serve basis for shared GGTT vma previously. To offset the extra allocations, we create yet another kmem_cache for them. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20170816085210.4199-5-chris@chris-wilson.co.uk
2017-08-18kernel/watchdog: Prevent false positives with turbo modesThomas Gleixner
The hardlockup detector on x86 uses a performance counter based on unhalted CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the performance counter period, so the hrtimer should fire 2-3 times before the performance counter NMI fires. The NMI code checks whether the hrtimer fired since the last invocation. If not, it assumess a hard lockup. The calculation of those periods is based on the nominal CPU frequency. Turbo modes increase the CPU clock frequency and therefore shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x nominal frequency) the perf/NMI period is shorter than the hrtimer period which leads to false positives. A simple fix would be to shorten the hrtimer period, but that comes with the side effect of more frequent hrtimer and softlockup thread wakeups, which is not desired. Implement a low pass filter, which checks the perf/NMI period against kernel time. If the perf/NMI fires before 4/5 of the watchdog period has elapsed then the event is ignored and postponed to the next perf/NMI. That solves the problem and avoids the overhead of shorter hrtimer periods and more frequent softlockup thread wakeups. Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector") Reported-and-tested-by: Kan Liang <Kan.liang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dzickus@redhat.com Cc: prarit@redhat.com Cc: ak@linux.intel.com Cc: babu.moger@oracle.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: acme@redhat.com Cc: stable@vger.kernel.org Cc: atomlin@redhat.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
2017-08-14debugobjects: Make kmemleak ignore debug objectsWaiman Long
The allocated debug objects are either on the free list or in the hashed bucket lists. So they won't get lost. However if both debug objects and kmemleak are enabled and kmemleak scanning is done while some of the debug objects are transitioning from one list to the others, false negative reporting of memory leaks may happen for those objects. For example, [38687.275678] kmemleak: 12 new suspected memory leaks (see /sys/kernel/debug/kmemleak) unreferenced object 0xffff92e98aabeb68 (size 40): comm "ksmtuned", pid 4344, jiffies 4298403600 (age 906.430s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 d0 bc db 92 e9 92 ff ff ................ 01 00 00 00 00 00 00 00 38 36 8a 61 e9 92 ff ff ........86.a.... backtrace: [<ffffffff8fa5378a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff8f47c019>] kmem_cache_alloc+0xe9/0x320 [<ffffffff8f62ed96>] __debug_object_init+0x3e6/0x400 [<ffffffff8f62ef01>] debug_object_activate+0x131/0x210 [<ffffffff8f330d9f>] __call_rcu+0x3f/0x400 [<ffffffff8f33117d>] call_rcu_sched+0x1d/0x20 [<ffffffff8f4a183c>] put_object+0x2c/0x40 [<ffffffff8f4a188c>] __delete_object+0x3c/0x50 [<ffffffff8f4a18bd>] delete_object_full+0x1d/0x20 [<ffffffff8fa535c2>] kmemleak_free+0x32/0x80 [<ffffffff8f47af07>] kmem_cache_free+0x77/0x350 [<ffffffff8f453912>] unlink_anon_vmas+0x82/0x1e0 [<ffffffff8f440341>] free_pgtables+0xa1/0x110 [<ffffffff8f44af91>] exit_mmap+0xc1/0x170 [<ffffffff8f29db60>] mmput+0x80/0x150 [<ffffffff8f2a7609>] do_exit+0x2a9/0xd20 The references in the debug objects may also hide a real memory leak. As there is no point in having kmemleak to track debug object allocations, kmemleak checking is now disabled for debug objects. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1502718733-8527-1-git-send-email-longman@redhat.com
2017-08-10fault-inject: fix wrong should_fail() decision in task contextAkinobu Mita
Commit 1203c8e6fb0a ("fault-inject: simplify access check for fail-nth") unintentionally broke a conditional statement in should_fail(). Any faults are not injected in the task context by the change when the systematic fault injection is not used. This change restores to the previous correct behaviour. Link: http://lkml.kernel.org/r/1501633700-3488-1-git-send-email-akinobu.mita@gmail.com Fixes: 1203c8e6fb0a ("fault-inject: simplify access check for fail-nth") Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Tested-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10test_kmod: fix small memory leak on filesystem testsDan Carpenter
The break was in the wrong place so file system tests don't work as intended, leaking memory at each test switch. [mcgrof@kernel.org: massaged commit subject, noted memory leak issue without the fix] Link: http://lkml.kernel.org/r/20170802211450.27928-6-mcgrof@kernel.org Fixes: 39258f448d71 ("kmod: add test driver to stress test the module loader") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Reported-by: David Binderman <dcb314@hotmail.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jessica Yu <jeyu@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Michal Marek <mmarek@suse.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Petr Mladek <pmladek@suse.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10test_kmod: fix the lock in register_test_dev_kmod()Dan Carpenter
We accidentally just drop the lock twice instead of taking it and then releasing it. This isn't a big issue unless you are adding more than one device to test on, and the kmod.sh doesn't do that yet, however this obviously is the correct thing to do. [mcgrof@kernel.org: massaged subject, explain what happens] Link: http://lkml.kernel.org/r/20170802211450.27928-5-mcgrof@kernel.org Fixes: 39258f448d71 ("kmod: add test driver to stress test the module loader") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Colin Ian King <colin.king@canonical.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jessica Yu <jeyu@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Michal Marek <mmarek@suse.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Petr Mladek <pmladek@suse.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10test_kmod: fix bug which allows negative values on two config optionsLuis R. Rodriguez
Parsing with kstrtol() enables values to be negative, and we failed to check for negative values when parsing with test_dev_config_update_uint_sync() or test_dev_config_update_uint_range(). test_dev_config_update_uint_range() has a minimum check though so an issue is not present there. test_dev_config_update_uint_sync() is only used for the number of threads to use (config_num_threads_store()), and indeed this would fail with an attempt for a large allocation. Although the issue is only present in practice with the first fix both by using kstrtoul() instead of kstrtol(). Link: http://lkml.kernel.org/r/20170802211450.27928-4-mcgrof@kernel.org Fixes: 39258f448d71 ("kmod: add test driver to stress test the module loader") Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jessica Yu <jeyu@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Michal Marek <mmarek@suse.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Petr Mladek <pmladek@suse.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"Colin Ian King
Trivial fix to spelling mistake in snprintf text [mcgrof@kernel.org: massaged commit message] Link: http://lkml.kernel.org/r/20170802211450.27928-3-mcgrof@kernel.org Fixes: 39258f448d71 ("kmod: add test driver to stress test the module loader") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Jessica Yu <jeyu@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michal Marek <mmarek@suse.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10Merge branch 'x86/urgent' into x86/asm, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Handle notifier registry failures properly in tun/tap driver, from Tonghao Zhang. 2) Fix bpf verifier handling of subtraction bounds and add a testcase for this, from Edward Cree. 3) Increase reset timeout in ftgmac100 driver, from Ben Herrenschmidt. 4) Fix use after free in prd_retire_rx_blk_timer_exired() in AF_PACKET, from Cong Wang. 5) Fix SElinux regression due to recent UDP optimizations, from Paolo Abeni. 6) We accidently increment IPSTATS_MIB_FRAGFAILS in the ipv6 code paths, fix from Stefano Brivio. 7) Fix some mem leaks in dccp, from Xin Long. 8) Adjust MDIO_BUS kconfig deps to avoid build errors, from Arnd Bergmann. 9) Mac address length check and buffer size fixes from Cong Wang. 10) Don't leak sockets in ipv6 udp early demux, from Paolo Abeni. 11) Fix return value when copy_from_user() fails in bpf_prog_get_info_by_fd(), from Daniel Borkmann. 12) Handle PHY_HALTED properly in phy library state machine, from Florian Fainelli. 13) Fix OOPS in fib_sync_down_dev(), from Ido Schimmel. 14) Fix truesize calculation in virtio_net which led to performance regressions, from Michael S Tsirkin. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits) samples/bpf: fix bpf tunnel cleanup udp6: fix jumbogram reception ppp: Fix a scheduling-while-atomic bug in del_chan Revert "net: bcmgenet: Remove init parameter from bcmgenet_mii_config" virtio_net: fix truesize for mergeable buffers mv643xx_eth: fix of_irq_to_resource() error check MAINTAINERS: Add more files to the PHY LIBRARY section ipv4: fib: Fix NULL pointer deref during fib_sync_down_dev() net: phy: Correctly process PHY_HALTED in phy_stop_machine() sunhme: fix up GREG_STAT and GREG_IMASK register offsets bpf: fix bpf_prog_get_info_by_fd to dump correct xlated_prog_len tcp: avoid bogus gcc-7 array-bounds warning net: tc35815: fix spelling mistake: "Intterrupt" -> "Interrupt" bpf: don't indicate success when copy_from_user fails udp6: fix socket leak on early demux net: thunderx: Fix BGX transmit stall due to underflow Revert "vhost: cache used event for better performance" team: use a larger struct for mac address net: check dev->addr_len for dev_set_mac_address() phy: bcm-ns-usb3: fix MDIO_BUS dependency ...
2017-07-26x86/kconfig: Make it easier to switch to the new ORC unwinderJosh Poimboeuf
A couple of Kconfig changes which make it much easier to switch to the new CONFIG_ORC_UNWINDER: 1) Remove x86 dependencies on CONFIG_FRAME_POINTER for lockdep, latencytop, and fault injection. x86 has a 'guess' unwinder which just scans the stack for kernel text addresses. It's not 100% accurate but in many cases it's good enough. This allows those users who don't want the text overhead of the frame pointer or ORC unwinders to still use these features. More importantly, this also makes it much more straightforward to disable frame pointers. 2) Make CONFIG_ORC_UNWINDER depend on !CONFIG_FRAME_POINTER. While it would be possible to have both enabled, it doesn't really make sense to do so. So enforce a sane configuration to prevent the user from making a dumb mistake. With these changes, when you disable CONFIG_FRAME_POINTER, "make oldconfig" will ask if you want to enable CONFIG_ORC_UNWINDER. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/9985fb91ce5005fe33ea5cc2a20f14bd33c61d03.1500938583.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-26x86/unwind: Add the ORC unwinderJosh Poimboeuf
Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y. It plugs into the existing x86 unwinder framework. It relies on objtool to generate the needed .orc_unwind and .orc_unwind_ip sections. For more details on why ORC is used instead of DWARF, see Documentation/x86/orc-unwinder.txt - but the short version is that it's a simplified, fundamentally more robust debugninfo data structure, which also allows up to two orders of magnitude faster lookups than the DWARF unwinder - which matters to profiling workloads like perf. Thanks to Andy Lutomirski for the performance improvement ideas: splitting the ORC unwind table into two parallel arrays and creating a fast lookup table to search a subset of the unwind table. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/0a6cbfb40f8da99b7a45a1a8302dc6aef16ec812.1500938583.git.jpoimboe@redhat.com [ Extended the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-25lib: test_rhashtable: Fix KASAN warningPhil Sutter
I forgot one spot when introducing struct test_obj_val. Fixes: e859afe1ee0c5 ("lib: test_rhashtable: fix for large entry counts") Reported by: kernel test robot <fengguang.wu@intel.com> Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24lib: test_rhashtable: fix for large entry countsPhil Sutter
During concurrent access testing, threadfunc() concatenated thread ID and object index to create a unique key like so: | tdata->objs[i].value = (tdata->id << 16) | i; This breaks if a user passes an entries parameter of 64k or higher, since 'i' might use more than 16 bits then. Effectively, this will lead to duplicate keys in the table. Fix the problem by introducing a struct holding object and thread ID and using that as key instead of a single integer type field. Fixes: f4a3e90ba5739 ("rhashtable-test: extend to test concurrency") Reported by: Manuel Messner <mm@skelett.io> Signed-off-by: Phil Sutter <phil@nwl.cc> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-21uuid: fix incorrect uuid_equal conversion in test_uuid_testChristoph Hellwig
Fixes: df33767d ("uuid: hoist helpers uuid_equal() and uuid_copy() from xfs") Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-07-15Merge tag 'random_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random Pull random updates from Ted Ts'o: "Add wait_for_random_bytes() and get_random_*_wait() functions so that callers can more safely get random bytes if they can block until the CRNG is initialized. Also print a warning if get_random_*() is called before the CRNG is initialized. By default, only one single-line warning will be printed per boot. If CONFIG_WARN_ALL_UNSEEDED_RANDOM is defined, then a warning will be printed for each function which tries to get random bytes before the CRNG is initialized. This can get spammy for certain architecture types, so it is not enabled by default" * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: random: reorder READ_ONCE() in get_random_uXX random: suppress spammy warnings about unseeded randomness random: warn when kernel uses unseeded randomness net/route: use get_random_int for random counter net/neighbor: use get_random_u32 for 32-bit hash random rhashtable: use get_random_u32 for hash_rnd ceph: ensure RNG is seeded before using iscsi: ensure RNG is seeded before use cifs: use get_random_u32 for 32-bit lock random random: add get_random_{bytes,u32,u64,int,long,once}_wait family random: add wait_for_random_bytes() API
2017-07-15random: suppress spammy warnings about unseeded randomnessTheodore Ts'o
Unfortunately, on some models of some architectures getting a fully seeded CRNG is extremely difficult, and so this can result in dmesg getting spammed for a surprisingly long time. This is really bad from a security perspective, and so architecture maintainers really need to do what they can to get the CRNG seeded sooner after the system is booted. However, users can't do anything actionble to address this, and spamming the kernel messages log will only just annoy people. For developers who want to work on improving this situation, CONFIG_WARN_UNSEEDED_RANDOM has been renamed to CONFIG_WARN_ALL_UNSEEDED_RANDOM. By default the kernel will always print the first use of unseeded randomness. This way, hopefully the security obsessed will be happy that there is _some_ indication when the kernel boots there may be a potential issue with that architecture or subarchitecture. To see all uses of unseeded randomness, developers can enable CONFIG_WARN_ALL_UNSEEDED_RANDOM. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-07-14kmod: add test driver to stress test the module loaderLuis R. Rodriguez
This adds a new stress test driver for kmod: the kernel module loader. The new stress test driver, test_kmod, is only enabled as a module right now. It should be possible to load this as built-in and load tests early (refer to the force_init_test module parameter), however since a lot of test can get a system out of memory fast we leave this disabled for now. Using a system with 1024 MiB of RAM can *easily* get your kernel OOM fast with this test driver. The test_kmod driver exposes API knobs for us to fine tune simple request_module() and get_fs_type() calls. Since these API calls only allow each one parameter a test driver for these is rather simple. Other factors that can help out test driver though are the number of calls we issue and knowing current limitations of each. This exposes configuration as much as possible through userspace to be able to build tests directly from userspace. Since it allows multiple misc devices its will eventually (once we add a knob to let us create new devices at will) also be possible to perform more tests in parallel, provided you have enough memory. We only enable tests we know work as of right now. Demo screenshots: # tools/testing/selftests/kmod/kmod.sh kmod_test_0001_driver: OK! - loading kmod test kmod_test_0001_driver: OK! - Return value: 256 (MODULE_NOT_FOUND), expected MODULE_NOT_FOUND kmod_test_0001_fs: OK! - loading kmod test kmod_test_0001_fs: OK! - Return value: -22 (-EINVAL), expected -EINVAL kmod_test_0002_driver: OK! - loading kmod test kmod_test_0002_driver: OK! - Return value: 256 (MODULE_NOT_FOUND), expected MODULE_NOT_FOUND kmod_test_0002_fs: OK! - loading kmod test kmod_test_0002_fs: OK! - Return value: -22 (-EINVAL), expected -EINVAL kmod_test_0003: OK! - loading kmod test kmod_test_0003: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0004: OK! - loading kmod test kmod_test_0004: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0005: OK! - loading kmod test kmod_test_0005: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0006: OK! - loading kmod test kmod_test_0006: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0005: OK! - loading kmod test kmod_test_0005: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0006: OK! - loading kmod test kmod_test_0006: OK! - Return value: 0 (SUCCESS), expected SUCCESS XXX: add test restult for 0007 Test completed You can also request for specific tests: # tools/testing/selftests/kmod/kmod.sh -t 0001 kmod_test_0001_driver: OK! - loading kmod test kmod_test_0001_driver: OK! - Return value: 256 (MODULE_NOT_FOUND), expected MODULE_NOT_FOUND kmod_test_0001_fs: OK! - loading kmod test kmod_test_0001_fs: OK! - Return value: -22 (-EINVAL), expected -EINVAL Test completed Lastly, the current available number of tests: # tools/testing/selftests/kmod/kmod.sh --help Usage: tools/testing/selftests/kmod/kmod.sh [ -t <4-number-digit> ] Valid tests: 0001-0009 0001 - Simple test - 1 thread for empty string 0002 - Simple test - 1 thread for modules/filesystems that do not exist 0003 - Simple test - 1 thread for get_fs_type() only 0004 - Simple test - 2 threads for get_fs_type() only 0005 - multithreaded tests with default setup - request_module() only 0006 - multithreaded tests with default setup - get_fs_type() only 0007 - multithreaded tests with default setup test request_module() and get_fs_type() 0008 - multithreaded - push kmod_concurrent over max_modprobes for request_module() 0009 - multithreaded - push kmod_concurrent over max_modprobes for get_fs_type() The following test cases currently fail, as such they are not currently enabled by default: # tools/testing/selftests/kmod/kmod.sh -t 0008 # tools/testing/selftests/kmod/kmod.sh -t 0009 To be sure to run them as intended please unload both of the modules: o test_module o xfs And ensure they are not loaded on your system prior to testing them. If you use these paritions for your rootfs you can change the default test driver used for get_fs_type() by exporting it into your environment. For example of other test defaults you can override refer to kmod.sh allow_user_defaults(). Behind the scenes this is how we fine tune at a test case prior to hitting a trigger to run it: cat /sys/devices/virtual/misc/test_kmod0/config echo -n "2" > /sys/devices/virtual/misc/test_kmod0/config_test_case echo -n "ext4" > /sys/devices/virtual/misc/test_kmod0/config_test_fs echo -n "80" > /sys/devices/virtual/misc/test_kmod0/config_num_threads cat /sys/devices/virtual/misc/test_kmod0/config echo -n "1" > /sys/devices/virtual/misc/test_kmod0/config_num_threads Finally to trigger: echo -n "1" > /sys/devices/virtual/misc/test_kmod0/trigger_config The kmod.sh script uses the above constructs to build different test cases. A bit of interpretation of the current failures follows, first two premises: a) When request_module() is used userspace figures out an optimized version of module order for us. Once it finds the modules it needs, as per depmod symbol dep map, it will finit_module() the respective modules which are needed for the original request_module() request. b) We have an optimization in place whereby if a kernel uses request_module() on a module already loaded we never bother userspace as the module already is loaded. This is all handled by kernel/kmod.c. A few things to consider to help identify root causes of issues: 0) kmod 19 has a broken heuristic for modules being assumed to be built-in to your kernel and will return 0 even though request_module() failed. Upgrade to a newer version of kmod. 1) A get_fs_type() call for "xfs" will request_module() for "fs-xfs", not for "xfs". The optimization in kernel described in b) fails to catch if we have a lot of consecutive get_fs_type() calls. The reason is the optimization in place does not look for aliases. This means two consecutive get_fs_type() calls will bump kmod_concurrent, whereas request_module() will not. This one explanation why test case 0009 fails at least once for get_fs_type(). 2) If a module fails to load --- for whatever reason (kmod_concurrent limit reached, file not yet present due to rootfs switch, out of memory) we have a period of time during which module request for the same name either with request_module() or get_fs_type() will *also* fail to load even if the file for the module is ready. This explains why *multiple* NULLs are possible on test 0009. 3) finit_module() consumes quite a bit of memory. 4) Filesystems typically also have more dependent modules than other modules, its important to note though that even though a get_fs_type() call does not incur additional kmod_concurrent bumps, since userspace loads dependencies it finds it needs via finit_module_fd(), it *will* take much more memory to load a module with a lot of dependencies. Because of 3) and 4) we will easily run into out of memory failures with certain tests. For instance test 0006 fails on qemu with 1024 MiB of RAM. It panics a box after reaping all userspace processes and still not having enough memory to reap. [arnd@arndb.de: add dependencies for test module] Link: http://lkml.kernel.org/r/20170630154834.3689272-1-arnd@arndb.de Link: http://lkml.kernel.org/r/20170628223155.26472-3-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Jessica Yu <jeyu@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michal Marek <mmarek@suse.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14fault-inject: simplify access check for fail-nthAkinobu Mita
The fail-nth file is created with 0666 and the access is permitted if and only if the task is current. This file is owned by the currnet user. So we can create it with 0644 and allow the owner to write it. This enables to watch the status of task->fail_nth from another processes. [akinobu.mita@gmail.com: don't convert unsigned type value as signed int] Link: http://lkml.kernel.org/r/1492444483-9239-1-git-send-email-akinobu.mita@gmail.com [akinobu.mita@gmail.com: avoid unwanted data race to task->fail_nth] Link: http://lkml.kernel.org/r/1499962492-8931-1-git-send-email-akinobu.mita@gmail.com Link: http://lkml.kernel.org/r/1491490561-10485-5-git-send-email-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14lib/atomic64_test.c: add a test that atomic64_inc_not_zero() returns an intMichael Ellerman
atomic64_inc_not_zero() returns a "truth value" which in C is traditionally an int. That means callers are likely to expect the result will fit in an int. If an implementation returns a "true" value which does not fit in an int, then there's a possibility that callers will truncate it when they store it in an int. In fact this happened in practice, see commit 966d2b04e070 ("percpu-refcount: fix reference leak during percpu-atomic transition"). So add a test that the result fits in an int, even when the input doesn't. This catches the case where an implementation just passes the non-zero input value out as the result. Link: http://lkml.kernel.org/r/1499775133-1231-1-git-send-email-mpe@ellerman.id.au Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Douglas Miller <dougmill@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12writeback: rework wb_[dec|inc]_stat family of functionsNikolay Borisov
Currently the writeback statistics code uses a percpu counters to hold various statistics. Furthermore we have 2 families of functions - those which disable local irq and those which doesn't and whose names begin with double underscore. However, they both end up calling __add_wb_stats which in turn calls percpu_counter_add_batch which is already irq-safe. Exploiting this fact allows to eliminated the __wb_* functions since they don't add any further protection than we already have. Furthermore, refactor the wb_* function to call __add_wb_stat directly without the irq-disabling dance. This will likely result in better runtime of code which deals with modifying the stat counters. While at it also document why percpu_counter_add_batch is in fact preempt and irq-safe since at least 3 people got confused. Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Josef Bacik <jbacik@fb.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jeff Layton <jlayton@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12include/linux/string.h: add the option of fortified string.h functionsDaniel Micay
This adds support for compiling with a rough equivalent to the glibc _FORTIFY_SOURCE=1 feature, providing compile-time and runtime buffer overflow checks for string.h functions when the compiler determines the size of the source or destination buffer at compile-time. Unlike glibc, it covers buffer reads in addition to writes. GNU C __builtin_*_chk intrinsics are avoided because they would force a much more complex implementation. They aren't designed to detect read overflows and offer no real benefit when using an implementation based on inline checks. Inline checks don't add up to much code size and allow full use of the regular string intrinsics while avoiding the need for a bunch of _chk functions and per-arch assembly to avoid wrapper overhead. This detects various overflows at compile-time in various drivers and some non-x86 core kernel code. There will likely be issues caught in regular use at runtime too. Future improvements left out of initial implementation for simplicity, as it's all quite optional and can be done incrementally: * Some of the fortified string functions (strncpy, strcat), don't yet place a limit on reads from the source based on __builtin_object_size of the source buffer. * Extending coverage to more string functions like strlcat. * It should be possible to optionally use __builtin_object_size(x, 1) for some functions (C strings) to detect intra-object overflows (like glibc's _FORTIFY_SOURCE=2), but for now this takes the conservative approach to avoid likely compatibility issues. * The compile-time checks should be made available via a separate config option which can be enabled by default (or always enabled) once enough time has passed to get the issues it catches fixed. Kees said: "This is great to have. While it was out-of-tree code, it would have blocked at least CVE-2016-3858 from being exploitable (improper size argument to strlcpy()). I've sent a number of fixes for out-of-bounds-reads that this detected upstream already" [arnd@arndb.de: x86: fix fortified memcpy] Link: http://lkml.kernel.org/r/20170627150047.660360-1-arnd@arndb.de [keescook@chromium.org: avoid panic() in favor of BUG()] Link: http://lkml.kernel.org/r/20170626235122.GA25261@beast [keescook@chromium.org: move from -mm, add ARCH_HAS_FORTIFY_SOURCE, tweak Kconfig help] Link: http://lkml.kernel.org/r/20170526095404.20439-1-danielmicay@gmail.com Link: http://lkml.kernel.org/r/1497903987-21002-8-git-send-email-keescook@chromium.org Signed-off-by: Daniel Micay <danielmicay@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Daniel Axtens <dja@axtens.net> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12kernel/watchdog: split up config optionsNicholas Piggin
Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR. LOCKUP_DETECTOR implies the general boot, sysctl, and programming interfaces for the lockup detectors. An architecture that wants to use a hard lockup detector must define HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH. Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and does not implement the LOCKUP_DETECTOR interfaces. sparc is unusual in that it has started to implement some of the interfaces, but not fully yet. It should probably be converted to a full HAVE_HARDLOCKUP_DETECTOR_ARCH. [npiggin@gmail.com: fix] Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Don Zickus <dzickus@redhat.com> Reviewed-by: Babu Moger <babu.moger@oracle.com> Tested-by: Babu Moger <babu.moger@oracle.com> [sparc] Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12fault-inject: support systematic fault injectionDmitry Vyukov
Add /proc/self/task/<current-tid>/fail-nth file that allows failing 0-th, 1-st, 2-nd and so on calls systematically. Excerpt from the added documentation: "Write to this file of integer N makes N-th call in the current task fail (N is 0-based). Read from this file returns a single char 'Y' or 'N' that says if the fault setup with a previous write to this file was injected or not, and disables the fault if it wasn't yet injected. Note that this file enables all types of faults (slab, futex, etc). This setting takes precedence over all other generic settings like probability, interval, times, etc. But per-capability settings (e.g. fail_futex/ignore-private) take precedence over it. This feature is intended for systematic testing of faults in a single system call. See an example below" Why add a new setting: 1. Existing settings are global rather than per-task. So parallel testing is not possible. 2. attr->interval is close but it depends on attr->count which is non reset to 0, so interval does not work as expected. 3. Trying to model this with existing settings requires manipulations of all of probability, interval, times, space, task-filter and unexposed count and per-task make-it-fail files. 4. Existing settings are per-failure-type, and the set of failure types is potentially expanding. 5. make-it-fail can't be changed by unprivileged user and aggressive stress testing better be done from an unprivileged user. Similarly, this would require opening the debugfs files to the unprivileged user, as he would need to reopen at least times file (not possible to pre-open before dropping privs). The proposed interface solves all of the above (see the example). We want to integrate this into syzkaller fuzzer. A prototype has found 10 bugs in kernel in first day of usage: https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance I've made the current interface work with all types of our sandboxes. For setuid the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to make /proc entries non-root owned. So I am fine with the current version of the code. [akpm@linux-foundation.org: fix build] Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12test_sysctl: test against int proc_dointvec() array supportLuis R. Rodriguez
Add a few initial respective tests for an array: o Echoing values separated by spaces works o Echoing only first elements will set first elements o Confirm PAGE_SIZE limit still applies even if an array is used Link: http://lkml.kernel.org/r/20170630224431.17374-7-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12test_sysctl: add simple proc_douintvec() caseLuis R. Rodriguez
Test against a simple proc_douintvec() case. While at it, add a test against UINT_MAX. Make sure UINT_MAX works, and UINT_MAX+1 will fail and that negative values are not accepted. Link: http://lkml.kernel.org/r/20170630224431.17374-6-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12test_sysctl: add simple proc_dointvec() caseLuis R. Rodriguez
Test against a simple proc_dointvec() case. While at it, add a test against INT_MAX. Make sure INT_MAX works, and INT_MAX+1 will fail. Also test negative values work. Link: http://lkml.kernel.org/r/20170630224431.17374-5-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12test_sysctl: add dedicated proc sysctl test driverLuis R. Rodriguez
The existing tools/testing/selftests/sysctl/ tests include two test cases, but these use existing production kernel sysctl interfaces. We want to expand test coverage but we can't just be looking for random safe production values to poke at, that's just insane! Instead just dedicate a test driver for debugging purposes and port the existing scripts to use it. This will make it easier for further tests to be added. Subsequent patches will extend our test coverage for sysctl. The stress test driver uses a new license (GPL on Linux, copyleft-next outside of Linux). Linus was fine with this [0] and later due to Ted's and Alans's request ironed out an "or" language clause to use [1] which is already present upstream. [0] https://lkml.kernel.org/r/CA+55aFyhxcvD+q7tp+-yrSFDKfR0mOHgyEAe=f_94aKLsOu0Og@mail.gmail.com [1] https://lkml.kernel.org/r/1495234558.7848.122.camel@linux.intel.com Link: http://lkml.kernel.org/r/20170630224431.17374-2-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10lib/bsearch.c: micro-optimize pivot position calculationSergey Senozhatsky
There is a slightly faster way (in terms of the number of instructions being used) to calculate the position of a middle element, preserving integer overflow safeness. ./scripts/bloat-o-meter lib/bsearch.o.old lib/bsearch.o.new add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-24 (-24) function old new delta bsearch 122 98 -24 TEST INT array of size 100001, elements [0..100000]. gcc 7.1, Os, x86_64. a) bsearch() of existing key "100001 - 2": BASE ==== $ perf stat ./a.out Performance counter stats for './a.out': 619.445196 task-clock:u (msec) # 0.999 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 133 page-faults:u # 0.215 K/sec 1,949,517,279 cycles:u # 3.147 GHz (83.06%) 181,017,938 stalled-cycles-frontend:u # 9.29% frontend cycles idle (83.05%) 82,959,265 stalled-cycles-backend:u # 4.26% backend cycles idle (67.02%) 4,355,706,383 instructions:u # 2.23 insn per cycle # 0.04 stalled cycles per insn (83.54%) 1,051,539,242 branches:u # 1697.550 M/sec (83.54%) 15,263,381 branch-misses:u # 1.45% of all branches (83.43%) 0.620082548 seconds time elapsed PATCHED ======= $ perf stat ./a.out Performance counter stats for './a.out': 475.097316 task-clock:u (msec) # 0.999 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 135 page-faults:u # 0.284 K/sec 1,487,467,717 cycles:u # 3.131 GHz (82.95%) 186,537,162 stalled-cycles-frontend:u # 12.54% frontend cycles idle (82.93%) 28,797,869 stalled-cycles-backend:u # 1.94% backend cycles idle (67.10%) 3,807,564,203 instructions:u # 2.56 insn per cycle # 0.05 stalled cycles per insn (83.57%) 1,049,344,291 branches:u # 2208.693 M/sec (83.60%) 5,485 branch-misses:u # 0.00% of all branches (83.58%) 0.475760235 seconds time elapsed b) bsearch() of un-existing key "100001 + 2": BASE ==== $ perf stat ./a.out Performance counter stats for './a.out': 499.244480 task-clock:u (msec) # 0.999 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 132 page-faults:u # 0.264 K/sec 1,571,194,855 cycles:u # 3.147 GHz (83.18%) 13,450,980 stalled-cycles-frontend:u # 0.86% frontend cycles idle (83.18%) 21,256,072 stalled-cycles-backend:u # 1.35% backend cycles idle (66.78%) 4,171,197,909 instructions:u # 2.65 insn per cycle # 0.01 stalled cycles per insn (83.68%) 1,009,175,281 branches:u # 2021.405 M/sec (83.79%) 3,122 branch-misses:u # 0.00% of all branches (83.37%) 0.499871144 seconds time elapsed PATCHED ======= $ perf stat ./a.out Performance counter stats for './a.out': 399.023499 task-clock:u (msec) # 0.998 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 134 page-faults:u # 0.336 K/sec 1,245,793,991 cycles:u # 3.122 GHz (83.39%) 11,529,273 stalled-cycles-frontend:u # 0.93% frontend cycles idle (83.46%) 12,116,311 stalled-cycles-backend:u # 0.97% backend cycles idle (66.92%) 3,679,710,005 instructions:u # 2.95 insn per cycle # 0.00 stalled cycles per insn (83.47%) 1,009,792,625 branches:u # 2530.660 M/sec (83.46%) 2,590 branch-misses:u # 0.00% of all branches (83.12%) 0.399733539 seconds time elapsed Link: http://lkml.kernel.org/r/20170607150457.5905-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10lib/extable.c: use bsearch() library function in search_extable()Thomas Meyer
[thomas@m3y3r.de: v3: fix arch specific implementations] Link: http://lkml.kernel.org/r/1497890858.12931.7.camel@m3y3r.de Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10lib/rhashtable.c: use kvzalloc() in bucket_table_alloc() when possibleMichal Hocko
bucket_table_alloc() can be currently called with GFP_KERNEL or GFP_ATOMIC. For the former we basically have an open coded kvzalloc() while the later only uses kzalloc(). Let's simplify the code a bit by the dropping the open coded path and replace it with kvzalloc(). Link: http://lkml.kernel.org/r/20170531155145.17111-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10lib/interval_tree_test.c: allow full tree searchDavidlohr Bueso
... such that a user can specify visiting all the nodes in the tree (intersects with the world). This is a nice opposite from the very basic default query which is a single point. Link: http://lkml.kernel.org/r/20170518174936.20265-5-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10lib/interval_tree_test.c: allow users to limit scope of endpointDavidlohr Bueso
Add a 'max_endpoint' parameter such that users may easily limit the size of the intervals that are randomly generated. Link: http://lkml.kernel.org/r/20170518174936.20265-4-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10lib/interval_tree_test.c: make test options module parametersDavidlohr Bueso
Allows for more flexible debugging. Link: http://lkml.kernel.org/r/20170518174936.20265-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10lib/interval_tree_test.c: allow the module to be compiled-inDavidlohr Bueso
Patch series "lib/interval_tree_test: some debugging improvements". Here are some patches that update the interval_tree_test module allowing users to pass finer grained options to run the actual test. This patch (of 4): It is a tristate after all, and also serves well for quick debugging. Link: http://lkml.kernel.org/r/20170518174936.20265-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10lib/kstrtox.c: use "unsigned int" moreAlexey Dobriyan
gcc does generates stupid code sign extending data back and forth. Help by using "unsigned int". add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-61 (-61) function old new delta _parse_integer 128 123 -5 It _still_ does generate useless MOVSX but I don't know how to delete it: 0000000000000070 <_parse_integer>: ... a0: 89 c2 mov edx,eax a2: 83 e8 30 sub eax,0x30 a5: 83 f8 09 cmp eax,0x9 a8: 76 11 jbe bb <_parse_integer+0x4b> aa: 83 ca 20 or edx,0x20 ad: 0f be c2 ===> movsx eax,dl <=== useless b0: 8d 50 9f lea edx,[rax-0x61] b3: 83 fa 05 cmp edx,0x5 Patch also helps on embedded archs which generally only like "int". On arm "and 0xff" is generated which is waste because all values used in comparisons are positive. Link: http://lkml.kernel.org/r/20170514194720.GB32563@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10lib/kstrtox.c: delete end-of-string testAlexey Dobriyan
Standard "while (*s)" test is unnecessary because NUL won't pass valid-digit test anyway. Save one branch per parsed character. Link: http://lkml.kernel.org/r/20170514193756.GA32563@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10bitmap: optimise bitmap_set and bitmap_clear of a single bitMatthew Wilcox
We have eight users calling bitmap_clear for a single bit and seventeen calling bitmap_set for a single bit. Rather than fix all of them to call __clear_bit or __set_bit, turn bitmap_clear and bitmap_set into inline functions and make this special case efficient. Link: http://lkml.kernel.org/r/20170628153221.11322-3-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10lib/test_bitmap.c: add optimisation testsMatthew Wilcox
Patch series "Bitmap optimisations", v2. These three bitmap patches use more efficient specialisations when the compiler can figure out that it's safe to do so. Thanks to Rasmus's eagle eyes, a nasty bug in v1 was avoided, and I've added a test case which would have caught it. This patch (of 4): This version of the test is actually a no-op; the next patch will enable it. Link: http://lkml.kernel.org/r/20170628153221.11322-2-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-08Merge tag 'dmaengine-4.13-rc1' of git://git.infradead.org/users/vkoul/slave-dmaLinus Torvalds
Pull dmaengine updates from Vinod Koul: - removal of AVR32 support in dw driver as AVR32 is gone - new driver for Broadcom stream buffer accelerator (SBA) RAID driver - add support for Faraday Technology FTDMAC020 in amba-pl08x driver - IOMMU support in pl330 driver - updates to bunch of drivers * tag 'dmaengine-4.13-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (36 commits) dmaengine: qcom_hidma: correct API violation for submit dmaengine: zynqmp_dma: Remove max len check in zynqmp_dma_prep_memcpy dmaengine: tegra-apb: Really fix runtime-pm usage dmaengine: fsl_raid: make of_device_ids const. dmaengine: qcom_hidma: allow ACPI/DT parameters to be overridden dmaengine: fsldma: set BWC, DAHTS and SAHTS values correctly dmaengine: Kconfig: Simplify the help text for MXS_DMA dmaengine: pl330: Delete unused functions dmaengine: Replace WARN_TAINT_ONCE() with pr_warn_once() dmaengine: Kconfig: Extend the dependency for MXS_DMA dmaengine: mxs: Use %zu for printing a size_t variable dmaengine: ste_dma40: Cleanup scatterlist layering violations dmaengine: imx-dma: cleanup scatterlist layering violations dmaengine: use proper name for the R-Car SoC dmaengine: imx-sdma: Fix compilation warning. dmaengine: imx-sdma: Handle return value of clk_prepare_enable dmaengine: pl330: Add IOMMU support to slave tranfers dmaengine: DW DMAC: Handle return value of clk_prepare_enable dmaengine: pl08x: use GENMASK() to create bitmasks dmaengine: pl08x: Add support for Faraday Technology FTDMAC020 ...
2017-07-07Merge branch 'uaccess-work.iov_iter' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter hardening from Al Viro: "This is the iov_iter/uaccess/hardening pile. For one thing, it trims the inline part of copy_to_user/copy_from_user to the minimum that *does* need to be inlined - object size checks, basically. For another, it sanitizes the checks for iov_iter primitives. There are 4 groups of checks: access_ok(), might_fault(), object size and KASAN. - access_ok() had been verified by whoever had set the iov_iter up. However, that has happened in a function far away, so proving that there's no path to actual copying bypassing those checks is hard and proving that iov_iter has not been buggered in the meanwhile is also not pleasant. So we want those redone in actual copyin/copyout. - might_fault() is better off consolidated - we know whether it needs to be checked as soon as we enter iov_iter primitive and observe the iov_iter flavour. No need to wait until the copyin/copyout. The call chains are short enough to make sure we won't miss anything - in fact, it's more robust that way, since there are cases where we do e.g. forced fault-in before getting to copyin/copyout. It's not quite what we need to check (in particular, combination of iovec-backed and set_fs(KERNEL_DS) is almost certainly a bug, not a cause to skip checks), but that's for later series. For now let's keep might_fault(). - KASAN checks belong in copyin/copyout - at the same level where other iov_iter flavours would've hit them in memcpy(). - object size checks should apply to *all* iov_iter flavours, not just iovec-backed ones. There are two groups of primitives - one gets the kernel object described as pointer + size (copy_to_iter(), etc.) while another gets it as page + offset + size (copy_page_to_iter(), etc.) For the first group the checks are best done where we actually have a chance to find the object size. In other words, those belong in inline wrappers in uio.h, before calling into iov_iter.c. Same kind as we have for inlined part of copy_to_user(). For the second group there is no object to look at - offset in page is just a number, it bears no type information. So we do them in the common helper called by iov_iter.c primitives of that kind. All it currently does is checking that we are not trying to access outside of the compound page; eventually we might want to add some sanity checks on the page involved. So the things we need in copyin/copyout part of iov_iter.c do not quite match anything in uaccess.h (we want no zeroing, we *do* want access_ok() and KASAN and we want no might_fault() or object size checks done on that level). OTOH, these needs are simple enough to provide a couple of helpers (static in iov_iter.c) doing just what we need..." * 'uaccess-work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: iov_iter: saner checks on copyin/copyout iov_iter: sanity checks for copy to/from page primitives iov_iter/hardening: move object size checks to inlined part copy_{to,from}_user(): consolidate object size checks copy_{from,to}_user(): move kasan checks and might_fault() out-of-line
2017-07-07Merge tag 'for-linus-v4.13-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull Writeback error handling updates from Jeff Layton: "This pile represents the bulk of the writeback error handling fixes that I have for this cycle. Some of the earlier patches in this pile may look trivial but they are prerequisites for later patches in the series. The aim of this set is to improve how we track and report writeback errors to userland. Most applications that care about data integrity will periodically call fsync/fdatasync/msync to ensure that their writes have made it to the backing store. For a very long time, we have tracked writeback errors using two flags in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a writeback error occurs (via mapping_set_error) and are cleared as a side-effect of filemap_check_errors (as you noted yesterday). This model really sucks for userland. Only the first task to call fsync (or msync or fdatasync) will see the error. Any subsequent task calling fsync on a file will get back 0 (unless another writeback error occurs in the interim). If I have several tasks writing to a file and calling fsync to ensure that their writes got stored, then I need to have them coordinate with one another. That's difficult enough, but in a world of containerized setups that coordination may even not be possible. But wait...it gets worse! The calls to filemap_check_errors can be buried pretty far down in the call stack, and there are internal callers of filemap_write_and_wait and the like that also end up clearing those errors. Many of those callers ignore the error return from that function or return it to userland at nonsensical times (e.g. truncate() or stat()). If I get back -EIO on a truncate, there is no reason to think that it was because some previous writeback failed, and a subsequent fsync() will (incorrectly) return 0. This pile aims to do three things: 1) ensure that when a writeback error occurs that that error will be reported to userland on a subsequent fsync/fdatasync/msync call, regardless of what internal callers are doing 2) report writeback errors on all file descriptions that were open at the time that the error occurred. This is a user-visible change, but I think most applications are written to assume this behavior anyway. Those that aren't are unlikely to be hurt by it. 3) document what filesystems should do when there is a writeback error. Today, there is very little consistency between them, and a lot of cargo-cult copying. We need to make it very clear what filesystems should do in this situation. To achieve this, the set adds a new data type (errseq_t) and then builds new writeback error tracking infrastructure around that. Once all of that is in place, we change the filesystems to use the new infrastructure for reporting wb errors to userland. Note that this is just the initial foray into cleaning up this mess. There is a lot of work remaining here: 1) convert the rest of the filesystems in a similar fashion. Once the initial set is in, then I think most other fs' will be fairly simple to convert. Hopefully most of those can in via individual filesystem trees. 2) convert internal waiters on writeback to use errseq_t for detecting errors instead of relying on the AS_* flags. I have some draft patches for this for ext4, but they are not quite ready for prime time yet. This was a discussion topic this year at LSF/MM too. If you're interested in the gory details, LWN has some good articles about this: https://lwn.net/Articles/718734/ https://lwn.net/Articles/724307/" * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: btrfs: minimal conversion to errseq_t writeback error reporting on fsync xfs: minimal conversion to errseq_t writeback error reporting ext4: use errseq_t based error handling for reporting data writeback errors fs: convert __generic_file_fsync to use errseq_t based reporting block: convert to errseq_t based writeback error tracking dax: set errors in mapping when writeback fails Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error fs: new infrastructure for writeback error handling and reporting lib: add errseq_t type and infrastructure for handling it mm: don't TestClearPageError in __filemap_fdatawait_range mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails jbd2: don't clear and reset errors after waiting on writeback buffer: set errors in mapping at the time that the error occurs fs: check for writeback errors after syncing out buffers in generic_file_fsync buffer: use mapping_set_error instead of setting the flag mm: fix mapping_set_error call in me_pagecache_dirty