|
As we want to re-use intel_vgpu_shadow_page in building the scratch page
table, and we don't want to put scratch page table pages into the hash
table, a new param is introduced to give the caller a choice to decide
whether a shadow page should be put into the hash table.
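A sketch of the resulting shape (the function, member, and parameter
names here are illustrative, not necessarily the exact ones in the
patch):
  static int init_shadow_page(struct intel_vgpu *vgpu,
                              struct intel_vgpu_shadow_page *p,
                              int type, bool hash)
  {
          /* ... map the page and fill in p->mfn as before ... */
          if (hash)
                  hash_add(vgpu->gtt.shadow_page_hash_table,
                           &p->node, p->mfn);
          return 0;
  }
Scratch page table pages would then pass false so they are never
tracked in the hash table.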
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
As there is already an I915_GTT_PAGE_SIZE macro in i915, let GVT-g use it
as well. This patch also renames some GTT macros with an additional prefix.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Since a lot of emulation logic needs to convert the offset of a ring
register into a ring id, we export it for other callers that might need it.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
As the data structure of "intel_vgpu_guest_page" will become much heavier
in the future, it's better to factor out the guest memory page tracking
mechanism as early as possible.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
1) Use standard i915 GEM object sequence to access the shadow batch buffer.
2) Manage i915 vma life cycle to solve one FIXME.
v2:
- Refine code structure.
- Refine the usage of GEM APIs.
- Add the missing lock/unlock in release_shadow_batch_buffer.
Tested on my SKL NUC.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Return an error code if something is wrong, and pass the size of the
batch buffer back through the pointer.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Replace the plain bit usage with BIT() to make Klocwork happy.
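For illustration, the before/after of the pattern (the variable is a
placeholder, not the exact hunk):
  u32 mask = 0;
  mask |= 1 << 3;     /* before: plain bit usage */
  mask |= BIT(3);     /* after: BIT(3) expands to (1UL << 3) */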
Cc: Deng Hongyi <hongyi.deng@intel.com>
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
We need a debugfs entry to expose some debug information for gvt and
vGPUs. The first tool to be added is mmio-diff, which helps to find
differences between host and vGPU mmio values. It's useful for platform
enabling.
This patch just adds the basic debugfs infrastructure; each vGPU has its
own sub-folder. Two simple attributes are created as a template.
.
├── num_tracked_mmio
├── vgpu1
| └── active
└── vgpu2
└── active
Signed-off-by: Changbin Du <changbin.du@intel.com>
Reviewed-by: Zhi Wang <zhi.a.wang@intel.com>
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
All the vGPU type related code in kvmgt will be moved into the
gvt.c/gvt.h files, and the common vGPU type related interfaces will be
called instead.
v2:
- intel_gvt_{init,cleanup}_vgpu_type_groups are initialized in the
  gvt part. (Wang, Zhi)
Signed-off-by: fred gao <fred.gao@intel.com>
Reviewed-by: Zhi Wang <zhi.a.wang@intel.com>
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
In this patch, all the vGPU type related code will be merged into the
same gvt file, and the common interface will be exposed to both
XenGT and KvmGT.
v2:
- remove the useless mdev_* gvt_ops.
- add get_gvt_attr ops for the MPT module.
- intel_gvt_{init,cleanup}_vgpu_type_groups are initialized in the
  gvt part. (Wang, Zhi)
- set gvt_vgpu_type_groups[i] to NULL. (Zhang, Xiong)
Signed-off-by: fred gao <fred.gao@intel.com>
Reviewed-by: Zhi Wang <zhi.a.wang@intel.com>
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Move clean_workloads() into scheduler.c since it's not specific to
execlist.
v2:
- Remove clean_workloads in intel_vgpu_select_submission_ops. (Zhenyu)
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Introduce a generic API to reset the vGPU virtual submission interface.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Introduce vGPU submission ops to support easily switching the submission
mode of a vGPU between different OSes.
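The ops table looks roughly like this (a sketch; the exact members may
differ from the patch):
  struct intel_vgpu_submission_ops {
          const char *name;
          int  (*init)(struct intel_vgpu *vgpu);
          void (*clean)(struct intel_vgpu *vgpu);
          void (*reset)(struct intel_vgpu *vgpu,
                        unsigned long engine_mask);
  };
Switching the submission mode then means pointing a vGPU at a different
ops table instead of hard-coding execlist calls.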
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Now the function has been moved into scheduler.c. The extra declaration
is not necessary.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Move common vGPU workload creation functions into scheduler.c since
they are not specific to execlist emulation.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Move common workload preparation into prepare_workload() in scheduler.c,
as they are not specific to execlist emulation.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Factor out prepare_workload() for the following re-factor.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Factor out the vGPU workload creation/destruction functions since they
are not specific to execlist emulation.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
It's better to enable/disable and classify gvt debug info dynamically.
This patch changes it to dyndbg so each item can be dynamically enabled
or disabled. All gvt logging can be enabled by:
$ echo 'file *gvt* +p' > /sys/kernel/debug/dynamic_debug/control
Signed-off-by: Shuo Liu <shuo.a.liu@intel.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
|
|
An earlier fix changed the return type of find_bb_size; however, the
integer return value is being assigned to an unsigned int, so the
negative error check can never trigger. Make bb_size an int to fix this.
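The bug class, sketched (not the exact driver code):
  unsigned int bb_size;
  bb_size = find_bb_size(s);  /* returns int; may be -EINVAL */
  if (bb_size < 0)            /* always false for an unsigned type */
          return -EINVAL;     /* unreachable error path */
Declaring bb_size as int lets the negative check actually trigger.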
Detected by CoverityScan CID#1456886 ("Unsigned compared against 0")
Fixes: 1e3197d6ad73 ("drm/i915/gvt: Refine error handling for perform_bb_shadow")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
|
|
When a scan error occurs in submit_context, decrease the mm ref count
and free the workload struct before the workload is abandoned.
v2:
- submit_context related code should be combined together. (Zhenyu)
v3:
- free all the unsubmitted workloads. (Zhenyu)
v4:
- refine the clean path. (Zhenyu)
v5:
- polish the title. (Zhenyu)
Signed-off-by: fred gao <fred.gao@intel.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
|
|
When a scan error occurs in dispatch_workload, check the healthy state
and free all the queued workloads before failsafe mode is entered.
Signed-off-by: fred gao <fred.gao@intel.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
|
|
Generally, there are 3 types of errors during command scan: a) an
unknown command, returning EBADRQC; b) a command accessing an invalid
address, returning EFAULT; c) an unexpected force-nonpriv command,
returning EPERM. The healthy state can later be judged from the
returned error.
v2:
- remove some internal i915 errors rating. (Zhenyu)
v3:
- the healthy state is judged through the internal defined return
error. (Zhenyu)
- force non priv cmd error can be ignored. (Kevin)
v4:
- reuse standard defined errnos instead of recreating them, e.g. EBADRQC
for unknown cmd, EFAULT for invalid address, EPERM for nonpriv. (Zhenyu)
v5:
- remove some irrelevant code for the patch.
- fix typo of vgpu_is_vm_unhealthy. (Zhenyu)
v6:
- move the healthy check and failsafe code into another patch. (Zhenyu)
v7:
- polish title and commit message. (Zhenyu)
Signed-off-by: fred gao <fred.gao@intel.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
|
|
Theoretically, the largest bulk of commands in the ring buffer of an
engine might be the first submission, which usually contains a lot of
commands to initialize the HW. Even after removing the initial
allocation of the ring scan buffer and letting krealloc() do everything
we need, we still have a good chance of getting a buffer of suitable
size on the first submission.
Tested on my SKL NUC.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Move the ring scan buffers into intel_vgpu_submission since they belong
to the vGPU submission path.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
"reserved" means reserve something from somewhere. Actually they are
buffers used by command scanner. Rename it to ring_scan_buffer.
v2:
- Remove the usage of an extra variable. (Zhenyu)
Fixes: 0a53bc07f044 ("drm/i915/gvt: Separate cmd scan from request allocation")
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
The pointer that points to the original memory must never directly take
the return value of krealloc(); if krealloc() fails, the original buffer
would be leaked.
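The classic pitfall being described, sketched:
  /* buggy: on failure krealloc() returns NULL, and the only
   * reference to the original allocation is overwritten (leak) */
  buf = krealloc(buf, new_size, GFP_KERNEL);
  /* correct: keep the original pointer until success */
  tmp = krealloc(buf, new_size, GFP_KERNEL);
  if (!tmp)
          return -ENOMEM;  /* buf is still valid and owned */
  buf = tmp;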
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Move tlb_handle_pending into intel_vgpu_submission since it belongs to
the vGPU submission path.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Introduce intel_vgpu_submission to hold all the submission-related
members that previously lived directly in struct intel_vgpu.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Move the vGPU workload cache initialization/de-initialization into
intel_vgpu_{setup, clean}_submission() since they are not specific to
execlist emulation.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
To move workload related functions into scheduler.c, an expected way is
to collect all the init/clean functions related to vGPU workload
submission into fewer functions.
Rename intel_vgpu_{init, clean}_gvt_context() for the above usage in the
future.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
The context descriptors in elsp_dwords are stored in reversed order, and
the definition of the context descriptor is also reversed. The reversed
layout is hard to use and might cause misunderstanding. Put them in the
right order for the following code re-factoring.
Tested on my SKL NUC.
Signed-off-by: Zhi Wang <zhi.a.wang@intel.com>
|
|
Previously the opregion was emulated with a copy from the host, which
leads to some display bugs, such as guest resolution adjustment failures
caused by the host opregion failing to claim port D support. A fake
opregion table is now provided to fully emulate the opregion and meet
the guest port requirements.
v1 - initial patch
v2 - reformat opregion array with 0x%02x output
v3 - opregion array removed, with opregion generation on host
initialization
v4 - rebased the v3 patch from the stable branch to the staging branch,
which also has a different struct child_device_config, and addressed v3
review comments.
Signed-off-by: Xiaolin Zhang <xiaolin.zhang@intel.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
|
|
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2 updates
- almost all of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (131 commits)
memory hotplug: fix comments when adding section
mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
mm: simplify nodemask printing
mm,oom_reaper: remove pointless kthread_run() error check
mm/page_ext.c: check if page_ext is not prepared
writeback: remove unused function parameter
mm: do not rely on preempt_count in print_vma_addr
mm, sparse: do not swamp log with huge vmemmap allocation failures
mm/hmm: remove redundant variable align_end
mm/list_lru.c: mark expected switch fall-through
mm/shmem.c: mark expected switch fall-through
mm/page_alloc.c: broken deferred calculation
mm: don't warn about allocations which stall for too long
fs: fuse: account fuse_inode slab memory as reclaimable
mm, page_alloc: fix potential false positive in __zone_watermark_ok
mm: mlock: remove lru_add_drain_all()
mm, sysctl: make NUMA stats configurable
shmem: convert shmem_init_inodecache() to void
Unify migrate_pages and move_pages access checks
mm, pagevec: rename pagevec drained field
...
|
|
Merge branch 'drm-next-4.15-dc' of git://people.freedesktop.org/~agd5f/linux into drm-next
Various fixes for DC for 4.15.
* 'drm-next-4.15-dc' of git://people.freedesktop.org/~agd5f/linux:
drm/amd/display: fix MST link training fail division by 0
drm/amd/display: Fix formatting for null pointer dereference fix
drm/amd/display: Remove dangling planes on dc commit state
drm/amd/display: add flip_immediate to commit update for stream
drm/amd/display: Miss register MST encoder cbs
drm/amd/display: Fix warnings on S3 resume
drm/amd/display: use num_timing_generator instead of pipe_count
drm/amd/display: use configurable FBC option in dm
drm/amd/display: fix AZ clock not enabled before program AZ endpoint
amdgpu/dm: Don't use DRM_ERROR in amdgpu_dm_atomic_check
|
|
Here, pfn_to_node should be page_to_nid.
Link: http://lkml.kernel.org/r/1510735205-22540-1-git-send-email-fan.du@intel.com
Signed-off-by: Fan Du <fan.du@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
free_area_init_node() calls alloc_node_mem_map(), but this function does
nothing unless we have CONFIG_FLAT_NODE_MEM_MAP.
As a cleanup, we can move the "#ifdef CONFIG_FLAT_NODE_MEM_MAP" within
alloc_node_mem_map() out of the function, and define an empty
alloc_node_mem_map() when CONFIG_FLAT_NODE_MEM_MAP is not present.
This also moves the printk that lies within the "#ifdef
CONFIG_FLAT_NODE_MEM_MAP" block from free_area_init_node() to
alloc_node_mem_map(), getting rid of the "#ifdef
CONFIG_FLAT_NODE_MEM_MAP" in free_area_init_node().
[akpm@linux-foundation.org: clean up the printk while we're there]
Link: http://lkml.kernel.org/r/20171114111935.GA11758@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@techadventures.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
alloc_warn() and dump_header() have to explicitly handle NULL nodemask
which forces both paths to use pr_cont. We can do better. printk
already handles NULL pointers properly so all we need is to teach
nodemask_pr_args to handle NULL nodemask carefully. This allows
simplification of both alloc_warn() and dump_header() and gets rid of
pr_cont altogether.
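The idea, sketched (the real definition lives in
include/linux/nodemask.h and may use helper macros):
  /* printk prints a NULL pointer safely, so a NULL maskp can be
   * passed straight through with a zero-length bitmap spec */
  #define nodemask_pr_args(maskp) \
          ((maskp) ? MAX_NUMNODES : 0), \
          ((maskp) ? (maskp)->bits : NULL)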
This patch has been motivated by a patch from Joe Perches
http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com
[akpm@linux-foundation.org: fix tile warning, per Arnd]
Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Joe Perches <joe@perches.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since oom_init() is called before userspace processes start, memory
allocation failure for creating the OOM reaper kernel thread will let
the OOM killer call panic() rather than wake up the OOM reaper.
Link: http://lkml.kernel.org/r/1510137800-4602-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
online_page_ext() and page_ext_init() allocate page_ext for each
section, but they do not allocate if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn). Then section->page_ext remains as NULL.
lookup_page_ext() checks for NULL only if CONFIG_DEBUG_VM is enabled.
For a valid PFN, __set_page_owner() will try to get page_ext through
lookup_page_ext(). Without CONFIG_DEBUG_VM, lookup_page_ext() will
misuse the NULL pointer as the value 0, which incurs an invalid address
access.
This is the panic example when PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext: section->page_ext is NULL, and
get_entry() returned the invalid page_ext address 0x1DFA000 for PFN
0x13FC00.
To avoid this panic, the CONFIG_DEBUG_VM guard should be removed so that
page_ext is checked at all times.
Unable to handle kernel paging request at virtual address 01dfa014
------------[ cut here ]------------
Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Modules linked in:
PC is at __set_page_owner+0x48/0x78
LR is at __set_page_owner+0x44/0x78
__set_page_owner+0x48/0x78
get_page_from_freelist+0x880/0x8e8
__alloc_pages_nodemask+0x14c/0xc48
__do_page_cache_readahead+0xdc/0x264
filemap_fault+0x2ac/0x550
ext4_filemap_fault+0x3c/0x58
__do_fault+0x80/0x120
handle_mm_fault+0x704/0xbb0
do_page_fault+0x2e8/0x394
do_mem_abort+0x88/0x124
Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").
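The fix pattern, sketched (the real hunks drop the CONFIG_DEBUG_VM
guard around checks like this):
  struct page_ext *page_ext = lookup_page_ext(page);
  /* checked unconditionally, not only under CONFIG_DEBUG_VM */
  if (unlikely(!page_ext))
          return;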
Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com
Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging")
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: <stable@vger.kernel.org> [depends on f86e427197, see above]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The parameter `struct bdi_writeback *wb` is not used in the function
body. Remove it.
Link: http://lkml.kernel.org/r/1509685485-15278-1-git-send-email-wanglong19@meituan.com
Signed-off-by: Wang Long <wanglong19@meituan.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The preempt count check on print_vma_addr has been added by commit
e8bff74afbdb ("x86: fix "BUG: sleeping function called from invalid
context" in print_vma_addr()") and it relied on the elevated preempt
count from preempt_conditional_sti because preempt_count check doesn't
work on non preemptive kernels by default.
The code has evolved though and commit d99e1bd175f4 ("x86/entry/traps:
Refactor preemption and interrupt flag handling") has replaced
preempt_conditional_sti by an explicit preempt_disable which is noop on
!PREEMPT so the check in print_vma_addr is broken.
Fix the issue by using trylock on mmap_sem rather than checking the
preempt count. The allocation we are relying on has to be GFP_NOWAIT as
well. There is a chance that we won't dump the vma state if the lock is
contended or memory is short, but this is an acceptable outcome and much
less fragile than the broken preemption check or tricks around it.
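The shape of the fix, sketched (simplified from print_vma_addr()):
  struct mm_struct *mm = current->mm;
  /* never sleep or spin here; just skip the dump if contended */
  if (!down_read_trylock(&mm->mmap_sem))
          return;
  /* ... look up and print the vma; any allocation on this path
   * must use GFP_NOWAIT ... */
  up_read(&mm->mmap_sem);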
Link: http://lkml.kernel.org/r/20171106134031.g6dbelg55mrbyc6i@dhcp22.suse.cz
Fixes: d99e1bd175f4 ("x86/entry/traps: Refactor preemption and interrupt flag handling")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Yang Shi <yang.s@alibaba-inc.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
While doing memory hotplug tests under heavy memory pressure, we have
noticed too many page allocation failures when allocating the vmemmap
memmap backed by huge pages:
kworker/u3072:1: page allocation failure: order:9, mode:0x24084c0(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO)
[...]
Call Trace:
dump_trace+0x59/0x310
show_stack_log_lvl+0xea/0x170
show_stack+0x21/0x40
dump_stack+0x5c/0x7c
warn_alloc_failed+0xe2/0x150
__alloc_pages_nodemask+0x3ed/0xb20
alloc_pages_current+0x7f/0x100
vmemmap_alloc_block+0x79/0xb6
__vmemmap_alloc_block_buf+0x136/0x145
vmemmap_populate+0xd2/0x2b9
sparse_mem_map_populate+0x23/0x30
sparse_add_one_section+0x68/0x18e
__add_pages+0x10a/0x1d0
arch_add_memory+0x4a/0xc0
add_memory_resource+0x89/0x160
add_memory+0x6d/0xd0
acpi_memory_device_add+0x181/0x251
acpi_bus_attach+0xfd/0x19b
acpi_bus_scan+0x59/0x69
acpi_device_hotplug+0xd2/0x41f
acpi_hotplug_work_fn+0x1a/0x23
process_one_work+0x14e/0x410
worker_thread+0x116/0x490
kthread+0xbd/0xe0
ret_from_fork+0x3f/0x70
and we do see many of those because essentially every allocation fails
for each memory section. This is an excessive way to tell the user that
there is nothing to really worry about because we do have a fallback
mechanism to use base pages. The only downside might be a performance
degradation due to TLB pressure.
This patch changes vmemmap_alloc_block() to use __GFP_NOWARN and warn
explicitly once on the first allocation failure. This will reduce the
noise in the kernel log considerably, while we still have an indication
that performance might be impacted.
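The pattern, sketched (simplified from vmemmap_alloc_block()):
  static bool warned;
  struct page *page;
  /* suppress the per-failure splat; base pages are the fallback */
  page = alloc_pages_node(node, gfp_mask | __GFP_NOWARN, order);
  if (page)
          return page_address(page);
  if (!warned) {
          warn_alloc(gfp_mask, NULL,
                     "vmemmap alloc failure: order:%u", order);
          warned = true;
  }
  return NULL;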
[mhocko@kernel.org: forgot to git add the follow up fix]
Link: http://lkml.kernel.org/r/20171107090635.c27thtse2lchjgvb@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20171106092228.31098-1-mhocko@kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Joe Perches <joe@perches.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Variable align_end is assigned a value but it is never read, so the
variable is redundant and can be removed. Cleans up the clang warning:
Value stored to 'align_end' is never read
Link: http://lkml.kernel.org/r/20171017143837.23207-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In preparation for enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
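The annotation is just a comment the compiler recognizes (the case
labels here are illustrative):
  switch (ret) {
  case LRU_REMOVED_RETRY:
          isolated++;
          /* fall through */
  case LRU_REMOVED:
          nr_items--;
          break;
  default:
          break;
  }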
Link: http://lkml.kernel.org/r/20171020190754.GA24332@embeddedor.com
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Link: http://lkml.kernel.org/r/20171020191058.GA24427@embeddedor.com
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In reset_deferred_meminit() we determine number of pages that must not
be deferred. We initialize pages for at least 2G of memory, but also
pages for reserved memory in this node.
The reserved memory is determined in this function:
memblock_reserved_memory_within(), which operates over physical
addresses, and returns size in bytes. However, reset_deferred_meminit()
assumes that this function operates with pfns and returns a page
count.
The result is that in the best case machine boots slower than expected
due to initializing more pages than needed in single thread, and in the
worst case panics because fewer than needed pages are initialized early.
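The unit fix, sketched (variable names are illustrative):
  phys_addr_t start = PFN_PHYS(pgdat->node_start_pfn);
  phys_addr_t end = PFN_PHYS(pgdat->node_start_pfn + max_pfns);
  u64 reserved_bytes;
  /* memblock_reserved_memory_within() takes physical addresses
   * and returns bytes, so convert before mixing with page counts */
  reserved_bytes = memblock_reserved_memory_within(start, end);
  nr_pages += reserved_bytes >> PAGE_SHIFT;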
Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 63f53dea0c98 ("mm: warn about allocations which stall for too
long") was a great step toward reducing the possibility of silent hang-up
problems caused by memory allocation stalls. But this commit reverts it,
because it is possible to trigger an OOM lockup and/or soft lockups when
many threads concurrently call warn_alloc() (in order to warn about
memory allocation stalls) due to the current implementation of printk(),
and it is difficult to obtain useful information due to the limitations
of the synchronous warning approach.
The current printk() implementation flushes all pending logs using the
context of the thread which called console_unlock(). printk() should be
able to flush all pending logs eventually unless somebody continues
appending to the printk() buffer.
Since warn_alloc() started appending to the printk() buffer while waiting
for oom_kill_process() to make forward progress when oom_kill_process()
is processing pending logs, it became possible for warn_alloc() to force
oom_kill_process() to loop inside printk(). As a result, warn_alloc()
significantly increased the possibility of preventing oom_kill_process()
from making forward progress.
---------- Pseudo code start ----------
Before warn_alloc() was introduced:
retry:
if (mutex_trylock(&oom_lock)) {
while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
}
// Send SIGKILL here.
mutex_unlock(&oom_lock)
}
goto retry;
After warn_alloc() was introduced:
retry:
if (mutex_trylock(&oom_lock)) {
while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
}
// Send SIGKILL here.
mutex_unlock(&oom_lock)
} else if (waited_for_10seconds()) {
atomic_inc(&printk_pending_logs);
}
goto retry;
---------- Pseudo code end ----------
Although waited_for_10seconds() becomes true once per 10 seconds, an
unbounded number of threads can call waited_for_10seconds() at the same
time. Also, since threads doing waited_for_10seconds() keep running an
almost-busy loop, the thread doing print_one_log() gets little CPU
time. Therefore, this situation can be simplified like
---------- Pseudo code start ----------
retry:
if (mutex_trylock(&oom_lock)) {
while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
}
// Send SIGKILL here.
mutex_unlock(&oom_lock)
} else {
atomic_inc(&printk_pending_logs);
}
goto retry;
---------- Pseudo code end ----------
when printk() is called faster than print_one_log() can process a log.
One possible mitigation would be to introduce a new lock in order to
make sure that no other series of printk() (either oom_kill_process() or
warn_alloc()) can append to printk() buffer when one series of printk()
(either oom_kill_process() or warn_alloc()) is already in progress.
Such serialization will also help obtaining kernel messages in readable
form.
---------- Pseudo code start ----------
retry:
if (mutex_trylock(&oom_lock)) {
mutex_lock(&oom_printk_lock);
while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
}
// Send SIGKILL here.
mutex_unlock(&oom_printk_lock);
mutex_unlock(&oom_lock)
} else {
if (mutex_trylock(&oom_printk_lock)) {
atomic_inc(&printk_pending_logs);
mutex_unlock(&oom_printk_lock);
}
}
goto retry;
---------- Pseudo code end ----------
But this commit does not go that direction, for we don't want to
introduce a new lock dependency, and we are unlikely to obtain useful
information even if we serialized oom_kill_process() and warn_alloc().
The synchronous approach is prone to unexpected results (e.g. too late
[1], too frequent [2], overlooked [3]). As far as I know, warn_alloc()
never helped with providing information other than "something is going
wrong".
I want to consider an asynchronous approach which can obtain information
during stalls with possibly relevant threads (e.g. the owner of
oom_lock and kswapd-like threads) and serve as a trigger for actions
(e.g. turn on/off tracepoints, ask libvirt daemon to take a memory dump
of stalling KVM guest for diagnostic purpose).
This commit temporarily loses the ability to report, e.g., an OOM
lockup caused by being unable to invoke the OOM killer for a !__GFP_FS
allocation request. But the asynchronous approach will be able to detect
such a situation and emit a warning. Thus, let's remove warn_alloc().
[1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
[2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
[3] commit db73ee0d46379922 ("mm, vmscan: do not loop on too_many_isolated for ever")
Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: yuwang.yuwang <yuwang.yuwang@alibaba-inc.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fuse inodes are currently included in the unreclaimable slab counts -
SUnreclaim in /proc/meminfo, slab_unreclaimable in /proc/vmstat and the
per-cgroup memory.stat. But they are reclaimable just like other
filesystems' inodes, and /proc/sys/vm/drop_caches frees them easily.
Mark the slab cache reclaimable.
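The change amounts to one flag on the inode cache, sketched (the
surrounding flags may differ from fs/fuse/inode.c):
  fuse_inode_cachep = kmem_cache_create("fuse_inode",
                          sizeof(struct fuse_inode), 0,
                          SLAB_HWCACHE_ALIGN | SLAB_RECLAIM_ACCOUNT,
                          fuse_inode_init_once);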
Link: http://lkml.kernel.org/r/20171102202727.12539-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since commit 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for
order-0 allocations"), the __zone_watermark_ok() check for high-order
allocations will shortcut the per-migratetype free list checks for
ALLOC_HARDER allocations and return true as long as there's a free page
of any migratetype. The intention is that ALLOC_HARDER can allocate
from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.
However, as a side effect, the watermark check will then also return
true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list. Since the
allocation cannot actually obtain isolated pages, and might not be able
to obtain CMA pages, this can result in a false positive.
The condition should be rare and perhaps the outcome is not a fatal one.
Still, it's better if the watermark check is correct. There also
shouldn't be a performance tradeoff here.
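A sketch of the corrected per-order check (simplified from
__zone_watermark_ok()):
  struct free_area *area = &z->free_area[o];
  int mt;
  if (!area->nr_free)
          continue;  /* no pages at this order at all */
  /* only these migratetypes are usable by normal allocations */
  for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
          if (!list_empty(&area->free_list[mt]))
                  return true;
  }
  /* ALLOC_HARDER may additionally use the highatomic reserve;
   * isolated (and CMA) lists no longer cause a false positive */
  if ((alloc_flags & ALLOC_HARDER) &&
      !list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
          return true;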
Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz
Fixes: 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for order-0 allocations")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|