Age | Commit message | Author
2018-04-09 | afs: Split the directory content defs into a header | David Howells
Split the directory content definitions into a header file so that they can be used by multiple .c files. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09 | afs: Fix directory handling | David Howells
AFS directories are structured blobs that are downloaded just like files and then parsed by the lookup and readdir code and, as such, are currently handled in the pagecache like any other file, with the entire directory content being thrown away each time the directory changes.

However, since the blob is a known structure and since the data version counter on a directory increases by exactly one for each change committed to that directory, we can actually edit the directory locally rather than fetching it from the server after each locally-induced change.

What we can't do, though, is mix data from the server and data from the client since the server is technically at liberty to rearrange or compress a directory if it sees fit, provided it updates the data version number when it does so and breaks the callback (i.e. sends a notification).

Further, lookup with lookup-ahead, readdir and, when it arrives, local editing are likely to want to scan the whole of a directory.

So directory handling needs to be improved to maintain the coherency of the directory blob prior to permitting local directory editing. To this end:

 (1) If any directory page gets discarded, invalidate and reread the entire directory.

 (2) If readpage notes, when it fetches a single page, that the version number has changed, the entire directory is flagged for invalidation.

 (3) Read as much of the directory in one go as we can.

Note that this removes local caching of directories in fscache for the moment as we can't pass the pages to fscache_read_or_alloc_pages() since page->lru is in use by the LRU.

Signed-off-by: David Howells <dhowells@redhat.com>
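The "increases by exactly one" argument in the message above can be sketched as a small check. This is only an illustration of the reasoning, not code from the patch (which, as the message says, comes before local editing is permitted); `struct dir_cache` and `dir_may_edit_locally()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached-directory state; not the kAFS structures. */
struct dir_cache {
	uint64_t data_version;	/* version the cached blob corresponds to */
	bool valid;		/* false => the whole blob must be refetched */
};

/*
 * Called after a locally issued change (create, unlink, ...) has been
 * committed and the server has reported the directory's new data version.
 * Returns true if the cached blob may be edited in place.
 */
bool dir_may_edit_locally(struct dir_cache *dir, uint64_t new_version)
{
	if (dir->valid && new_version == dir->data_version + 1) {
		/* Only our own change happened: the cached blob is still coherent. */
		dir->data_version = new_version;
		return true;
	}
	/* Some other change intervened, so server and client data must not be mixed. */
	dir->valid = false;
	return false;
}
```

A version jump of more than one means another client (or the server) also changed the directory, so the sketch drops the whole blob rather than patching it.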
2018-04-09 | afs: Split the dynroot stuff out and give it its own ops tables | David Howells
Split the AFS dynamic root stuff out of the main directory handling file and into its own file as they share little in common. The dynamic root code also gets its own dentry and inode ops tables. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09 | afs: Keep track of invalid-before version for dentry coherency | David Howells
Each afs dentry is tagged with the version that the parent directory was at last time it was validated and, currently, if this differs, the directory is scanned and the dentry is refreshed. However, this leads to an excessive amount of revalidation on directories that get modified on the client without conflict with another client. We know there's no conflict because the parent directory's data version number got incremented by exactly 1 on any create, mkdir, unlink, etc., therefore we can trust the current state of the unaffected dentries when we perform a local directory modification. Optimise by keeping track of the last version of the parent directory that was changed outside of the client in the parent directory's vnode and using that to validate the dentries rather than the current version. Signed-off-by: David Howells <dhowells@redhat.com>
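A minimal sketch of the invalid-before idea described above, assuming hypothetical names (`struct parent_dir`, `dir_note_change()`, `dentry_still_valid()`) and a plain unsigned comparison; the real code may handle counter wrap-around differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-directory state; field names are illustrative only. */
struct parent_dir {
	uint64_t data_version;		/* current version of the directory */
	uint64_t invalid_before;	/* version of the last change made elsewhere */
};

/* Record a new directory version; 'local' is true if this client made the change. */
void dir_note_change(struct parent_dir *dir, uint64_t new_version, bool local)
{
	/* Anything other than a local +1 step means someone else touched it. */
	if (!local || new_version != dir->data_version + 1)
		dir->invalid_before = new_version;
	dir->data_version = new_version;
}

/* A dentry tagged with 'dentry_version' is trusted unless it predates a remote change. */
bool dentry_still_valid(const struct parent_dir *dir, uint64_t dentry_version)
{
	return dentry_version >= dir->invalid_before;
}
```

With this scheme a run of purely local creates and unlinks advances `data_version` but leaves `invalid_before` alone, so unaffected dentries are not rescanned.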
2018-04-09 | afs: Rearrange status mapping | David Howells
Rearrange the AFSFetchStatus to inode attribute mapping code in a number of ways: (1) Use an XDR structure rather than a series of incremented pointer accesses when decoding an AFSFetchStatus object. This allows out-of-order decode. (2) Don't store the if_version value but rather just check it and abort if it's not something we can handle. (3) Store the owner and group in the status record as raw values rather than converting them to kuid/kgid. Do that when they're mapped into i_uid/i_gid. (4) Validate the type and abort code up front and abort if they're wrong. (5) Split the inode attribute setting out into its own function from the XDR decode of an AFSFetchStatus object. This allows it to be called from elsewhere too. (6) Differentiate changes to data from changes to metadata. (7) Use the split-out attribute mapping function from afs_iget(). Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09 | afs: Make it possible to get the data version in readpage | David Howells
Store the data version number indicated by an FS.FetchData op into the read request structure so that it's accessible by the page reader. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09 | afs: Init inode before accessing cache | David Howells
We no longer parse symlinks when we get the inode to determine if this symlink is actually a mountpoint as we detect that by examining the mode instead (symlinks are always 0777 and mountpoints 0644). Access the cache after mapping the status so that we don't have to manually set the inode size now. Note that this may need adjusting if the disconnected operation is implemented as the file metadata may have to be obtained from the cache. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09 | afs: Introduce a statistics proc file | David Howells
Introduce a proc file that displays a bunch of statistics for the AFS filesystem in the current network namespace. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09 | afs: Dump bad status record | David Howells
Dump an AFS FileStatus record that is detected as invalid. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09 | scsi: qla2xxx: Fix small memory leak in qla2x00_probe_one on probe failure | Bill Kuzeja
The code that fixes the crashes in the following commit introduced a small memory leak:

    commit 6a2cf8d3663e ("scsi: qla2xxx: Fix crashes in qla2x00_probe_one on probe failure")

Fixing this requires a bit of reworking, which I've explained. Also provide some code cleanup.

There is a small window in qla2x00_probe_one where if qla2x00_alloc_queues fails, we end up never freeing req and rsp and leak 0xc0 and 0xc8 bytes respectively (the sizes of req and rsp).

I originally put in checks to test for this condition which were based on the incorrect assumption that if ha->rsp_q_map and ha->req_q_map were allocated, then rsp and req were allocated as well. This is incorrect. There is a window between these allocations:

    ret = qla2x00_mem_alloc(ha, req_length, rsp_length, &req, &rsp);
            goto probe_hw_failed;

    [if successful, both rsp and req allocated]

    base_vha = qla2x00_create_host(sht, ha);
            goto probe_hw_failed;

    ret = qla2x00_request_irqs(ha, rsp);
            goto probe_failed;

    if (qla2x00_alloc_queues(ha, req, rsp)) {
            goto probe_failed;

    [if successful, now ha->rsp_q_map and ha->req_q_map allocated]

To simplify this, we should just set req and rsp to NULL after we free them. Sounds simple enough? The problem is that req and rsp are pointers defined in the qla2x00_probe_one and they are not always passed by reference to the routines that free them.

Here are paths which can free req and rsp:

    PATH 1:
        qla2x00_probe_one
            ret = qla2x00_mem_alloc(ha, req_length, rsp_length, &req, &rsp);
            [req and rsp are passed by reference, but if this fails, we currently do not NULL out req and rsp. Easily fixed]

    PATH 2:
        qla2x00_probe_one
            failing in qla2x00_request_irqs or qla2x00_alloc_queues
                probe_failed:
                    qla2x00_free_device(base_vha);
                        qla2x00_free_req_que(ha, req)
                        qla2x00_free_rsp_que(ha, rsp)

    PATH 3:
        qla2x00_probe_one:
            failing in qla2x00_mem_alloc or qla2x00_create_host
                probe_hw_failed:
                    qla2x00_free_req_que(ha, req)
                    qla2x00_free_rsp_que(ha, rsp)

PATH 1: This should currently work, but it doesn't because req and rsp are not set to NULL in qla2x00_mem_alloc. Easily remedied.

PATH 2: req and rsp aren't passed in at all to qla2x00_free_device but are derived from ha->req_q_map[0] and ha->rsp_q_map[0]. These are only set up if qla2x00_alloc_queues succeeds.

In qla2x00_free_queues, we are protected from crashing if these don't exist because req_qid_map and rsp_qid_map are only set on their allocation. We are guarded in this way:

    for (cnt = 0; cnt < ha->max_req_queues; cnt++) {
            if (!test_bit(cnt, ha->req_qid_map))
                    continue;

PATH 3: This works. We haven't freed req or rsp yet (or they were never allocated if qla2x00_mem_alloc failed), so we'll attempt to free them here.

To summarize, there are a few small changes to make this work correctly (and for some cleanup):

1) (For PATH 1) Set *rsp and *req to NULL in case of failure in qla2x00_mem_alloc so these are correctly set to NULL back in qla2x00_probe_one

2) After jumping to probe_failed: and calling qla2x00_free_device, explicitly set rsp and req to NULL so further calls with these pointers do not crash, i.e. the free queue calls in the probe_hw_failed section we fall through to.

3) Fix return code check in the call to qla2x00_alloc_queues. We currently drop the return code on the floor. The probe fails but the caller of the probe doesn't have an error code, so it attaches to pci. This can result in a crash on module shutdown.

4) Remove unnecessary NULL checks in qla2x00_free_req_que, qla2x00_free_rsp_que, and the egregious NULL checks before kfrees and vfrees in qla2x00_mem_free.

I tested this out running a scenario where the card breaks at various times during initialization. I made sure I forced every error exit path in qla2x00_probe_one.

Cc: <stable@vger.kernel.org> # v4.16
Fixes: 6a2cf8d3663e ("scsi: qla2xxx: Fix crashes in qla2x00_probe_one on probe failure")
Signed-off-by: Bill Kuzeja <william.kuzeja@stratus.com>
Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
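The general pattern behind points 1) and 2) above, clearing the caller's pointers whenever the memory behind them has been freed, might look like the following in plain userspace C. The names `struct queue`, `alloc_queues()` and `free_queue()` are stand-ins for illustration, not the driver's functions.

```c
#include <stdlib.h>

struct queue { int id; };

/*
 * Hypothetical allocator taking both pointers by reference.  On failure it
 * frees whatever it got and leaves *req and *rsp NULL, so the caller's error
 * path cannot free them a second time.  'fail_rsp' simulates an allocation
 * failure for the second queue.
 */
int alloc_queues(struct queue **req, struct queue **rsp, int fail_rsp)
{
	*req = malloc(sizeof(**req));
	*rsp = fail_rsp ? NULL : malloc(sizeof(**rsp));
	if (!*req || !*rsp) {
		free(*req);
		free(*rsp);
		*req = NULL;
		*rsp = NULL;
		return -1;
	}
	return 0;
}

/* Free a queue and clear the caller's pointer; safe to call repeatedly. */
void free_queue(struct queue **q)
{
	free(*q);	/* free(NULL) is a no-op, so no NULL check is needed */
	*q = NULL;
}
```

Because `free(NULL)` is defined to do nothing, the explicit NULL checks before freeing become unnecessary, which matches the cleanup in point 4) for `kfree()` and `vfree()`.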
2018-04-09 | scsi: scsi_dh: Don't look for NULL devices handlers by name | Johannes Thumshirn
Currently scsi_dh_lookup() doesn't check for NULL as a device name. This combined with nvme over dm-mpath results in the following messages emitted by device-mapper: device-mapper: multipath: Could not failover device 259:67: Handler scsi_dh_(null) error 14. Let scsi_dh_lookup() fail fast on NULL names. [mkp: typo fix] Cc: <stable@vger.kernel.org> # v4.16 Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
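As a rough illustration of "fail fast on NULL names", the lookup can reject a missing name before touching the handler list. The table and `handler_lookup()` below are hypothetical; treating an empty string the same as NULL is an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical handler table; the names are only examples. */
struct handler { const char *name; };

static const struct handler handlers[] = { { "alua" }, { "emc" }, { "rdac" } };

/* Return the handler called 'name', or NULL if there is none. */
const struct handler *handler_lookup(const char *name)
{
	size_t i;

	/* Fail fast: with no name there is nothing to match (or print). */
	if (!name || !*name)
		return NULL;

	for (i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++)
		if (strcmp(handlers[i].name, name) == 0)
			return &handlers[i];
	return NULL;
}
```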
2018-04-09 | scsi: core: remove redundant assignment to shost->use_blk_mq | Colin Ian King
The first assignment to shost->use_blk_mq is redundant as it is overwritten by the following statement. Remove this redundant code. Detected by CoverityScan, CID#1466993 ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2018-04-09 | afs: Implement @cell substitution handling | David Howells
Implement @cell substitution handling such that if @cell is seen as a name in a dynamic root mount, then the name of the root cell for that network namespace will be substituted for @cell during lookup.

The substitution of @cell for the current net namespace is set by writing the cell name to /proc/fs/afs/rootcell. The value can be obtained by reading the file.

For example:

    # mount -t afs none /kafs -o dyn
    # echo grand.central.org >/proc/fs/afs/rootcell
    # ls /kafs/@cell
    archive/ cvs/ doc/ local/ project/ service/ software/ user/ www/
    # cat /proc/fs/afs/rootcell
    grand.central.org

Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09 | afs: Implement @sys substitution handling | David Howells
Implement the AFS feature by which @sys at the end of a pathname component may be replaced by one of a list of values, typically naming the operating system. Up to 16 alternatives may be specified and these are tried in turn until one works. Each network namespace has[*] a separate independent list.

Upon creation of a new network namespace, the list of values is initialised[*] to a single OpenAFS-compatible string representing arch type plus "_linux26". For example, on x86_64, the sysname is "amd64_linux26".

[*] Or will, once network namespace support is finalised in kAFS.

The list may be set by:

    # for i in foo bar linux-x86_64; do echo $i; done >/proc/fs/afs/sysname

for which separate writes to the same fd are amalgamated and applied on close. The LF character may be used as a separator to specify multiple items in the same write() call.

The list may be cleared by:

    # echo >/proc/fs/afs/sysname

and read by:

    # cat /proc/fs/afs/sysname
    foo
    bar
    linux-x86_64

Signed-off-by: David Howells <dhowells@redhat.com>
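The "tried in turn until one works" behaviour can be sketched as a loop over candidate names. Everything here is hypothetical (`sys_substitute()`, the `exists` callback, and `demo_exists()` standing in for a real lookup); only the 16-entry limit and the example sysname come from the message above.

```c
#include <stdio.h>
#include <string.h>

#define MAX_SYSNAMES 16	/* the message above allows up to 16 alternatives */

/* Stand-in for a real lookup: pretend only this one name exists. */
int demo_exists(const char *name)
{
	return strcmp(name, "bin.amd64_linux26") == 0;
}

/*
 * If 'name' ends in "@sys", replace that suffix with each entry of 'subs'
 * in turn and return the index of the first candidate for which 'exists'
 * succeeds (leaving the candidate in 'out').  Returns -1 if 'name' has no
 * @sys suffix or no candidate works.
 */
int sys_substitute(const char *name, const char *const *subs, int nr_subs,
		   int (*exists)(const char *), char *out, size_t outlen)
{
	size_t len = strlen(name);
	int i;

	if (len < 4 || strcmp(name + len - 4, "@sys") != 0)
		return -1;
	if (nr_subs > MAX_SYSNAMES)
		nr_subs = MAX_SYSNAMES;

	for (i = 0; i < nr_subs; i++) {
		int n = snprintf(out, outlen, "%.*s%s",
				 (int)(len - 4), name, subs[i]);

		if (n < 0 || (size_t)n >= outlen)
			continue;	/* candidate would not fit in 'out' */
		if (exists(out))
			return i;
	}
	return -1;
}
```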
2018-04-09 | afs: Prospectively look up extra files when doing a single lookup | David Howells
When afs_lookup() is called, prospectively look up the next 50 uncached fids also from that same directory and cache the results, rather than just looking up the one file requested. This allows us to use the FS.InlineBulkStatus RPC op to increase efficiency by fetching up to 50 file statuses at a time. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09 | afs: Don't over-increment the cell usage count when pinning it | David Howells
AFS cells that are added or set as the workstation cell through /proc are pinned against removal by setting the AFS_CELL_FL_NO_GC flag on them and taking a ref. The ref should be only taken if the flag wasn't already set. Fix this by making it conditional. Without this an assertion failure will occur during module removal indicating that the refcount is too elevated. Signed-off-by: David Howells <dhowells@redhat.com>
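A minimal, non-atomic sketch of "take the ref only if the flag wasn't already set". The struct and names are hypothetical; kernel code would normally do the test and set as one atomic bit operation rather than the two steps shown here.

```c
#include <stdbool.h>

#define CELL_FL_NO_GC 0x1u	/* hypothetical flag bit: "pinned against removal" */

struct cell {
	unsigned int flags;
	int usage;		/* reference count */
};

/* Pin a cell; only the first pin takes the extra reference. */
void cell_pin(struct cell *cell)
{
	bool was_pinned = (cell->flags & CELL_FL_NO_GC) != 0;

	cell->flags |= CELL_FL_NO_GC;
	if (!was_pinned)
		cell->usage++;
}
```

Pinning the same cell twice therefore leaves the count one above its starting value, not two, which is what keeps the refcount balanced at module removal.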
2018-04-09 | afs: Fix checker warnings | David Howells
Fix warnings raised by checker, including:

 (*) Warnings raised by unequal comparison for the purposes of sorting, where the endianness doesn't matter:

    fs/afs/addr_list.c:246:21: warning: restricted __be16 degrades to integer
    fs/afs/addr_list.c:246:30: warning: restricted __be16 degrades to integer
    fs/afs/addr_list.c:248:21: warning: restricted __be32 degrades to integer
    fs/afs/addr_list.c:248:49: warning: restricted __be32 degrades to integer
    fs/afs/addr_list.c:283:21: warning: restricted __be16 degrades to integer
    fs/afs/addr_list.c:283:30: warning: restricted __be16 degrades to integer

 (*) afs_set_cb_interest() is not actually used and can be removed.

 (*) afs_cell_gc_delay() should be provided with a sysctl.

 (*) afs_cell_destroy() needs to use rcu_access_pointer() to read cell->vl_addrs.

 (*) afs_init_fs_cursor() should be static.

 (*) struct afs_vnode::permit_cache needs to be marked __rcu.

 (*) afs_server_rcu() needs to use rcu_access_pointer().

 (*) afs_destroy_server() should use rcu_access_pointer() on server->addresses as the server object is no longer accessible.

 (*) afs_find_server() casts __be16/__be32 values to int in order to directly compare them for the purpose of finding a match in a list, but it should also annotate the cast with __force to avoid checker warnings.

 (*) afs_check_permit() accesses vnode->permit_cache outside of the RCU readlock, though it doesn't then access the value; the extraneous access is deleted.

False positives:

 (*) Conditional locking around the code in xdr_decode_AFSFetchStatus. This can be dealt with in a separate patch.

    fs/afs/fsclient.c:148:9: warning: context imbalance in 'xdr_decode_AFSFetchStatus' - different lock contexts for basic block

 (*) Incorrect handling of seq-retry lock context balance:

    fs/afs/inode.c:455:38: warning: context imbalance in 'afs_getattr' - different lock contexts for basic block
    fs/afs/server.c:52:17: warning: context imbalance in 'afs_find_server' - different lock contexts for basic block
    fs/afs/server.c:128:17: warning: context imbalance in 'afs_find_server_by_uuid' - different lock contexts for basic block

Errors:

 (*) afs_lookup_cell_rcu() needs to break out of the seq-retry loop, not go round again if it successfully found the workstation cell.

 (*) Fix UUID decode in afs_deliver_cb_probe_uuid().

 (*) afs_cache_permit() has a missing rcu_read_unlock() before one of the jumps to the someone_else_changed_it label. Move the unlock to after the label.

 (*) afs_vl_get_addrs_u() is using ntohl() rather than htonl() when encoding to XDR.

 (*) afs_deliver_yfsvl_get_endpoints() is using htonl() rather than ntohl() when decoding from XDR.

Signed-off-by: David Howells <dhowells@redhat.com>
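The last two errors come down to which conversion belongs on which side of the wire: `htonl()` when encoding host values into XDR, `ntohl()` when decoding them. A small userspace sketch, assuming POSIX `<arpa/inet.h>` and the hypothetical helpers `xdr_put_u32()` and `xdr_get_u32()`:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Encode: host byte order -> network (big-endian) order, so htonl(). */
size_t xdr_put_u32(unsigned char *buf, uint32_t host_val)
{
	uint32_t wire = htonl(host_val);

	memcpy(buf, &wire, sizeof(wire));
	return sizeof(wire);
}

/* Decode: network (big-endian) order -> host byte order, so ntohl(). */
uint32_t xdr_get_u32(const unsigned char *buf)
{
	uint32_t wire;

	memcpy(&wire, buf, sizeof(wire));
	return ntohl(wire);
}
```

On any given machine the two functions perform the same byte swap (or none), so swapping them usually still runs correctly. Annotated types such as `__be32` let the checker flag the mix-up anyway.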
2018-04-09 | vfs: Remove the const from dir_context::actor | David Howells
Remove the const marking from the actor function pointer in the dir_context struct. The const prevents the structure from being used as part of a kmalloc'd object as it makes the compiler require that the actor member be set at object initialisation time (or not at all), incurring something like the following error if you try and set it later: fs/afs/dir.c:556:20: error: assignment of read-only member 'actor' Marking the member const like this adds very little in the way of sanity checking as the type checking system is likely to provide sufficient - and if not, the kernel is very likely to oops repeatably in this case. Fixes: ac6614b76478 ("[readdir] constify ->actor") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
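To see why a const member gets in the way of heap-allocated objects, consider this userspace sketch. `struct dir_ctx`, `struct lookup_cookie` and the actor signature are simplified stand-ins, not the kernel's definitions.

```c
#include <stdlib.h>

struct dir_ctx;
typedef int (*actor_fn)(struct dir_ctx *ctx, const char *name);

struct dir_ctx {
	/*
	 * Not const, so it can be assigned after allocation.  Declared as
	 * 'const actor_fn actor;', the assignment in cookie_alloc() would
	 * fail with "assignment of read-only member 'actor'".
	 */
	actor_fn actor;
	long pos;
};

/* A larger heap-allocated object that embeds the context as its first member. */
struct lookup_cookie {
	struct dir_ctx ctx;
	int nr_found;
};

/* Example actor: just count the entries it is shown. */
int count_actor(struct dir_ctx *ctx, const char *name)
{
	/* Valid because ctx is the first member of struct lookup_cookie. */
	struct lookup_cookie *cookie = (struct lookup_cookie *)ctx;

	(void)name;
	cookie->nr_found++;
	return 0;
}

struct lookup_cookie *cookie_alloc(void)
{
	struct lookup_cookie *cookie = calloc(1, sizeof(*cookie));

	if (cookie)
		cookie->ctx.actor = count_actor;	/* set after allocation */
	return cookie;
}
```

A stack object can still set the member in an initialiser, but a `calloc()`'d or `kmalloc()`'d one has no initialiser to put it in.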
2018-04-09 | Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds
Pull vfs namei updates from Al Viro:

 - make lookup_one_len() safe with parent locked only shared (incoming afs series wants that)

 - fix of getname_kernel() regression from 2015 (-stable fodder, that one).

* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  getname_kernel() needs to make sure that ->name != ->iname in long case
  make lookup_one_len() safe to use with directory locked shared
  new helper: __lookup_slow()
  merge common parts of lookup_one_len{,_unlocked} into common helper
2018-04-09 | Merge tag 'for-linus-4.17-ofs' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux | Linus Torvalds
Pull orangefs updates from Mike Marshall:
 "Fixes and cleanups:

   - Documentation cleanups

   - removal of unused code

   - make some structs static

   - implement Orangefs vm_operations fault callout

   - eliminate two single-use functions and put their cleaned up code in line.

   - replace a vmalloc/memset instance with vzalloc

   - fix a race condition bug in wait code"

* tag 'for-linus-4.17-ofs' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  Orangefs: documentation updates
  orangefs: document package install and xfstests procedure
  orangefs: remove unused code
  orangefs: make several *_operations structs static
  orangefs: implement vm_ops->fault
  orangefs: open code short single-use functions
  orangefs: replace vmalloc and memset with vzalloc
  orangefs: bug fix for a race condition when getting a slot
2018-04-09 | Merge tag 'pstore-v4.17-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux | Linus Torvalds
Pull pstore fix from Kees Cook:
 "Fix another compression Kconfig combination missed in testing (Tobias Regnery)"

* tag 'pstore-v4.17-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore: fix crypto dependencies without compression
2018-04-09 | selinux: fix missing dput() before selinuxfs unmount | Stephen Smalley
Commit 0619f0f5e36f ("selinux: wrap selinuxfs state") triggers a BUG when SELinux is runtime-disabled (i.e. systemd or equivalent disables SELinux before initial policy load via /sys/fs/selinux/disable based on /etc/selinux/config SELINUX=disabled). This does not manifest if SELinux is disabled via kernel command line argument or if SELinux is enabled (permissive or enforcing). Before: SELinux: Disabled at runtime. BUG: Dentry 000000006d77e5c7{i=17,n=null} still in use (1) [unmount of selinuxfs selinuxfs] After: SELinux: Disabled at runtime. Fixes: 0619f0f5e36f ("selinux: wrap selinuxfs state") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-09 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds
Pull kvm updates from Paolo Bonzini: "ARM: - VHE optimizations - EL2 address space randomization - speculative execution mitigations ("variant 3a", aka execution past invalid privilege register access) - bugfixes and cleanups PPC: - improvements for the radix page fault handler for HV KVM on POWER9 s390: - more kvm stat counters - virtio gpu plumbing - documentation - facilities improvements x86: - support for VMware magic I/O port and pseudo-PMCs - AMD pause loop exiting - support for AMD core performance extensions - support for synchronous register access - expose nVMX capabilities to userspace - support for Hyper-V signaling via eventfd - use Enlightened VMCS when running on Hyper-V - allow userspace to disable MWAIT/HLT/PAUSE vmexits - usual roundup of optimizations and nested virtualization bugfixes Generic: - API selftest infrastructure (though the only tests are for x86 as of now)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (174 commits) kvm: x86: fix a prototype warning kvm: selftests: add sync_regs_test kvm: selftests: add API testing infrastructure kvm: x86: fix a compile warning KVM: X86: Add Force Emulation Prefix for "emulate the next instruction" KVM: X86: Introduce handle_ud() KVM: vmx: unify adjacent #ifdefs x86: kvm: hide the unused 'cpu' variable KVM: VMX: remove bogus WARN_ON in handle_ept_misconfig Revert "KVM: X86: Fix SMRAM accessing even if VM is shutdown" kvm: Add emulation for movups/movupd KVM: VMX: raise internal error for exception during invalid protected mode state KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending KVM: x86: Fix misleading comments on handling pending exceptions KVM: x86: Rename interrupt.pending to interrupt.injected KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt x86/kvm: use Enlightened VMCS when running on Hyper-V x86/hyper-v: detect nested features 
x86/hyper-v: define struct hv_enlightened_vmcs and clean field bits ...
2018-04-09 | Merge branch 'for-4.17/dax' into libnvdimm-for-next | Dan Williams
2018-04-09 | Merge branch 'for-4.17/libnvdimm' into libnvdimm-for-next | Dan Williams
2018-04-09 | Fix subtle macro variable shadowing in min_not_zero() | Linus Torvalds
Commit 3c8ba0d61d04 ("kernel.h: Retain constant expression output for max()/min()") rewrote our min/max macros to be very clever, but in the meantime resurrected a variable name shadow issue that we had previously fixed in commit 589a9785ee3a ("min/max: remove sparse warnings when they're nested").

That commit talks about the sparse warnings that this shadowing causes, which we ignored as just a minor annoyance. But it turns out that the sparse warning is the least of our problems. We actually have a real bug due to the shadowing through the interaction with "min_not_zero()", which ends up doing min(__x, __y) internally, and then the new declaration of "__x" and "__y" as new variables in __cmp_once() results in a complete mess of an expression, and "min_not_zero()" doesn't work at all.

For some odd reason, this only ever caused (reported) problems on s390, even though it is a generic issue and most of the (obviously successful) testing of the problematic commit had happened on other architectures.

Quoting Sebastian Ott:
 "What happened is that the bio build by the partition detection code was attempted to be split by the block layer because the block queue had a max_sector setting of 0. blk_queue_max_hw_sectors uses min_not_zero."

So re-introduce the use of __UNIQUE_ID() to make sure that the min/max macros do not have these kinds of clashes.

[ That said, __UNIQUE_ID() itself has several issues that make it less than wonderful.

  In particular, the "uniqueness" has a fallback on the line number, which means that it's not actually unique in more complex cases if you don't build with gcc or clang (which have working unique counters that aren't tied to line numbers).

  That historical broken fallback also means that we have that pointless "prefix" argument that doesn't actually make much sense _except_ for the known-broken case. Oh well. ]

Fixes: 3c8ba0d61d04 ("kernel.h: Retain constant expression output for max()/min()")
Reported-and-tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
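The clash and its cure can be reproduced outside the kernel. The sketch below is not the kernel's macro set: `UNIQ()`, `MIN()` and `MIN_NOT_ZERO()` are simplified stand-ins that rely on the GNU C extensions `__typeof__`, statement expressions and `__COUNTER__` (available in gcc and clang). If `MIN()` declared its temporaries as `__x` and `__y`, the call `MIN(__x, __y)` would expand to `__typeof__(__x) __x = (__x);`, initialising the new variable from itself.

```c
/* Paste a prefix onto a fresh counter value to get a name nobody else uses. */
#define CONCAT_(a, b)	a##b
#define CONCAT(a, b)	CONCAT_(a, b)
#define UNIQ(prefix)	CONCAT(prefix, __COUNTER__)

/* Evaluate x and y once each, into temporaries whose names are passed in. */
#define MIN_ONCE_(x, y, ux, uy) ({		\
	__typeof__(x) ux = (x);			\
	__typeof__(y) uy = (y);			\
	ux < uy ? ux : uy; })

/*
 * UNIQ() is expanded once per argument before substitution, so every use
 * of 'ux' (or 'uy') inside MIN_ONCE_ refers to the same generated name.
 */
#define MIN(x, y)	MIN_ONCE_(x, y, UNIQ(min_x_), UNIQ(min_y_))

/* Like the kernel macro, this deliberately uses __x/__y for its own copies. */
#define MIN_NOT_ZERO(x, y) ({						\
	__typeof__(x) __x = (x);					\
	__typeof__(y) __y = (y);					\
	__x == 0 ? __y : ((__y == 0) ? __x : MIN(__x, __y)); })
```

Since the inner temporaries get generated names, the outer `__x` and `__y` are no longer shadowed when they are passed down to `MIN()`.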
2018-04-09 | xfs: non-scrub - remove unused function parameters | Eric Sandeen
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-09 | xfs: remove filestream item xfs_inode reference | Christoph Hellwig
The filestreams allocator stores an xfs_fstrm_item structure in the MRU to cache inode number to agno mappings for a particular length of time. Each xfs_fstrm_item contains the internal MRU structure, an inode pointer and agno value. The inode pointer stored in the xfs_fstrm_item is not referenced, however, which means the inode itself can be removed and reclaimed before the MRU item is freed. If this occurs, xfs_fstrm_free_func() can access freed or unrelated memory through xfs_fstrm_item->ip and crash. The obvious solution is to grab an inode reference for xfs_fstrm_item. The filestream mechanism only actually uses the inode pointer as a means to access the xfs_mount, however. Rather than add unnecessary complexity, simplify the implementation to store an xfs_mount pointer in struct xfs_mru_cache, and pass it to the free callback. This also requires updates to the tracepoint class to provide the associated data via parameters rather than the inode and a minor hack to peek at the MRU key to establish the inode number at free time. Based on debugging work and an earlier patch from Brian Foster, who also wrote most of this changelog. Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-09 | x86/espfix: Document use of _PAGE_GLOBAL | Dave Hansen
The "normal" kernel page table creation mechanisms using PAGE_KERNEL_* page protections will never set _PAGE_GLOBAL with PTI. The few places in the kernel that always want _PAGE_GLOBAL must avoid using PAGE_KERNEL_*. Document that we want it here and its use is not accidental. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205507.BCF4D4F0@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-09 | x86/mm: Introduce "default" kernel PTE mask | Dave Hansen
The __PAGE_KERNEL_* page permissions are "raw". They contain bits that may or may not be supported on the current processor. They need to be filtered by a mask (currently __supported_pte_mask) to turn them into a value that we can actually set in a PTE. These __PAGE_KERNEL_* values all contain _PAGE_GLOBAL. But, with PTI, we want to be able to support _PAGE_GLOBAL (have the bit set in __supported_pte_mask) but not have it appear in any of these masks by default. This patch creates a new mask, __default_kernel_pte_mask, and applies it when creating all of the PAGE_KERNEL_* masks. This makes PAGE_KERNEL_* safe to use anywhere (they only contain supported bits). It also ensures that PAGE_KERNEL_* contains _PAGE_GLOBAL on PTI=n kernels but clears _PAGE_GLOBAL when PTI=y. We also make __default_kernel_pte_mask a non-GPL exported symbol because there are plenty of driver-available interfaces that take PAGE_KERNEL_* permissions. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205506.030DB6B6@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
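The two-mask idea can be modelled with plain integers. This is a sketch under assumptions: the bit positions follow the usual x86 PTE layout, but `RAW_PAGE_KERNEL`, `default_kernel_mask()` and `page_kernel()` are illustrative names rather than the kernel's macros, and the PTI decision is passed in as a parameter.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_PRESENT	(1ull << 0)
#define PAGE_RW		(1ull << 1)
#define PAGE_GLOBAL	(1ull << 8)
#define PAGE_NX		(1ull << 63)

/* "Raw" permissions: may contain bits the CPU or configuration does not want. */
#define RAW_PAGE_KERNEL	(PAGE_PRESENT | PAGE_RW | PAGE_GLOBAL | PAGE_NX)

/*
 * The default mask starts from what the processor supports and, with PTI
 * enabled, additionally drops PAGE_GLOBAL.  The supported mask itself
 * still contains PAGE_GLOBAL for the few users that ask for it explicitly.
 */
uint64_t default_kernel_mask(uint64_t supported_mask, bool pti)
{
	return pti ? (supported_mask & ~PAGE_GLOBAL) : supported_mask;
}

/* The filtered value that is safe to put in a PTE by default. */
uint64_t page_kernel(uint64_t supported_mask, bool pti)
{
	return RAW_PAGE_KERNEL & default_kernel_mask(supported_mask, pti);
}
```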
2018-04-09 | x86/mm: Undo double _PAGE_PSE clearing | Dave Hansen
When clearing _PAGE_PRESENT on a huge page, we need to be careful to also clear _PAGE_PSE, otherwise it might still get confused for a valid large page table entry. We do that near the spot where we *set* _PAGE_PSE. That's fine, but it's unnecessary. pgprot_large_2_4k() already did it. BTW, I also noticed that pgprot_large_2_4k() and pgprot_4k_2_large() are not symmetric. pgprot_large_2_4k() clears _PAGE_PSE (because it is aliased to _PAGE_PAT) but pgprot_4k_2_large() does not put _PAGE_PSE back. Bummer. Also, add some comments and change "promote" to "move". "Promote" seems an odd word to move when we are logically moving a bit to a lower bit position. Also add an extra line return to make it clear to which line the comment applies. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205504.9B0F44A9@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-09 | x86/mm: Factor out pageattr _PAGE_GLOBAL setting | Dave Hansen
The pageattr code has a pattern repeated where it sets _PAGE_GLOBAL for present PTEs but clears it for non-present PTEs. The intention is to keep _PAGE_GLOBAL from getting confused with _PAGE_PROTNONE since _PAGE_GLOBAL is for present PTEs and _PAGE_PROTNONE is for non-present.

But, this pattern makes no sense. Effectively, it says, if you use the pageattr code, always set _PAGE_GLOBAL when _PAGE_PRESENT. canon_pgprot() will clear it if unsupported (because it masks the value with __supported_pte_mask) but we *always* set it. Even if canon_pgprot() did not filter _PAGE_GLOBAL, it would be OK. _PAGE_GLOBAL is ignored when CR4.PGE=0 by the hardware.

This unconditional setting of _PAGE_GLOBAL is a problem when we have PTI and non-PTI and we want some areas to have _PAGE_GLOBAL and some not.

This updated version of the code says:

 1. Clear _PAGE_GLOBAL when !_PAGE_PRESENT
 2. Never set _PAGE_GLOBAL implicitly
 3. Allow _PAGE_GLOBAL to be in cpa.set_mask
 4. Allow _PAGE_GLOBAL to be inherited from previous PTE

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180406205502.86E199DA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
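The four rules can be expressed as one small function over protection bits. This is a sketch of the stated policy only: `cpa_new_prot()` and its parameters are hypothetical, and the bit positions are the usual x86 ones.

```c
#include <stdint.h>

#define PAGE_PRESENT	(1ull << 0)
#define PAGE_RW		(1ull << 1)
#define PAGE_GLOBAL	(1ull << 8)

/* Compute the new protection bits from the old ones plus set/clear masks. */
uint64_t cpa_new_prot(uint64_t old, uint64_t set_mask, uint64_t clr_mask)
{
	/*
	 * Rules 3 and 4: PAGE_GLOBAL may arrive via set_mask or simply be
	 * carried over from the old value.
	 */
	uint64_t prot = (old & ~clr_mask) | set_mask;

	/* Rule 1: a non-present entry must not keep PAGE_GLOBAL. */
	if (!(prot & PAGE_PRESENT))
		prot &= ~PAGE_GLOBAL;

	/* Rule 2: nothing here adds PAGE_GLOBAL on its own. */
	return prot;
}
```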
2018-04-09Merge branch 'for-linus-sa1100' of git://git.armlinux.org.uk/~rmk/linux-armLinus Torvalds
Pull ARM SA1100 updates from Russell King: "We have support for arbitrary MMIO registers providing platform GPIOs, which allows us to abstract some of the SA11x0 CF support. This set of updates makes that change" * 'for-linus-sa1100' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: sa1100/simpad: switch simpad CF to use gpiod APIs ARM: sa1100/shannon: convert to generic CF sockets ARM: sa1100/nanoengine: convert to generic CF sockets ARM: sa1100/h3xxx: switch h3xxx PCMCIA to use gpiod APIs ARM: sa1100/cerf: convert to generic CF sockets ARM: sa1100/assabet: convert to generic CF sockets ARM: sa1100: provide infrastructure to support generic CF sockets pcmcia: sa1100: provide generic CF support
2018-04-09Merge branch 'linus' into x86/pti to pick up upstream changesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-09x86/entry/64: Drop idtentry's manual stack switch for user entriesAndy Lutomirski
For non-paranoid entries, idtentry knows how to switch from the kernel stack to the user stack, as does error_entry. This results in pointless duplication and code bloat. Make idtentry stop thinking about stacks for non-paranoid entries. This reduces text size by 5377 bytes. This goes back to the following commit: 7f2590a110b8 ("x86/entry/64: Use a per-CPU trampoline stack for IDT entries") Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/90aab80c1f906e70742eaa4512e3c9b5e62d59d4.1522794757.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-09x86/olpc: Fix inconsistent MFD_CS5535 configurationArnd Bergmann
This Kconfig warning appeared after a fix to the Kconfig validation. The GPIO_CS5535 driver depends on the MFD_CS5535 driver, but the former is selected in places where the latter is not: WARNING: unmet direct dependencies detected for GPIO_CS5535 Depends on [m]: GPIOLIB [=y] && (X86 [=y] || MIPS || COMPILE_TEST [=y]) && MFD_CS5535 [=m] Selected by [y]: - OLPC_XO1_SCI [=y] && X86_32 [=y] && OLPC [=y] && OLPC_XO1_PM [=y] && INPUT [=y]=y The warning does seem appropriate, since the GPIO_CS5535 driver won't work unless MFD_CS5535 is also present. However, there is no link time dependency between the two, so this caused no problems during randconfig testing before. This changes the 'select GPIO_CS5535' to 'depends on GPIO_CS5535' to avoid the issue, at the expense of making it harder to configure the driver (one now has to select the dependencies first). The 'select MFD_CORE' part is completely redundant, since we already depend on MFD_CS5535 here, so I'm removing that as well. Ideally, the private symbols exported by that cs5535 gpio driver would just be converted to gpiolib interfaces so we could completely avoid this dependency. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kbuild@vger.kernel.org Fixes: f622f8279581 ("kconfig: warn unmet direct dependency of tristate symbols selected by y") Link: http://lkml.kernel.org/r/20180404124539.3817101-1-arnd@arndb.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-09swiotlb: Use dma_direct_supported() for swiotlb_opsChristoph Hellwig
swiotlb_alloc() calls dma_direct_alloc(), which can satisfy lower than 32-bit DMA mask requests using GFP_DMA if the architecture supports it. Various x86 drivers rely on that, so we need to support that. At the same time the whole kernel expects a 32-bit DMA mask to just work, so the other magic in swiotlb_dma_supported() isn't actually needed either. Reported-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: iommu@lists.linux-foundation.org Fixes: 6e4bf5867783 ("x86/dma: Use generic swiotlb_ops") Link: http://lkml.kernel.org/r/20180409091517.6619-2-hch@lst.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-09Merge branch 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-armLinus Torvalds
Pull ARM updates from Russell King: "A number of core ARM changes: - Refactoring linker script by Nicolas Pitre - Enable source fortification - Add support for Cortex R8" * 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: decompressor: fix warning introduced in fortify patch ARM: 8751/1: Add support for Cortex-R8 processor ARM: 8749/1: Kconfig: Add ARCH_HAS_FORTIFY_SOURCE ARM: simplify and fix linker script for TCM ARM: linker script: factor out TCM bits ARM: linker script: factor out vectors and stubs ARM: linker script: factor out unwinding table sections ARM: linker script: factor out stuff for the .text section ARM: linker script: factor out stuff for the DISCARD section ARM: linker script: factor out some common definitions between XIP and non-XIP
2018-04-09perf/core: Fix use-after-free in uprobe_perf_close()Prashant Bhole
A use-after-free bug was caught by KASAN while running usdt related code (BCC project. bcc/tests/python/test_usdt2.py): ================================================================== BUG: KASAN: use-after-free in uprobe_perf_close+0x222/0x3b0 Read of size 4 at addr ffff880384f9b4a4 by task test_usdt2.py/870 CPU: 4 PID: 870 Comm: test_usdt2.py Tainted: G W 4.16.0-next-20180409 #215 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0xc7/0x15b ? show_regs_print_info+0x5/0x5 ? printk+0x9c/0xc3 ? kmsg_dump_rewind_nolock+0x6e/0x6e ? uprobe_perf_close+0x222/0x3b0 print_address_description+0x83/0x3a0 ? uprobe_perf_close+0x222/0x3b0 kasan_report+0x1dd/0x460 ? uprobe_perf_close+0x222/0x3b0 uprobe_perf_close+0x222/0x3b0 ? probes_open+0x180/0x180 ? free_filters_list+0x290/0x290 trace_uprobe_register+0x1bb/0x500 ? perf_event_attach_bpf_prog+0x310/0x310 ? probe_event_disable+0x4e0/0x4e0 perf_uprobe_destroy+0x63/0xd0 _free_event+0x2bc/0xbd0 ? lockdep_rcu_suspicious+0x100/0x100 ? ring_buffer_attach+0x550/0x550 ? kvm_sched_clock_read+0x1a/0x30 ? perf_event_release_kernel+0x3e4/0xc00 ? __mutex_unlock_slowpath+0x12e/0x540 ? wait_for_completion+0x430/0x430 ? lock_downgrade+0x3c0/0x3c0 ? lock_release+0x980/0x980 ? do_raw_spin_trylock+0x118/0x150 ? do_raw_spin_unlock+0x121/0x210 ? do_raw_spin_trylock+0x150/0x150 perf_event_release_kernel+0x5d4/0xc00 ? put_event+0x30/0x30 ? fsnotify+0xd2d/0xea0 ? sched_clock_cpu+0x18/0x1a0 ? __fsnotify_update_child_dentry_flags.part.0+0x1b0/0x1b0 ? pvclock_clocksource_read+0x152/0x2b0 ? pvclock_read_flags+0x80/0x80 ? kvm_sched_clock_read+0x1a/0x30 ? sched_clock_cpu+0x18/0x1a0 ? pvclock_clocksource_read+0x152/0x2b0 ? locks_remove_file+0xec/0x470 ? pvclock_read_flags+0x80/0x80 ? fcntl_setlk+0x880/0x880 ? ima_file_free+0x8d/0x390 ? lockdep_rcu_suspicious+0x100/0x100 ? ima_file_check+0x110/0x110 ? fsnotify+0xea0/0xea0 ? kvm_sched_clock_read+0x1a/0x30 ? 
rcu_note_context_switch+0x600/0x600 perf_release+0x21/0x40 __fput+0x264/0x620 ? fput+0xf0/0xf0 ? do_raw_spin_unlock+0x121/0x210 ? do_raw_spin_trylock+0x150/0x150 ? SyS_fchdir+0x100/0x100 ? fsnotify+0xea0/0xea0 task_work_run+0x14b/0x1e0 ? task_work_cancel+0x1c0/0x1c0 ? copy_fd_bitmaps+0x150/0x150 ? vfs_read+0xe5/0x260 exit_to_usermode_loop+0x17b/0x1b0 ? trace_event_raw_event_sys_exit+0x1a0/0x1a0 do_syscall_64+0x3f6/0x490 ? syscall_return_slowpath+0x2c0/0x2c0 ? lockdep_sys_exit+0x1f/0xaa ? syscall_return_slowpath+0x1a3/0x2c0 ? lockdep_sys_exit+0x1f/0xaa ? prepare_exit_to_usermode+0x11c/0x1e0 ? enter_from_user_mode+0x30/0x30 random: crng init done ? __put_user_4+0x1c/0x30 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x7f41d95f9340 RSP: 002b:00007fffe71e4268 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 000000000000000d RCX: 00007f41d95f9340 RDX: 0000000000000000 RSI: 0000000000002401 RDI: 000000000000000d RBP: 0000000000000000 R08: 00007f41ca8ff700 R09: 00007f41d996dd1f R10: 00007fffe71e41e0 R11: 0000000000000246 R12: 00007fffe71e4330 R13: 0000000000000000 R14: fffffffffffffffc R15: 00007fffe71e4290 Allocated by task 870: kasan_kmalloc+0xa0/0xd0 kmem_cache_alloc_node+0x11a/0x430 copy_process.part.19+0x11a0/0x41c0 _do_fork+0x1be/0xa20 do_syscall_64+0x198/0x490 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Freed by task 0: __kasan_slab_free+0x12e/0x180 kmem_cache_free+0x102/0x4d0 free_task+0xfe/0x160 __put_task_struct+0x189/0x290 delayed_put_task_struct+0x119/0x250 rcu_process_callbacks+0xa6c/0x1b60 __do_softirq+0x238/0x7ae The buggy address belongs to the object at ffff880384f9b480 which belongs to the cache task_struct of size 12928 It occurs because task_struct is freed before perf_event which refers to the task and task flags are checked while teardown of the event. perf_event_alloc() assigns task_struct to hw.target of perf_event, but there is no reference counting for it. 
As a fix, call get_task_struct() in perf_event_alloc() at the above-mentioned assignment and put_task_struct() in _free_event(). Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 63b6da39bb38e8f1a1ef3180d32a39d6 ("perf: Fix perf_event_exit_task() race") Link: http://lkml.kernel.org/r/20180409100346.6416-1-bhole_prashant_q7@lab.ntt.co.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
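The get/put pairing described in this fix follows the usual reference-counting pattern: whoever stores a pointer takes a reference, and releases it only after the last access. A minimal user-space sketch, with hypothetical struct task and struct event types standing in for task_struct and perf_event, and a plain (non-atomic) counter instead of the kernel's refcounting:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for task_struct and perf_event. */
struct task {
    int usage;   /* reference count; not atomic in this sketch */
    int flags;
};

struct event {
    struct task *target;
};

static struct task *task_create(void)
{
    struct task *t = calloc(1, sizeof(*t));

    if (t)
        t->usage = 1;            /* the creator's own reference */
    return t;
}

static struct task *task_get(struct task *t)
{
    assert(t->usage > 0);
    t->usage++;
    return t;
}

/* Drop one reference; returns 1 if this call freed the task. */
static int task_put(struct task *t)
{
    assert(t->usage > 0);
    if (--t->usage == 0) {
        free(t);
        return 1;
    }
    return 0;
}

/* Storing the pointer takes a reference (the get_task_struct() step). */
static struct event *event_alloc(struct task *t)
{
    struct event *e = malloc(sizeof(*e));

    if (!e)
        return NULL;
    e->target = task_get(t);
    return e;
}

/* Teardown may still read the task; the reference is dropped last. */
static int event_free(struct event *e)
{
    int flags = e->target->flags;   /* safe: the event holds a reference */

    task_put(e->target);            /* the put_task_struct() step */
    free(e);
    return flags;
}
```

In this sketch the task can drop its own reference before the event is destroyed and the flags read in event_free() is still valid, which is the ordering the KASAN report shows going wrong.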
2018-04-09Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu Pull m68knommu update from Greg Ungerer: "Only a single fix to set the DMA masks in the ColdFire FEC platform data structure. This stops the warning from dma-mapping.h at boot time" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu: m68k: set dma and coherent masks for platform FEC ethernets
2018-04-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha Pull alpha updates from Matt Turner: "A few small changes for alpha" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha: alpha: io: reorder barriers to guarantee writeX() and iowriteX() ordering alpha: Implement CPU vulnerabilities sysfs functions. alpha: rtc: stop validating rtc_time in .read_time alpha: rtc: remove unused set_mmss ops
2018-04-09libnvdimm, of_pmem: workaround OF_NUMA=n build errorDan Williams
Stephen reports that an x86 allmodconfig build fails to build the of_pmem driver due to a missing definition of of_node_to_nid(). That helper is currently only exported in the OF_NUMA=y case. In other cases, ppc and sparc, it is a weak symbol, and outside of those platforms it is a static inline. Until an OF_NUMA=n configuration can reliably support usage of of_node_to_nid() in modules across architectures, mark this driver as 'bool' instead of 'tristate'. Cc: Rob Herring <robh@kernel.org> Cc: Oliver O'Halloran <oohall@gmail.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-04-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: - Improvements for the spectre defense: * The spectre related code is consolidated to a single file nospec-branch.c * Automatic enable/disable for the spectre v2 defenses (expoline vs. nobp) * Syslog messages for spectre v2 are added * Enable CONFIG_GENERIC_CPU_VULNERABILITIES and define the attribute functions for spectre v1 and v2 - Add helper macros for assembler alternatives and use them to shorten the code in entry.S. - Add support for persistent configuration data via the SCLP Store Data interface. The H/W interface requires a page table that uses 4K pages only, the code to setup such an address space is added as well. - Enable virtio GPU emulation in QEMU. To do this the depends statements for a few common Kconfig options are modified. - Add support for format-3 channel path descriptors and add a binary sysfs interface to export the associated utility strings. - Add a sysfs attribute to control the IFCC handling in case of constant channel errors. - The vfio-ccw changes from Cornelia. - Bug fixes and cleanups.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (40 commits) s390/kvm: improve stack frame constants in entry.S s390/lpp: use assembler alternatives for the LPP instruction s390/entry.S: use assembler alternatives s390: add assembler macros for CPU alternatives s390: add sysfs attributes for spectre s390: report spectre mitigation via syslog s390: add automatic detection of the spectre defense s390: move nobp parameter functions to nospec-branch.c s390/cio: add util_string sysfs attribute s390/chsc: query utility strings via fmt3 channel path descriptor s390/cio: rename struct channel_path_desc s390/cio: fix unbind of io_subchannel_driver s390/qdio: split up CCQ handling for EQBS / SQBS s390/qdio: don't retry EQBS after CCQ 96 s390/qdio: restrict buffer merging to eligible devices s390/qdio: don't merge ERROR output buffers s390/qdio: simplify math in get_*_buffer_frontier() s390/decompressor: trim uncompressed image head during the build s390/crypto: Fix kernel crash on aes_s390 module remove. s390/defkeymap: fix global init to zero ...
2018-04-09ALSA: pcm: Remove WARN_ON() at snd_pcm_hw_params() errorTakashi Iwai
snd_pcm_hw_params() (more exactly snd_pcm_hw_params_choose()) contains a check of the return error from snd_pcm_hw_param_first() and _last() with snd_BUG_ON() -- i.e. it may trigger WARN_ON() depending on the kconfig. This was a valid check in the past, as these functions shouldn't return any error if the parameters have been already refined via snd_pcm_hw_refine() beforehand. However, the recent rewrite introduced a kmalloc() in snd_pcm_hw_refine() for removing VLA, and this brought a possibility to trigger an error. As a result, syzbot caught lots of superfluous kernel WARN_ON() and panicked via fault injection. As the WARN_ON() is no longer valid with the introduction of kmalloc(), let's drop the snd_BUG_ON() check, in order to make the world a peaceful place again. Reported-by: syzbot+803e0047ac3a3096bb4f@syzkaller.appspotmail.com Fixes: 5730f9f744cf ("ALSA: pcm: Remove VLA usage") Signed-off-by: Takashi Iwai <tiwai@suse.de>
2018-04-09vhost-net: set packet weight of tx polling to 2 * vq sizehaibinzhang(张海斌)
handle_tx() can delay rx for tens or even hundreds of milliseconds when tx is busy polling small UDP packets (e.g. a 1-byte UDP payload), because VHOST_NET_WEIGHT takes into account only the bytes sent, not the number of packets.

Ping latencies shown below were measured between two virtual machines using netperf (UDP_STREAM, len=1), while another machine pinged the client:

vq size=256
Packet-Weight   Ping-Latencies (millisecond)
                   min      avg      max
Origin           3.319   18.489   57.303
64               1.643    2.021    2.552
128              1.825    2.600    3.224
256              1.997    2.710    4.295
512              1.860    3.171    4.631
1024             2.002    4.173    9.056
2048             2.257    5.650    9.688
4096             2.093    8.508   15.943

vq size=512
Packet-Weight   Ping-Latencies (millisecond)
                   min      avg      max
Origin           6.537   29.177   66.245
64               2.798    3.614    4.403
128              2.861    3.820    4.775
256              3.008    4.018    4.807
512              3.254    4.523    5.824
1024             3.079    5.335    7.747
2048             3.944    8.201   12.762
4096             4.158   11.057   19.985

Seems pretty consistent, a small dip at 2 VQ sizes. Ring size is a hint from device about a burst size it can tolerate. Based on benchmarks, set the weight to 2 * vq size.

To evaluate this change, further tests were run using netperf (RR, TX) between two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was tweaked through qemu. The results below show no obvious change.

vq size=256 TCP_RR                vq size=512 TCP_RR
size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
   1/  1/  -7%/  -2%                 1/  1/   0%/  -2%
   1/  4/  +1%/   0%                 1/  4/  +1%/   0%
   1/  8/  +1%/  -2%                 1/  8/   0%/  +1%
  64/  1/  -6%/   0%                64/  1/  +7%/  +3%
  64/  4/   0%/  +2%                64/  4/  -1%/  +1%
  64/  8/   0%/   0%                64/  8/  -1%/  -2%
 256/  1/  -3%/  -4%               256/  1/  -4%/  -2%
 256/  4/  +3%/  +4%               256/  4/  +1%/  +2%
 256/  8/  +2%/   0%               256/  8/  +1%/  -1%

vq size=256 UDP_RR                vq size=512 UDP_RR
size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
   1/  1/  -5%/  +1%                 1/  1/  -3%/  -2%
   1/  4/  +4%/  +1%                 1/  4/  -2%/  +2%
   1/  8/  -1%/  -1%                 1/  8/  -1%/   0%
  64/  1/  -2%/  -3%                64/  1/  +1%/  +1%
  64/  4/  -5%/  -1%                64/  4/  +2%/   0%
  64/  8/   0%/  -1%                64/  8/  -2%/  +1%
 256/  1/  +7%/  +1%               256/  1/  -7%/   0%
 256/  4/  +1%/  +1%               256/  4/  -3%/  -4%
 256/  8/  +2%/  +2%               256/  8/  +1%/  +1%

vq size=256 TCP_STREAM            vq size=512 TCP_STREAM
size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
  64/  1/   0%/  -3%                64/  1/   0%/   0%
  64/  4/  +3%/  -1%                64/  4/  -2%/  +4%
  64/  8/  +9%/  -4%                64/  8/  -1%/  +2%
 256/  1/  +1%/  -4%               256/  1/  +1%/  +1%
 256/  4/  -1%/  -1%               256/  4/  -3%/   0%
 256/  8/  +7%/  +5%               256/  8/  -3%/   0%
 512/  1/  +1%/   0%               512/  1/  -1%/  -1%
 512/  4/  +1%/  -1%               512/  4/   0%/   0%
 512/  8/  +7%/  -5%               512/  8/  +6%/  -1%
1024/  1/   0%/  -1%              1024/  1/   0%/  +1%
1024/  4/  +3%/   0%              1024/  4/  +1%/   0%
1024/  8/  +8%/  +5%              1024/  8/  -1%/   0%
2048/  1/  +2%/  +2%              2048/  1/  -1%/   0%
2048/  4/  +1%/   0%              2048/  4/   0%/  -1%
2048/  8/  -2%/   0%              2048/  8/   5%/  -1%
4096/  1/  -2%/   0%              4096/  1/  -2%/   0%
4096/  4/  +2%/   0%              4096/  4/   0%/   0%
4096/  8/  +9%/  -2%              4096/  8/  -5%/  -1%

Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com> Signed-off-by: Lidong Chen <lidongchen@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
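The effect of adding a packet budget next to the byte budget can be sketched in user space. The byte budget value 0x80000, the PKT_WEIGHT() macro name and the tx_pass() helper are assumptions for illustration only; the sketch models one pass of a transmit loop, not the vhost-net code itself.

```c
#include <assert.h>
#include <stddef.h>

#define NET_WEIGHT           0x80000            /* byte budget per pass (assumed value) */
#define PKT_WEIGHT(vq_size)  ((vq_size) * 2)    /* packet budget: 2 * vq size */

/*
 * Simulate one tx pass over `pending` packets of `pkt_len` bytes each.
 * Returns how many packets are sent before the pass yields, either
 * because the byte budget or the packet budget is exhausted.
 */
static size_t tx_pass(size_t pending, size_t pkt_len, size_t vq_size)
{
    size_t total_len = 0;
    size_t sent = 0;

    assert(pkt_len > 0);

    while (sent < pending) {
        total_len += pkt_len;
        sent++;
        if (total_len >= NET_WEIGHT || sent >= PKT_WEIGHT(vq_size))
            break;   /* yield so that rx gets a turn */
    }
    return sent;
}
```

With 1-byte packets and only the byte budget, a pass would run for 0x80000 packets before yielding; with the packet budget and vq size 256 it yields after 512, which matches the latency improvement reported in the tables above in spirit.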
2018-04-09net: thunderx: rework mac addresses list to u64 arrayVadim Lomovtsev
It is too expensive to pass u64 values via a linked list; instead, allocate an array for them, sized by the overall number of MAC addresses from the netdev. This removes multiple kmalloc() calls, avoids memory fragmentation, and allows a single NULL check on the kmalloc() return value in order to prevent a potential NULL pointer dereference. Addresses-Coverity-ID: 1467429 ("Dereference null return value") Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
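The idea of replacing a per-address linked list with one array allocation can be sketched as follows. The struct mac_list layout and the helper names are hypothetical, and malloc() stands in for kmalloc(); the driver's real structure may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical container: a header followed by a flexible array of u64. */
struct mac_list {
    size_t count;
    size_t capacity;
    uint64_t mac[];
};

/* One allocation for the whole list, so a single NULL check covers it. */
static struct mac_list *mac_list_alloc(size_t capacity)
{
    struct mac_list *l = malloc(sizeof(*l) + capacity * sizeof(l->mac[0]));

    if (!l)
        return NULL;
    l->count = 0;
    l->capacity = capacity;
    return l;
}

/* Pack a 6-byte MAC address into the low 48 bits, first byte most significant. */
static uint64_t mac_to_u64(const uint8_t addr[6])
{
    uint64_t v = 0;

    for (int i = 0; i < 6; i++)
        v = (v << 8) | addr[i];
    return v;
}

/* Append one address; returns 0 on success, -1 if the array is full. */
static int mac_list_add(struct mac_list *l, const uint8_t addr[6])
{
    assert(l != NULL);
    if (l->count == l->capacity)
        return -1;
    l->mac[l->count++] = mac_to_u64(addr);
    return 0;
}
```

Compared with a list node per address, this needs one allocation and one free, and there is no per-node allocation whose failure could be overlooked.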
2018-04-09inetpeer: fix uninit-value in inet_getpeerEric Dumazet
syzbot/KMSAN reported that p->dtime was read while it was not yet initialized in: delta = (__u32)jiffies - p->dtime; if (delta < ttl || !refcount_dec_if_one(&p->refcnt)) gc_stack[i] = NULL; This is a false positive, because the inetpeer won't be erased from rb-tree if the refcount_dec_if_one(&p->refcnt) does not succeed. And this won't happen before first inet_putpeer() call for this inetpeer has been done, and ->dtime field is written exactly before the refcount_dec_and_test(&p->refcnt). The KMSAN report was : BUG: KMSAN: uninit-value in inet_peer_gc net/ipv4/inetpeer.c:163 [inline] BUG: KMSAN: uninit-value in inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228 CPU: 0 PID: 9494 Comm: syz-executor5 Not tainted 4.16.0+ #82 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 inet_peer_gc net/ipv4/inetpeer.c:163 [inline] inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228 inet_getpeer_v4 include/net/inetpeer.h:110 [inline] icmpv4_xrlim_allow net/ipv4/icmp.c:330 [inline] icmp_send+0x2b44/0x3050 net/ipv4/icmp.c:725 ip_options_compile+0x237c/0x29f0 net/ipv4/ip_options.c:472 ip_rcv_options net/ipv4/ip_input.c:284 [inline] ip_rcv_finish+0xda8/0x16d0 net/ipv4/ip_input.c:365 NF_HOOK include/linux/netfilter.h:288 [inline] ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493 __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562 __netif_receive_skb net/core/dev.c:4627 [inline] netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701 netif_receive_skb+0x230/0x240 net/core/dev.c:4725 tun_rx_batched drivers/net/tun.c:1555 [inline] tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776 do_iter_write+0x30d/0xd40 fs/read_write.c:932 vfs_writev 
fs/read_write.c:977 [inline] do_writev+0x3c9/0x830 fs/read_write.c:1012 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085 SyS_writev+0x56/0x80 fs/read_write.c:1082 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x455111 RSP: 002b:00007fae0365cba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014 RAX: ffffffffffffffda RBX: 000000000000002e RCX: 0000000000455111 RDX: 0000000000000001 RSI: 00007fae0365cbf0 RDI: 00000000000000fc RBP: 0000000020000040 R08: 00000000000000fc R09: 0000000000000000 R10: 000000000000002e R11: 0000000000000293 R12: 00000000ffffffff R13: 0000000000000658 R14: 00000000006fc8e0 R15: 0000000000000000 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314 kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756 inet_getpeer+0xed8/0x1e70 net/ipv4/inetpeer.c:210 inet_getpeer_v4 include/net/inetpeer.h:110 [inline] ip4_frag_init+0x4d1/0x740 net/ipv4/ip_fragment.c:153 inet_frag_alloc net/ipv4/inet_fragment.c:369 [inline] inet_frag_create net/ipv4/inet_fragment.c:385 [inline] inet_frag_find+0x7da/0x1610 net/ipv4/inet_fragment.c:418 ip_find net/ipv4/ip_fragment.c:275 [inline] ip_defrag+0x448/0x67a0 net/ipv4/ip_fragment.c:676 ip_check_defrag+0x775/0xda0 net/ipv4/ip_fragment.c:724 packet_rcv_fanout+0x2a8/0x8d0 net/packet/af_packet.c:1447 deliver_skb net/core/dev.c:1897 [inline] deliver_ptype_list_skb net/core/dev.c:1912 [inline] __netif_receive_skb_core+0x314a/0x4a80 net/core/dev.c:4545 __netif_receive_skb net/core/dev.c:4627 [inline] netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701 netif_receive_skb+0x230/0x240 net/core/dev.c:4725 tun_rx_batched drivers/net/tun.c:1555 [inline] tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776 do_iter_write+0x30d/0xd40 fs/read_write.c:932 
vfs_writev fs/read_write.c:977 [inline] do_writev+0x3c9/0x830 fs/read_write.c:1012 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085 SyS_writev+0x56/0x80 fs/read_write.c:1082 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
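The age check quoted in this message relies on unsigned 32-bit subtraction, which stays correct when the jiffies counter wraps. A minimal user-space sketch of just that comparison, with a hypothetical helper name peer_is_stale() and without the refcount half of the condition:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 when at least `ttl` ticks have passed since `dtime`.
 * Unsigned subtraction is well defined modulo 2^32, so the result is
 * still right after `now` has wrapped past zero.
 */
static int peer_is_stale(uint32_t now, uint32_t dtime, uint32_t ttl)
{
    uint32_t delta = now - dtime;

    return delta >= ttl;
}
```

In the quoted kernel code an entry is kept (its gc_stack slot is cleared) when delta < ttl or when the refcount cannot be dropped from one to zero; the sketch covers only the first of those two conditions.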
2018-04-09syscalls/x86: Adapt syscall_wrapper.h to the new syscall stub naming conventionDominik Brodowski
Make the code in syscall_wrapper.h more readable by naming the stub macros similar to the stub they provide. While at it, fix a stray newline at the end of the __IA32_COMPAT_SYS_STUBx macro. Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180409105145.5364-5-linux@dominikbrodowski.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-09syscalls/core, syscalls/x86: Rename struct pt_regs-based sys_*() to ↵Dominik Brodowski
__x64_sys_*() This rename allows us to have a coherent syscall stub naming convention on 64-bit x86 (0xffffffff prefix removed): 810f0af0 t kernel_waitid # common (32/64) kernel helper <inline> __do_sys_waitid # inlined helper doing actual work 810f0be0 t __se_sys_waitid # C func calling inlined helper <inline> __do_compat_sys_waitid # inlined helper doing actual work 810f0d80 t __se_compat_sys_waitid # compat C func calling inlined helper 810f2080 T __x64_sys_waitid # x64 64-bit-ptregs -> C stub 810f20b0 T __ia32_sys_waitid # ia32 32-bit-ptregs -> C stub[*] 810f2470 T __ia32_compat_sys_waitid # ia32 32-bit-ptregs -> compat C stub 810f2490 T __x32_compat_sys_waitid # x32 64-bit-ptregs -> compat C stub [*] This stub is unused, as the syscall table links __ia32_compat_sys_waitid instead of __ia32_sys_waitid as we need a compat variant here. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180409105145.5364-4-linux@dominikbrodowski.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-09syscalls/core, syscalls/x86: Clean up compat syscall stub naming conventionDominik Brodowski
Tidy the naming convention for compat syscall stubs. Hints which describe the purpose of the stub go in front and receive a double underscore to denote that they are generated on-the-fly by the COMPAT_SYSCALL_DEFINEx() macro.

For the generic case, this means:

t  kernel_waitid              # common C function (see kernel/exit.c)
   __do_compat_sys_waitid     # inlined helper doing the actual work
                              # (takes original parameters as declared)
T  __se_compat_sys_waitid     # sign-extending C function calling inlined
                              # helper (takes parameters of type long,
                              # casts them to unsigned long and then to
                              # the declared type)
T  compat_sys_waitid          # alias to __se_compat_sys_waitid()
                              # (taking parameters as declared), to
                              # be included in syscall table

For x86, the naming is as follows:

t  kernel_waitid              # common C function (see kernel/exit.c)
   __do_compat_sys_waitid     # inlined helper doing the actual work
                              # (takes original parameters as declared)
t  __se_compat_sys_waitid     # sign-extending C function calling inlined
                              # helper (takes parameters of type long,
                              # casts them to unsigned long and then to
                              # the declared type)
T  __ia32_compat_sys_waitid   # IA32_EMULATION 32-bit-ptregs -> C stub,
                              # calls __se_compat_sys_waitid(); to be
                              # included in syscall table
T  __x32_compat_sys_waitid    # x32 64-bit-ptregs -> C stub, calls
                              # __se_compat_sys_waitid(); to be included
                              # in syscall table

If only one of IA32_EMULATION and x32 is enabled, __se_compat_sys_waitid() may be inlined into the stub __{ia32,x32}_compat_sys_waitid().
Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180409105145.5364-3-linux@dominikbrodowski.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
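The layering of __do_ and __se_ helpers described in this message can be imitated with a small user-space macro. DEMO_COMPAT_SYSCALL_DEFINE1 and the demo_abs "syscall" are hypothetical names invented for this sketch; the real COMPAT_SYSCALL_DEFINEx() machinery handles several arguments, aliases and per-architecture stubs, none of which is reproduced here.

```c
#include <assert.h>

/*
 * Hypothetical one-argument version of the pattern: the macro emits a
 * forward declaration of the inlined worker, then the __se_ wrapper
 * that takes a long and casts it (via unsigned long) to the declared
 * type, and finally opens the definition of the worker so that the
 * function body can follow the macro invocation.
 */
#define DEMO_COMPAT_SYSCALL_DEFINE1(name, type, arg)                  \
    static inline long __do_compat_sys_##name(type arg);              \
    long __se_compat_sys_##name(long arg)                             \
    {                                                                 \
        return __do_compat_sys_##name((type)(unsigned long)arg);      \
    }                                                                 \
    static inline long __do_compat_sys_##name(type arg)

/* The body written here becomes __do_compat_sys_demo_abs(). */
DEMO_COMPAT_SYSCALL_DEFINE1(demo_abs, int, value)
{
    return value < 0 ? -(long)value : (long)value;
}
```

A table of entry points would then reference __se_compat_sys_demo_abs() (or, on x86, a further ptregs stub that calls it), while the code a developer writes only ever sees the declared parameter type.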