Age | Commit message (Collapse) | Author |
|
Add statistics to /proc/fs/afs/stats for data transfer RPC operations. New
lines are added that look like:
file-rd : n=55794 nb=10252282150
file-wr : n=9789 nb=3247763645
where n= indicates the number of ops completed and nb= indicates the number
of bytes successfully transferred. file-rd is the counts for read/fetch
operations and file-wr the counts for write/store operations.
Note that directory and symlink downloading are included in the file-rd
stats at the moment.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Trace protocol errors detected in afs.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Locally edit the contents of an AFS directory upon a successful inode
operation that modifies that directory (such as mkdir, create and unlink)
so that we can avoid the current practice of re-downloading the directory
after each change.
This is viable provided that the directory version number we get back from
the modifying RPC op is exactly incremented by 1 from what we had
previously. The data in the directory contents is in a defined format that
we have to parse locally to perform lookups and readdir, so modifying isn't
a problem.
If the edit fails, we just clear the VALID flag on the directory and it
will be reloaded next time it is needed.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Adjust the AFS directory XDR structures in a number of superficial ways:
(1) Rename them to all begin afs_xdr_.
(2) Use u8 instead of uint8_t.
(3) Mark the structures as __packed so they don't get rearranged by the
compiler.
(4) Rename the hdr member of afs_xdr_dir_block to meta.
(5) Rename the pagehdr member of afs_xdr_dir_block to hdr.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Split the directory content definitions into a header file so that they can
be used by multiple .c files.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
AFS directories are structured blobs that are downloaded just like files
and then parsed by the lookup and readdir code and, as such, are currently
handled in the pagecache like any other file, with the entire directory
content being thrown away each time the directory changes.
However, since the blob is a known structure and since the data version
counter on a directory increases by exactly one for each change committed
to that directory, we can actually edit the directory locally rather than
fetching it from the server after each locally-induced change.
What we can't do, though, is mix data from the server and data from the
client since the server is technically at liberty to rearrange or compress
a directory if it sees fit, provided it updates the data version number
when it does so and breaks the callback (ie. sends a notification).
Further, lookup with lookup-ahead, readdir and, when it arrives, local
editing are likely want to scan the whole of a directory.
So directory handling needs to be improved to maintain the coherency of the
directory blob prior to permitting local directory editing.
To this end:
(1) If any directory page gets discarded, invalidate and reread the entire
directory.
(2) If readpage notes that if when it fetches a single page that the
version number has changed, the entire directory is flagged for
invalidation.
(3) Read as much of the directory in one go as we can.
Note that this removes local caching of directories in fscache for the
moment as we can't pass the pages to fscache_read_or_alloc_pages() since
page->lru is in use by the LRU.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Split the AFS dynamic root stuff out of the main directory handling file
and into its own file as they share little in common.
The dynamic root code also gets its own dentry and inode ops tables.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Each afs dentry is tagged with the version that the parent directory was at
last time it was validated and, currently, if this differs, the directory
is scanned and the dentry is refreshed.
However, this leads to an excessive amount of revalidation on directories
that get modified on the client without conflict with another client. We
know there's no conflict because the parent directory's data version number
got incremented by exactly 1 on any create, mkdir, unlink, etc., therefore
we can trust the current state of the unaffected dentries when we perform a
local directory modification.
Optimise by keeping track of the last version of the parent directory that
was changed outside of the client in the parent directory's vnode and using
that to validate the dentries rather than the current version.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Rearrange the AFSFetchStatus to inode attribute mapping code in a number of
ways:
(1) Use an XDR structure rather than a series of incremented pointer
accesses when decoding an AFSFetchStatus object. This allows
out-of-order decode.
(2) Don't store the if_version value but rather just check it and abort if
it's not something we can handle.
(3) Store the owner and group in the status record as raw values rather
than converting them to kuid/kgid. Do that when they're mapped into
i_uid/i_gid.
(4) Validate the type and abort code up front and abort if they're wrong.
(5) Split the inode attribute setting out into its own function from the
XDR decode of an AFSFetchStatus object. This allows it to be called
from elsewhere too.
(6) Differentiate changes to data from changes to metadata.
(7) Use the split-out attribute mapping function from afs_iget().
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Store the data version number indicated by an FS.FetchData op into the read
request structure so that it's accessible by the page reader.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
We no longer parse symlinks when we get the inode to determine if this
symlink is actually a mountpoint as we detect that by examining the mode
instead (symlinks are always 0777 and mountpoints 0644).
Access the cache after mapping the status so that we don't have to manually
set the inode size now.
Note that this may need adjusting if the disconnected operation is
implemented as the file metadata may have to be obtained from the cache.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Introduce a proc file that displays a bunch of statistics for the AFS
filesystem in the current network namespace.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Dump an AFS FileStatus record that is detected as invalid.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
The code that fixes the crashes in the following commit introduced a small
memory leak:
commit 6a2cf8d3663e ("scsi: qla2xxx: Fix crashes in qla2x00_probe_one on probe failure")
Fixing this requires a bit of reworking, which I've explained. Also provide
some code cleanup.
There is a small window in qla2x00_probe_one where if qla2x00_alloc_queues
fails, we end up never freeing req and rsp and leak 0xc0 and 0xc8 bytes
respectively (the sizes of req and rsp).
I originally put in checks to test for this condition which were based on
the incorrect assumption that if ha->rsp_q_map and ha->req_q_map were
allocated, then rsp and req were allocated as well. This is incorrect.
There is a window between these allocations:
ret = qla2x00_mem_alloc(ha, req_length, rsp_length, &req, &rsp);
goto probe_hw_failed;
[if successful, both rsp and req allocated]
base_vha = qla2x00_create_host(sht, ha);
goto probe_hw_failed;
ret = qla2x00_request_irqs(ha, rsp);
goto probe_failed;
if (qla2x00_alloc_queues(ha, req, rsp)) {
goto probe_failed;
[if successful, now ha->rsp_q_map and ha->req_q_map allocated]
To simplify this, we should just set req and rsp to NULL after we free
them. Sounds simple enough? The problem is that req and rsp are pointers
defined in the qla2x00_probe_one and they are not always passed by reference
to the routines that free them.
Here are paths which can free req and rsp:
PATH 1:
qla2x00_probe_one
ret = qla2x00_mem_alloc(ha, req_length, rsp_length, &req, &rsp);
[req and rsp are passed by reference, but if this fails, we currently
do not NULL out req and rsp. Easily fixed]
PATH 2:
qla2x00_probe_one
failing in qla2x00_request_irqs or qla2x00_alloc_queues
probe_failed:
qla2x00_free_device(base_vha);
qla2x00_free_req_que(ha, req)
qla2x00_free_rsp_que(ha, rsp)
PATH 3:
qla2x00_probe_one:
failing in qla2x00_mem_alloc or qla2x00_create_host
probe_hw_failed:
qla2x00_free_req_que(ha, req)
qla2x00_free_rsp_que(ha, rsp)
PATH 1: This should currently work, but it doesn't because rsp and rsp are
not set to NULL in qla2x00_mem_alloc. Easily remedied.
PATH 2: req and rsp aren't passed in at all to qla2x00_free_device but are
derived from ha->req_q_map[0] and ha->rsp_q_map[0]. These are only set up if
qla2x00_alloc_queues succeeds.
In qla2x00_free_queues, we are protected from crashing if these don't exist
because req_qid_map and rsp_qid_map are only set on their allocation. We are
guarded in this way:
for (cnt = 0; cnt < ha->max_req_queues; cnt++) {
if (!test_bit(cnt, ha->req_qid_map))
continue;
PATH 3: This works. We haven't freed req or rsp yet (or they were never
allocated if qla2x00_mem_alloc failed), so we'll attempt to free them here.
To summarize, there are a few small changes to make this work correctly and
(and for some cleanup):
1) (For PATH 1) Set *rsp and *req to NULL in case of failure in
qla2x00_mem_alloc so these are correctly set to NULL back in
qla2x00_probe_one
2) After jumping to probe_failed: and calling qla2x00_free_device,
explicitly set rsp and req to NULL so further calls with these pointers do
not crash, i.e. the free queue calls in the probe_hw_failed section we fall
through to.
3) Fix return code check in the call to qla2x00_alloc_queues. We currently
drop the return code on the floor. The probe fails but the caller of the
probe doesn't have an error code, so it attaches to pci. This can result in
a crash on module shutdown.
4) Remove unnecessary NULL checks in qla2x00_free_req_que,
qla2x00_free_rsp_que, and the egregious NULL checks before kfrees and vfrees
in qla2x00_mem_free.
I tested this out running a scenario where the card breaks at various times
during initialization. I made sure I forced every error exit path in
qla2x00_probe_one.
Cc: <stable@vger.kernel.org> # v4.16
Fixes: 6a2cf8d3663e ("scsi: qla2xxx: Fix crashes in qla2x00_probe_one on probe failure")
Signed-off-by: Bill Kuzeja <william.kuzeja@stratus.com>
Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Currently scsi_dh_lookup() doesn't check for NULL as a device name. This
combined with nvme over dm-mpath results in the following messages
emitted by device-mapper:
device-mapper: multipath: Could not failover device 259:67: Handler scsi_dh_(null) error 14.
Let scsi_dh_lookup() fail fast on NULL names.
[mkp: typo fix]
Cc: <stable@vger.kernel.org> # v4.16
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
The first assignment to shost->use_blk_mq is redundant as it is
overwritten by the following statement. Remove this redundant code.
Detected by CoverityScan, CID#1466993 ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Implement @cell substitution handling such that if @cell is seen as a name
in a dynamic root mount, then the name of the root cell for that network
namespace will be substituted for @cell during lookup.
The substitution of @cell for the current net namespace is set by writing
the cell name to /proc/fs/afs/rootcell. The value can be obtained by
reading the file.
For example:
# mount -t afs none /kafs -o dyn
# echo grand.central.org >/proc/fs/afs/rootcell
# ls /kafs/@cell
archive/ cvs/ doc/ local/ project/ service/ software/ user/ www/
# cat /proc/fs/afs/rootcell
grand.central.org
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Implement the AFS feature by which @sys at the end of a pathname component
may be substituted for one of a list of values, typically naming the
operating system. Up to 16 alternatives may be specified and these are
tried in turn until one works. Each network namespace has[*] a separate
independent list.
Upon creation of a new network namespace, the list of values is
initialised[*] to a single OpenAFS-compatible string representing arch type
plus "_linux26". For example, on x86_64, the sysname is "amd64_linux26".
[*] Or will, once network namespace support is finalised in kAFS.
The list may be set by:
# for i in foo bar linux-x86_64; do echo $i; done >/proc/fs/afs/sysname
for which separate writes to the same fd are amalgamated and applied on
close. The LF character may be used as a separator to specify multiple
items in the same write() call.
The list may be cleared by:
# echo >/proc/fs/afs/sysname
and read by:
# cat /proc/fs/afs/sysname
foo
bar
linux-x86_64
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
When afs_lookup() is called, prospectively look up the next 50 uncached
fids also from that same directory and cache the results, rather than just
looking up the one file requested.
This allows us to use the FS.InlineBulkStatus RPC op to increase efficiency
by fetching up to 50 file statuses at a time.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
AFS cells that are added or set as the workstation cell through /proc are
pinned against removal by setting the AFS_CELL_FL_NO_GC flag on them and
taking a ref. The ref should be only taken if the flag wasn't already set.
Fix this by making it conditional.
Without this an assertion failure will occur during module removal
indicating that the refcount is too elevated.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Fix warnings raised by checker, including:
(*) Warnings raised by unequal comparison for the purposes of sorting,
where the endianness doesn't matter:
fs/afs/addr_list.c:246:21: warning: restricted __be16 degrades to integer
fs/afs/addr_list.c:246:30: warning: restricted __be16 degrades to integer
fs/afs/addr_list.c:248:21: warning: restricted __be32 degrades to integer
fs/afs/addr_list.c:248:49: warning: restricted __be32 degrades to integer
fs/afs/addr_list.c:283:21: warning: restricted __be16 degrades to integer
fs/afs/addr_list.c:283:30: warning: restricted __be16 degrades to integer
(*) afs_set_cb_interest() is not actually used and can be removed.
(*) afs_cell_gc_delay() should be provided with a sysctl.
(*) afs_cell_destroy() needs to use rcu_access_pointer() to read
cell->vl_addrs.
(*) afs_init_fs_cursor() should be static.
(*) struct afs_vnode::permit_cache needs to be marked __rcu.
(*) afs_server_rcu() needs to use rcu_access_pointer().
(*) afs_destroy_server() should use rcu_access_pointer() on
server->addresses as the server object is no longer accessible.
(*) afs_find_server() casts __be16/__be32 values to int in order to
directly compare them for the purpose of finding a match in a list,
but is should also annotate the cast with __force to avoid checker
warnings.
(*) afs_check_permit() accesses vnode->permit_cache outside of the RCU
readlock, though it doesn't then access the value; the extraneous
access is deleted.
False positives:
(*) Conditional locking around the code in xdr_decode_AFSFetchStatus. This
can be dealt with in a separate patch.
fs/afs/fsclient.c:148:9: warning: context imbalance in 'xdr_decode_AFSFetchStatus' - different lock contexts for basic block
(*) Incorrect handling of seq-retry lock context balance:
fs/afs/inode.c:455:38: warning: context imbalance in 'afs_getattr' - different
lock contexts for basic block
fs/afs/server.c:52:17: warning: context imbalance in 'afs_find_server' - different lock contexts for basic block
fs/afs/server.c:128:17: warning: context imbalance in 'afs_find_server_by_uuid' - different lock contexts for basic block
Errors:
(*) afs_lookup_cell_rcu() needs to break out of the seq-retry loop, not go
round again if it successfully found the workstation cell.
(*) Fix UUID decode in afs_deliver_cb_probe_uuid().
(*) afs_cache_permit() has a missing rcu_read_unlock() before one of the
jumps to the someone_else_changed_it label. Move the unlock to after
the label.
(*) afs_vl_get_addrs_u() is using ntohl() rather than htonl() when
encoding to XDR.
(*) afs_deliver_yfsvl_get_endpoints() is using htonl() rather than ntohl()
when decoding from XDR.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Remove the const marking from the actor function pointer in the dir_context
struct. The const prevents the structure from being used as part of a
kmalloc'd object as it makes the compiler require that the actor member be
set at object initialisation time (or not at all), incuring something like
the following error if you try and set it later:
fs/afs/dir.c:556:20: error: assignment of read-only member 'actor'
Marking the member const like this adds very little in the way of sanity
checking as the type checking system is likely to provide sufficient - and
if not, the kernel is very likely to oops repeatably in this case.
Fixes: ac6614b76478 ("[readdir] constify ->actor")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs namei updates from Al Viro:
- make lookup_one_len() safe with parent locked only shared(incoming
afs series wants that)
- fix of getname_kernel() regression from 2015 (-stable fodder, that
one).
* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
getname_kernel() needs to make sure that ->name != ->iname in long case
make lookup_one_len() safe to use with directory locked shared
new helper: __lookup_slow()
merge common parts of lookup_one_len{,_unlocked} into common helper
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"Fixes and cleanups:
- Documentation cleanups
- removal of unused code
- make some structs static
- implement Orangefs vm_operations fault callout
- eliminate two single-use functions and put their cleaned up code in
line.
- replace a vmalloc/memset instance with vzalloc
- fix a race condition bug in wait code"
* tag 'for-linus-4.17-ofs' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
Orangefs: documentation updates
orangefs: document package install and xfstests procedure
orangefs: remove unused code
orangefs: make several *_operations structs static
orangefs: implement vm_ops->fault
orangefs: open code short single-use functions
orangefs: replace vmalloc and memset with vzalloc
orangefs: bug fix for a race condition when getting a slot
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore fix from Kees Cook:
"Fix another compression Kconfig combination missed in testing (Tobias
Regnery)"
* tag 'pstore-v4.17-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore: fix crypto dependencies without compression
|
|
Commit 0619f0f5e36f ("selinux: wrap selinuxfs state") triggers a BUG
when SELinux is runtime-disabled (i.e. systemd or equivalent disables
SELinux before initial policy load via /sys/fs/selinux/disable based on
/etc/selinux/config SELINUX=disabled).
This does not manifest if SELinux is disabled via kernel command line
argument or if SELinux is enabled (permissive or enforcing).
Before:
SELinux: Disabled at runtime.
BUG: Dentry 000000006d77e5c7{i=17,n=null} still in use (1) [unmount of selinuxfs selinuxfs]
After:
SELinux: Disabled at runtime.
Fixes: 0619f0f5e36f ("selinux: wrap selinuxfs state")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull kvm updates from Paolo Bonzini:
"ARM:
- VHE optimizations
- EL2 address space randomization
- speculative execution mitigations ("variant 3a", aka execution past
invalid privilege register access)
- bugfixes and cleanups
PPC:
- improvements for the radix page fault handler for HV KVM on POWER9
s390:
- more kvm stat counters
- virtio gpu plumbing
- documentation
- facilities improvements
x86:
- support for VMware magic I/O port and pseudo-PMCs
- AMD pause loop exiting
- support for AMD core performance extensions
- support for synchronous register access
- expose nVMX capabilities to userspace
- support for Hyper-V signaling via eventfd
- use Enlightened VMCS when running on Hyper-V
- allow userspace to disable MWAIT/HLT/PAUSE vmexits
- usual roundup of optimizations and nested virtualization bugfixes
Generic:
- API selftest infrastructure (though the only tests are for x86 as
of now)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (174 commits)
kvm: x86: fix a prototype warning
kvm: selftests: add sync_regs_test
kvm: selftests: add API testing infrastructure
kvm: x86: fix a compile warning
KVM: X86: Add Force Emulation Prefix for "emulate the next instruction"
KVM: X86: Introduce handle_ud()
KVM: vmx: unify adjacent #ifdefs
x86: kvm: hide the unused 'cpu' variable
KVM: VMX: remove bogus WARN_ON in handle_ept_misconfig
Revert "KVM: X86: Fix SMRAM accessing even if VM is shutdown"
kvm: Add emulation for movups/movupd
KVM: VMX: raise internal error for exception during invalid protected mode state
KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending
KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
KVM: x86: Fix misleading comments on handling pending exceptions
KVM: x86: Rename interrupt.pending to interrupt.injected
KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt
x86/kvm: use Enlightened VMCS when running on Hyper-V
x86/hyper-v: detect nested features
x86/hyper-v: define struct hv_enlightened_vmcs and clean field bits
...
|
|
|
|
|
|
Commit 3c8ba0d61d04 ("kernel.h: Retain constant expression output for
max()/min()") rewrote our min/max macros to be very clever, but in the
meantime resurrected a variable name shadow issue that we had had
previously fixed in commit 589a9785ee3a ("min/max: remove sparse
warnings when they're nested").
That commit talks about the sparse warnings that this shadowing causes,
which we ignored as just a minor annoyance. But it turns out that the
sparse warning is the least of our problems. We actually have a real
bug due to the shadowing through the interaction with "min_not_zero()",
which ends up doing
min(__x, __y)
internally, and then the new declaration of "__x" and "__y" as new
variables in __cmp_once() results in a complete mess of an expression,
and "min_not_zero()" doesn't work at all.
For some odd reason, this only ever caused (reported) problems on s390,
even though it is a generic issue and most of the (obviously successful)
testing of the problematic commit had happened on other architectures.
Quoting Sebastian Ott:
"What happened is that the bio build by the partition detection code
was attempted to be split by the block layer because the block queue
had a max_sector setting of 0. blk_queue_max_hw_sectors uses
min_not_zero."
So re-introduce the use of __UNIQUE_ID() to make sure that the min/max
macros do not have these kinds of clashes.
[ That said, __UNIQUE_ID() itself has several issues that make it less
than wonderful.
In particular, the "uniqueness" has a fallback on the line number,
which means that it's not actually unique in more complex cases if you
don't build with gcc or clang (which have working unique counters that
aren't tied to line numbers).
That historical broken fallback also means that we have that pointless
"prefix" argument that doesn't actually make much sense _except_ for
the known-broken case. Oh well. ]
Fixes: 3c8ba0d61d04 ("kernel.h: Retain constant expression output for max()/min()")
Reported-and-tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The filestreams allocator stores an xfs_fstrm_item structure in the MRU to
cache inode number to agno mappings for a particular length of time. Each
xfs_fstrm_item contains the internal MRU structure, an inode pointer and
agno value.
The inode pointer stored in the xfs_fstrm_item is not referenced, however,
which means the inode itself can be removed and reclaimed before the MRU
item is freed. If this occurs, xfs_fstrm_free_func() can access freed or
unrelated memory through xfs_fstrm_item->ip and crash.
The obvious solution is to grab an inode reference for xfs_fstrm_item.
The filestream mechanism only actually uses the inode pointer as a means
to access the xfs_mount, however. Rather than add unnecessary
complexity, simplify the implementation to store an xfs_mount pointer in
struct xfs_mru_cache, and pass it to the free callback. This also
requires updates to the tracepoint class to provide the associated data
via parameters rather than the inode and a minor hack to peek at the MRU
key to establish the inode number at free time.
Based on debugging work and an earlier patch from Brian Foster, who
also wrote most of this changelog.
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The "normal" kernel page table creation mechanisms using
PAGE_KERNEL_* page protections will never set _PAGE_GLOBAL with PTI.
The few places in the kernel that always want _PAGE_GLOBAL must
avoid using PAGE_KERNEL_*.
Document that we want it here and its use is not accidental.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180406205507.BCF4D4F0@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The __PAGE_KERNEL_* page permissions are "raw". They contain bits
that may or may not be supported on the current processor. They need
to be filtered by a mask (currently __supported_pte_mask) to turn them
into a value that we can actually set in a PTE.
These __PAGE_KERNEL_* values all contain _PAGE_GLOBAL. But, with PTI,
we want to be able to support _PAGE_GLOBAL (have the bit set in
__supported_pte_mask) but not have it appear in any of these masks by
default.
This patch creates a new mask, __default_kernel_pte_mask, and applies
it when creating all of the PAGE_KERNEL_* masks. This makes
PAGE_KERNEL_* safe to use anywhere (they only contain supported bits).
It also ensures that PAGE_KERNEL_* contains _PAGE_GLOBAL on PTI=n
kernels but clears _PAGE_GLOBAL when PTI=y.
We also make __default_kernel_pte_mask a non-GPL exported symbol
because there are plenty of driver-available interfaces that take
PAGE_KERNEL_* permissions.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180406205506.030DB6B6@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When clearing _PAGE_PRESENT on a huge page, we need to be careful
to also clear _PAGE_PSE, otherwise it might still get confused
for a valid large page table entry.
We do that near the spot where we *set* _PAGE_PSE. That's fine,
but it's unnecessary. pgprot_large_2_4k() already did it.
BTW, I also noticed that pgprot_large_2_4k() and
pgprot_4k_2_large() are not symmetric. pgprot_large_2_4k() clears
_PAGE_PSE (because it is aliased to _PAGE_PAT) but
pgprot_4k_2_large() does not put _PAGE_PSE back. Bummer.
Also, add some comments and change "promote" to "move". "Promote"
seems an odd word to move when we are logically moving a bit to a
lower bit position. Also add an extra line return to make it clear
to which line the comment applies.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180406205504.9B0F44A9@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The pageattr code has a pattern repeated where it sets _PAGE_GLOBAL
for present PTEs but clears it for non-present PTEs. The intention
is to keep _PAGE_GLOBAL from getting confused with _PAGE_PROTNONE
since _PAGE_GLOBAL is for present PTEs and _PAGE_PROTNONE is for
non-present
But, this pattern makes no sense. Effectively, it says, if you use
the pageattr code, always set _PAGE_GLOBAL when _PAGE_PRESENT.
canon_pgprot() will clear it if unsupported (because it masks the
value with __supported_pte_mask) but we *always* set it. Even if
canon_pgprot() did not filter _PAGE_GLOBAL, it would be OK.
_PAGE_GLOBAL is ignored when CR4.PGE=0 by the hardware.
This unconditional setting of _PAGE_GLOBAL is a problem when we have
PTI and non-PTI and we want some areas to have _PAGE_GLOBAL and some
not.
This updated version of the code says:
1. Clear _PAGE_GLOBAL when !_PAGE_PRESENT
2. Never set _PAGE_GLOBAL implicitly
3. Allow _PAGE_GLOBAL to be in cpa.set_mask
4. Allow _PAGE_GLOBAL to be inherited from previous PTE
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180406205502.86E199DA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Pull ARM SA1100 updates from Russell King:
"We have support for arbitary MMIO registers providing platform GPIOs,
which allows us to abstract some of the SA11x0 CF support.
This set of updates makes that change"
* 'for-linus-sa1100' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: sa1100/simpad: switch simpad CF to use gpiod APIs
ARM: sa1100/shannon: convert to generic CF sockets
ARM: sa1100/nanoengine: convert to generic CF sockets
ARM: sa1100/h3xxx: switch h3xxx PCMCIA to use gpiod APIs
ARM: sa1100/cerf: convert to generic CF sockets
ARM: sa1100/assabet: convert to generic CF sockets
ARM: sa1100: provide infrastructure to support generic CF sockets
pcmcia: sa1100: provide generic CF support
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
For non-paranoid entries, idtentry knows how to switch from the
kernel stack to the user stack, as does error_entry. This results
in pointless duplication and code bloat. Make idtentry stop
thinking about stacks for non-paranoid entries.
This reduces text size by 5377 bytes.
This goes back to the following commit:
7f2590a110b8 ("x86/entry/64: Use a per-CPU trampoline stack for IDT entries")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/90aab80c1f906e70742eaa4512e3c9b5e62d59d4.1522794757.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This Kconfig warning appeared after a fix to the Kconfig validation.
The GPIO_CS5535 driver depends on the MFD_CS5535 driver, but the former
is selected in places where the latter is not:
WARNING: unmet direct dependencies detected for GPIO_CS5535
Depends on [m]: GPIOLIB [=y] && (X86 [=y] || MIPS || COMPILE_TEST [=y]) && MFD_CS5535 [=m]
Selected by [y]:
- OLPC_XO1_SCI [=y] && X86_32 [=y] && OLPC [=y] && OLPC_XO1_PM [=y] && INPUT [=y]=y
The warning does seem appropriate, since the GPIO_CS5535 driver won't
work unless MFD_CS5535 is also present. However, there is no link time
dependency between the two, so this caused no problems during randconfig
testing before.
This changes the 'select GPIO_CS5535' to 'depends on GPIO_CS5535' to
avoid the issue, at the expense of making it harder to configure the
driver (one now has to select the dependencies first).
The 'select MFD_CORE' part is completely redundant, since we already
depend on MFD_CS5535 here, so I'm removing that as well.
Ideally, the private symbols exported by that cs5535 gpio driver would
just be converted to gpiolib interfaces so we could expletely avoid
this dependency.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kbuild@vger.kernel.org
Fixes: f622f8279581 ("kconfig: warn unmet direct dependency of tristate symbols selected by y")
Link: http://lkml.kernel.org/r/20180404124539.3817101-1-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
swiotlb_alloc() calls dma_direct_alloc(), which can satisfy lower than 32-bit
DMA mask requests using GFP_DMA if the architecture supports it. Various
x86 drivers rely on that, so we need to support that. At the same time
the whole kernel expects a 32-bit DMA mask to just work, so the other magic
in swiotlb_dma_supported() isn't actually needed either.
Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: iommu@lists.linux-foundation.org
Fixes: 6e4bf5867783 ("x86/dma: Use generic swiotlb_ops")
Link: http://lkml.kernel.org/r/20180409091517.6619-2-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Pull ARM updates from Russell King:
"A number of core ARM changes:
- Refactoring linker script by Nicolas Pitre
- Enable source fortification
- Add support for Cortex R8"
* 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: decompressor: fix warning introduced in fortify patch
ARM: 8751/1: Add support for Cortex-R8 processor
ARM: 8749/1: Kconfig: Add ARCH_HAS_FORTIFY_SOURCE
ARM: simplify and fix linker script for TCM
ARM: linker script: factor out TCM bits
ARM: linker script: factor out vectors and stubs
ARM: linker script: factor out unwinding table sections
ARM: linker script: factor out stuff for the .text section
ARM: linker script: factor out stuff for the DISCARD section
ARM: linker script: factor out some common definitions between XIP and non-XIP
|
|
A use-after-free bug was caught by KASAN while running usdt related
code (BCC project. bcc/tests/python/test_usdt2.py):
==================================================================
BUG: KASAN: use-after-free in uprobe_perf_close+0x222/0x3b0
Read of size 4 at addr ffff880384f9b4a4 by task test_usdt2.py/870
CPU: 4 PID: 870 Comm: test_usdt2.py Tainted: G W 4.16.0-next-20180409 #215
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0xc7/0x15b
? show_regs_print_info+0x5/0x5
? printk+0x9c/0xc3
? kmsg_dump_rewind_nolock+0x6e/0x6e
? uprobe_perf_close+0x222/0x3b0
print_address_description+0x83/0x3a0
? uprobe_perf_close+0x222/0x3b0
kasan_report+0x1dd/0x460
? uprobe_perf_close+0x222/0x3b0
uprobe_perf_close+0x222/0x3b0
? probes_open+0x180/0x180
? free_filters_list+0x290/0x290
trace_uprobe_register+0x1bb/0x500
? perf_event_attach_bpf_prog+0x310/0x310
? probe_event_disable+0x4e0/0x4e0
perf_uprobe_destroy+0x63/0xd0
_free_event+0x2bc/0xbd0
? lockdep_rcu_suspicious+0x100/0x100
? ring_buffer_attach+0x550/0x550
? kvm_sched_clock_read+0x1a/0x30
? perf_event_release_kernel+0x3e4/0xc00
? __mutex_unlock_slowpath+0x12e/0x540
? wait_for_completion+0x430/0x430
? lock_downgrade+0x3c0/0x3c0
? lock_release+0x980/0x980
? do_raw_spin_trylock+0x118/0x150
? do_raw_spin_unlock+0x121/0x210
? do_raw_spin_trylock+0x150/0x150
perf_event_release_kernel+0x5d4/0xc00
? put_event+0x30/0x30
? fsnotify+0xd2d/0xea0
? sched_clock_cpu+0x18/0x1a0
? __fsnotify_update_child_dentry_flags.part.0+0x1b0/0x1b0
? pvclock_clocksource_read+0x152/0x2b0
? pvclock_read_flags+0x80/0x80
? kvm_sched_clock_read+0x1a/0x30
? sched_clock_cpu+0x18/0x1a0
? pvclock_clocksource_read+0x152/0x2b0
? locks_remove_file+0xec/0x470
? pvclock_read_flags+0x80/0x80
? fcntl_setlk+0x880/0x880
? ima_file_free+0x8d/0x390
? lockdep_rcu_suspicious+0x100/0x100
? ima_file_check+0x110/0x110
? fsnotify+0xea0/0xea0
? kvm_sched_clock_read+0x1a/0x30
? rcu_note_context_switch+0x600/0x600
perf_release+0x21/0x40
__fput+0x264/0x620
? fput+0xf0/0xf0
? do_raw_spin_unlock+0x121/0x210
? do_raw_spin_trylock+0x150/0x150
? SyS_fchdir+0x100/0x100
? fsnotify+0xea0/0xea0
task_work_run+0x14b/0x1e0
? task_work_cancel+0x1c0/0x1c0
? copy_fd_bitmaps+0x150/0x150
? vfs_read+0xe5/0x260
exit_to_usermode_loop+0x17b/0x1b0
? trace_event_raw_event_sys_exit+0x1a0/0x1a0
do_syscall_64+0x3f6/0x490
? syscall_return_slowpath+0x2c0/0x2c0
? lockdep_sys_exit+0x1f/0xaa
? syscall_return_slowpath+0x1a3/0x2c0
? lockdep_sys_exit+0x1f/0xaa
? prepare_exit_to_usermode+0x11c/0x1e0
? enter_from_user_mode+0x30/0x30
random: crng init done
? __put_user_4+0x1c/0x30
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f41d95f9340
RSP: 002b:00007fffe71e4268 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 000000000000000d RCX: 00007f41d95f9340
RDX: 0000000000000000 RSI: 0000000000002401 RDI: 000000000000000d
RBP: 0000000000000000 R08: 00007f41ca8ff700 R09: 00007f41d996dd1f
R10: 00007fffe71e41e0 R11: 0000000000000246 R12: 00007fffe71e4330
R13: 0000000000000000 R14: fffffffffffffffc R15: 00007fffe71e4290
Allocated by task 870:
kasan_kmalloc+0xa0/0xd0
kmem_cache_alloc_node+0x11a/0x430
copy_process.part.19+0x11a0/0x41c0
_do_fork+0x1be/0xa20
do_syscall_64+0x198/0x490
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Freed by task 0:
__kasan_slab_free+0x12e/0x180
kmem_cache_free+0x102/0x4d0
free_task+0xfe/0x160
__put_task_struct+0x189/0x290
delayed_put_task_struct+0x119/0x250
rcu_process_callbacks+0xa6c/0x1b60
__do_softirq+0x238/0x7ae
The buggy address belongs to the object at ffff880384f9b480
which belongs to the cache task_struct of size 12928
It occurs because task_struct is freed before perf_event which refers
to the task and task flags are checked while teardown of the event.
perf_event_alloc() assigns task_struct to hw.target of perf_event,
but there is no reference counting for it.
As a fix we get_task_struct() in perf_event_alloc() at above mentioned
assignment and put_task_struct() in _free_event().
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 63b6da39bb38e8f1a1ef3180d32a39d6 ("perf: Fix perf_event_exit_task() race")
Link: http://lkml.kernel.org/r/20180409100346.6416-1-bhole_prashant_q7@lab.ntt.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
Pull m68knommu update from Greg Ungerer:
"Only a single fix to set the DMA masks in the ColdFire FEC platform
data structure.
This stops the warning from dma-mapping.h at boot time"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
m68k: set dma and coherent masks for platform FEC ethernets
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha
Pull alpha updates from Matt Turner:
"A few small changes for alpha"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha:
alpha: io: reorder barriers to guarantee writeX() and iowriteX() ordering
alpha: Implement CPU vulnerabilities sysfs functions.
alpha: rtc: stop validating rtc_time in .read_time
alpha: rtc: remove unused set_mmss ops
|
|
Stephen reports that an x86 allmodconfig build fails to build the
of_pmem driver due to a missing definition of of_node_to_nid(). That
helper is currently only exported in the OF_NUMA=y case. In other cases,
ppc and sparc, it is a weak symbol, and outside of those platforms it is
a static inline.
Until an OF_NUMA=n configuration can reliably support usage of
of_node_to_nid() in modules across architectures, mark this driver as
'bool' instead of 'tristate'.
Cc: Rob Herring <robh@kernel.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
- Improvements for the spectre defense:
* The spectre related code is consolidated to a single file
nospec-branch.c
* Automatic enable/disable for the spectre v2 defenses (expoline vs.
nobp)
* Syslog messages for specve v2 are added
* Enable CONFIG_GENERIC_CPU_VULNERABILITIES and define the attribute
functions for spectre v1 and v2
- Add helper macros for assembler alternatives and use them to shorten
the code in entry.S.
- Add support for persistent configuration data via the SCLP Store Data
interface. The H/W interface requires a page table that uses 4K pages
only, the code to setup such an address space is added as well.
- Enable virtio GPU emulation in QEMU. To do this the depends
statements for a few common Kconfig options are modified.
- Add support for format-3 channel path descriptors and add a binary
sysfs interface to export the associated utility strings.
- Add a sysfs attribute to control the IFCC handling in case of
constant channel errors.
- The vfio-ccw changes from Cornelia.
- Bug fixes and cleanups.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (40 commits)
s390/kvm: improve stack frame constants in entry.S
s390/lpp: use assembler alternatives for the LPP instruction
s390/entry.S: use assembler alternatives
s390: add assembler macros for CPU alternatives
s390: add sysfs attributes for spectre
s390: report spectre mitigation via syslog
s390: add automatic detection of the spectre defense
s390: move nobp parameter functions to nospec-branch.c
s390/cio: add util_string sysfs attribute
s390/chsc: query utility strings via fmt3 channel path descriptor
s390/cio: rename struct channel_path_desc
s390/cio: fix unbind of io_subchannel_driver
s390/qdio: split up CCQ handling for EQBS / SQBS
s390/qdio: don't retry EQBS after CCQ 96
s390/qdio: restrict buffer merging to eligible devices
s390/qdio: don't merge ERROR output buffers
s390/qdio: simplify math in get_*_buffer_frontier()
s390/decompressor: trim uncompressed image head during the build
s390/crypto: Fix kernel crash on aes_s390 module remove.
s390/defkeymap: fix global init to zero
...
|
|
snd_pcm_hw_params() (more exactly snd_pcm_hw_params_choose()) contains
a check of the return error from snd_pcm_hw_param_first() and _last()
with snd_BUG_ON() -- i.e. it may trigger WARN_ON() depending on the
kconfig.
This was a valid check in the past, as these functions shouldn't
return any error if the parameters have been already refined via
snd_pcm_hw_refine() beforehand. However, the recent rewrite
introduced a kmalloc() in snd_pcm_hw_refine() for removing VLA, and
this brought a possibility to trigger an error. As a result, syzbot
caught lots of superfluous kernel WARN_ON() and paniced via fault
injection.
As the WARN_ON() is no longer valid with the introduction of
kmalloc(), let's drop snd_BUG_ON() check, in order to make the world
peaceful place again.
Reported-by: syzbot+803e0047ac3a3096bb4f@syzkaller.appspotmail.com
Fixes: 5730f9f744cf ("ALSA: pcm: Remove VLA usage")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
polling udp packets with small length(e.g. 1byte udp payload), because setting
VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
Ping-Latencies shown below were tested between two Virtual Machines using
netperf (UDP_STREAM, len=1), and then another machine pinged the client:
vq size=256
Packet-Weight Ping-Latencies(millisecond)
min avg max
Origin 3.319 18.489 57.303
64 1.643 2.021 2.552
128 1.825 2.600 3.224
256 1.997 2.710 4.295
512 1.860 3.171 4.631
1024 2.002 4.173 9.056
2048 2.257 5.650 9.688
4096 2.093 8.508 15.943
vq size=512
Packet-Weight Ping-Latencies(millisecond)
min avg max
Origin 6.537 29.177 66.245
64 2.798 3.614 4.403
128 2.861 3.820 4.775
256 3.008 4.018 4.807
512 3.254 4.523 5.824
1024 3.079 5.335 7.747
2048 3.944 8.201 12.762
4096 4.158 11.057 19.985
Seems pretty consistent, a small dip at 2 VQ sizes.
Ring size is a hint from device about a burst size it can tolerate. Based on
benchmarks, set the weight to 2 * vq size.
To evaluate this change, another tests were done using netperf(RR, TX) between
two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was
tweaked through qemu. Results shown below does not show obvious changes.
vq size=256 TCP_RR vq size=512 TCP_RR
size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
1/ 1/ -7%/ -2% 1/ 1/ 0%/ -2%
1/ 4/ +1%/ 0% 1/ 4/ +1%/ 0%
1/ 8/ +1%/ -2% 1/ 8/ 0%/ +1%
64/ 1/ -6%/ 0% 64/ 1/ +7%/ +3%
64/ 4/ 0%/ +2% 64/ 4/ -1%/ +1%
64/ 8/ 0%/ 0% 64/ 8/ -1%/ -2%
256/ 1/ -3%/ -4% 256/ 1/ -4%/ -2%
256/ 4/ +3%/ +4% 256/ 4/ +1%/ +2%
256/ 8/ +2%/ 0% 256/ 8/ +1%/ -1%
vq size=256 UDP_RR vq size=512 UDP_RR
size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
1/ 1/ -5%/ +1% 1/ 1/ -3%/ -2%
1/ 4/ +4%/ +1% 1/ 4/ -2%/ +2%
1/ 8/ -1%/ -1% 1/ 8/ -1%/ 0%
64/ 1/ -2%/ -3% 64/ 1/ +1%/ +1%
64/ 4/ -5%/ -1% 64/ 4/ +2%/ 0%
64/ 8/ 0%/ -1% 64/ 8/ -2%/ +1%
256/ 1/ +7%/ +1% 256/ 1/ -7%/ 0%
256/ 4/ +1%/ +1% 256/ 4/ -3%/ -4%
256/ 8/ +2%/ +2% 256/ 8/ +1%/ +1%
vq size=256 TCP_STREAM vq size=512 TCP_STREAM
size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
64/ 1/ 0%/ -3% 64/ 1/ 0%/ 0%
64/ 4/ +3%/ -1% 64/ 4/ -2%/ +4%
64/ 8/ +9%/ -4% 64/ 8/ -1%/ +2%
256/ 1/ +1%/ -4% 256/ 1/ +1%/ +1%
256/ 4/ -1%/ -1% 256/ 4/ -3%/ 0%
256/ 8/ +7%/ +5% 256/ 8/ -3%/ 0%
512/ 1/ +1%/ 0% 512/ 1/ -1%/ -1%
512/ 4/ +1%/ -1% 512/ 4/ 0%/ 0%
512/ 8/ +7%/ -5% 512/ 8/ +6%/ -1%
1024/ 1/ 0%/ -1% 1024/ 1/ 0%/ +1%
1024/ 4/ +3%/ 0% 1024/ 4/ +1%/ 0%
1024/ 8/ +8%/ +5% 1024/ 8/ -1%/ 0%
2048/ 1/ +2%/ +2% 2048/ 1/ -1%/ 0%
2048/ 4/ +1%/ 0% 2048/ 4/ 0%/ -1%
2048/ 8/ -2%/ 0% 2048/ 8/ 5%/ -1%
4096/ 1/ -2%/ 0% 4096/ 1/ -2%/ 0%
4096/ 4/ +2%/ 0% 4096/ 4/ 0%/ 0%
4096/ 8/ +9%/ -2% 4096/ 8/ -5%/ -1%
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
Signed-off-by: Lidong Chen <lidongchen@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It is too expensive to pass u64 values via linked list, instead
allocate array for them by overall number of mac addresses from netdev.
This eventually removes multiple kmalloc() calls, aviod memory
fragmentation and allow to put single null check on kmalloc
return value in order to prevent a potential null pointer dereference.
Addresses-Coverity-ID: 1467429 ("Dereference null return value")
Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|