linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2017-09-25	fs: Fix page cache inconsistency when mixing buffered and AIO DIO	Lukas Czerner
	Currently when mixing buffered reads and asynchronous direct writes it is possible to end up with the situation where we have stale data in the page cache while the new data is already written to disk. This is permanent until the affected pages are flushed away. Despite the fact that mixing buffered and direct IO is ill-advised it does pose a thread for a data integrity, is unexpected and should be fixed. Fix this by deferring completion of asynchronous direct writes to a process context in the case that there are mapped pages to be found in the inode. Later before the completion in dio_complete() invalidate the pages in question. This ensures that after the completion the pages in the written area are either unmapped, or populated with up-to-date data. Also do the same for the iomap case which uses iomap_dio_complete() instead. This has a side effect of deferring the completion to a process context for every AIO DIO that happens on inode that has pages mapped. However since the consensus is that this is ill-advised practice the performance implication should not be a problem. This was based on proposal from Jeff Moyer, thanks! Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nvmet: implement valid sqhd values in completions	James Smart
	To support sqhd, for initiators that are following the spec and paying attention to sqhd vs their sqtail values: - add sqhd to struct nvmet_sq - initialize sqhd to 0 in nvmet_sq_setup - rather than propagate the 0's-based qsize value from the connect message which requires a +1 in every sqhd update, and as nothing else references it, convert to 1's-based value in nvmt_sq/cq_setup() calls. - validate connect message sqsize being non-zero per spec. - updated assign sqhd for every completion that goes back. Also remove handling the NULL sq case in __nvmet_req_complete, as it can't happen with the current code. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nvme-fabrics: Allow 0 as KATO value	Guilherme G. Piccoli
	Currently, driver code allows user to set 0 as KATO (Keep Alive TimeOut), but this is not being respected. This patch enforces the expected behavior. Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nvme: allow timed-out ios to retry	James Smart
	Currently the nvme_req_needs_retry() applies several checks to see if a retry is allowed. On of those is whether the current time has exceeded the start time of the io plus the timeout length. This check, if an io times out, means there is never a retry allowed for the io. Which means applications see the io failure. Remove this check and allow the io to timeout, like it does on other protocols, and retries to be made. On the FC transport, a frame can be lost for an individual io, and there may be no other errors that escalate for the connection/association. The io will timeout, which causes the transport to escalate into creating a new association, but the io that timed out, due to this retry logic, has already failed back to the application and things are hosed. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nvme: stop aer posting if controller state not live	James Smart
	If an nvme async_event command completes, in most cases, a new async event is posted. However, if the controller enters a resetting or reconnecting state, there is nothing to block the scheduled work element from posting the async event again. Nor are there calls from the transport to stop async events when an association dies. In the case of FC, where the association is torn down, the aer must be aborted on the FC link and completes through the normal job completion path. Thus the terminated async event ends up being rescheduled even though the controller isn't in a valid state for the aer, and the reposting gets the transport into a partially torn down data structure. It's possible to hit the scenario on rdma, although much less likely due to an aer completing right as the association is terminated and as the association teardown reclaims the blk requests via nvme_cancel_request() so its immediate, not a link-related action like on FC. Fix by putting controller state checks in both the async event completion routine where it schedules the async event and in the async event work routine before it calls into the transport. It's effectively a "stop_async_events()" behavior. The transport, when it creates a new association with the subsystem will transition the state back to live and is already restarting the async event posting. Signed-off-by: James Smart <james.smart@broadcom.com> [hch: remove taking a lock over reading the controller state] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nvme-pci: Print invalid SGL only once	Keith Busch
	The WARN_ONCE macro returns true if the condition is true, not if the warn was raised, so we're printing the scatter list every time it's invalid. This is excessive and makes debugging harder, so this patch prints it just once. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nvme-pci: initialize queue memory before interrupts	Keith Busch
	A spurious interrupt before the nvme driver has initialized the completion queue may inadvertently cause the driver to believe it has a completion to process. This may result in a NULL dereference since the nvmeq's tags are not set at this point. The patch initializes the host's CQ memory so that a spurious interrupt isn't mistaken for a real completion. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nvmet-fc: fix failing max io queue connections	James Smart
	fc transport is treating NVMET_NR_QUEUES as maximum queue count, e.g. admin queue plus NVMET_NR_QUEUES-1 io queues. But NVMET_NR_QUEUES is the number of io queues, so maximum queue count is really NVMET_NR_QUEUES+1. Fix the handling in the target fc transport Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nvme-fc: use transport-specific sgl format	James Smart
	Sync with NVM Express spec change and FC-NVME 1.18. FC transport sets SGL type to Transport SGL Data Block Descriptor and subtype to transport-specific value 0x0A. Removed the warn-on's on the PRP fields. They are unneeded. They were to check for values from the upper layer that weren't set right, and for the most part were fine. But, with Async events, which reuse the same structure and 2nd time issued the SGL overlay converted them to the Transport SGL values - the warn-on's were errantly firing. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nvme: add transport SGL definitions	James Smart
	Add transport SGL defintions from NVMe TP 4008, required for the final NVMe-FC standard. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nvme.h: remove FC transport-specific error values	James Smart
	The NVM express group recinded the reserved range for the transport. Remove the FC-centric values that had been defined. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	qla2xxx: remove use of FC-specific error codes	James Smart
	The qla2xxx driver uses the FC-specific error when it needed to return an error to the FC-NVME transport. Convert to use a generic value instead. Signed-off-by: James Smart <james.smart@broadcom.com> Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	lpfc: remove use of FC-specific error codes	James Smart
	The lpfc driver uses the FC-specific error when it needed to return an error to the FC-NVME transport. Convert to use a generic value instead. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nvmet-fcloop: remove use of FC-specific error codes	James Smart
	The FC-NVME transport loopback test module used the FC-specific error codes in cases where it emulated a transport abort case. Instead of using the FC-specific values, now use a generic value (NVME_SC_INTERNAL). Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nvmet-fc: remove use of FC-specific error codes	James Smart
	The FC-NVME target transport used the FC-specific error codes in return codes when the transport or lldd failed. Instead of using the FC-specific values, now use a generic value (NVME_SC_INTERNAL). Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nvme-fc: remove use of FC-specific error codes	James Smart
	The FC-NVME transport used the FC-specific error codes in cases where it had to fabricate an error to go back up stack. Instead of using the FC-specific values, now use a generic value (NVME_SC_INTERNAL). Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	loop: remove union of use_aio and ref in struct loop_cmd	Omar Sandoval
	When the request is completed, lo_complete_rq() checks cmd->use_aio. However, if this is in fact an aio request, cmd->use_aio will have already been reused as cmd->ref by lo_rw_aio*. Fix it by not using a union. On x86_64, there's a hole after the union anyways, so this doesn't make struct loop_cmd any bigger. Fixes: 92d773324b7e ("block/loop: fix use after free") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	blktrace: Fix potential deadlock between delete & sysfs ops	Waiman Long
	The lockdep code had reported the following unsafe locking scenario: CPU0 CPU1 ---- ---- lock(s_active#228); lock(&bdev->bd_mutex/1); lock(s_active#228); lock(&bdev->bd_mutex); * DEADLOCK * The deadlock may happen when one task (CPU1) is trying to delete a partition in a block device and another task (CPU0) is accessing tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that partition. The s_active isn't an actual lock. It is a reference count (kn->count) on the sysfs (kernfs) file. Removal of a sysfs file, however, require a wait until all the references are gone. The reference count is treated like a rwsem using lockdep instrumentation code. The fact that a thread is in the sysfs callback method or in the ioctl call means there is a reference to the opended sysfs or device file. That should prevent the underlying block structure from being removed. Instead of using bd_mutex in the block_device structure, a new blk_trace_mutex is now added to the request_queue structure to protect access to the blk_trace structure. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Fix typo in patch subject line, and prune a comment detailing how the code used to work. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	nbd: ignore non-nbd ioctl's	Josef Bacik
	In testing we noticed that nbd would spew if you ran a fio job against the raw device itself. This is because fio calls a block device specific ioctl, however the block layer will first pass this back to the driver ioctl handler in case the driver wants to do something special. Since the device was setup using netlink this caused us to spew every time fio called this ioctl. Since we don't have special handling, just error out for any non-nbd specific ioctl's that come in. This fixes the spew. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	bsg-lib: don't free job in bsg_prepare_job	Christoph Hellwig
	The job structure is allocated as part of the request, so we should not free it in the error path of bsg_prepare_job. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	brd: fix overflow in __brd_direct_access	Mikulas Patocka
	The code in __brd_direct_access multiplies the pgoff variable by page size and divides it by 512. It can cause overflow on 32-bit architectures. The overflow happens if we create ramdisk larger than 4G and use it as a sparse device. This patch replaces multiplication and division with multiplication by the number of sectors per page. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 1647b9b959c7 ("brd: add dax_operations support") Cc: stable@vger.kernel.org # 4.12+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25	genirq: Check __free_irq() return value for NULL	Alexandru Moise
	__free_irq() can return a NULL irqaction for example when trying to free already-free IRQ, but the callsite unconditionally dereferences the returned pointer. Fix this by adding a check and return NULL. Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20170919200412.GA29985@gmail.com
2017-09-25	futex: Fix pi_state->owner serialization	Peter Zijlstra
	There was a reported suspicion about a race between exit_pi_state_list() and put_pi_state(). The same report mentioned the comment with put_pi_state() said it should be called with hb->lock held, and it no longer is in all places. As it turns out, the pi_state->owner serialization is indeed broken. As per the new rules: 734009e96d19 ("futex: Change locking rules") pi_state->owner should be serialized by pi_state->pi_mutex.wait_lock. For the sites setting pi_state->owner we already hold wait_lock (where required) but exit_pi_state_list() and put_pi_state() were not and raced on clearing it. Fixes: 734009e96d19 ("futex: Change locking rules") Reported-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dvhart@infradead.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20170922154806.jd3ffltfk24m4o4y@hirez.programming.kicks-ass.net
2017-09-25	KEYS: use kmemdup() in request_key_auth_new()	Eric Biggers
	kmemdup() is preferred to kmalloc() followed by memcpy(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25	KEYS: restrict /proc/keys by credentials at open time	Eric Biggers
	When checking for permission to view keys whilst reading from /proc/keys, we should use the credentials with which the /proc/keys file was opened. This is because, in a classic type of exploit, it can be possible to bypass checks for the current credentials by passing the file descriptor to a suid program. Following commit 34dbbcdbf633 ("Make file credentials available to the seqfile interfaces") we can finally fix it. So let's do it. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25	KEYS: reset parent each time before searching key_user_tree	Eric Biggers
	In key_user_lookup(), if there is no key_user for the given uid, we drop key_user_lock, allocate a new key_user, and search the tree again. But we failed to set 'parent' to NULL at the beginning of the second search. If the tree were to be empty for the second search, the insertion would be done with an invalid 'parent', scribbling over freed memory. Fortunately this can't actually happen currently because the tree always contains at least the root_key_user. But it still should be fixed to make the code more robust. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25	KEYS: prevent KEYCTL_READ on negative key	Eric Biggers
	Because keyctl_read_key() looks up the key with no permissions requested, it may find a negatively instantiated key. If the key is also possessed, we went ahead and called ->read() on the key. But the key payload will actually contain the ->reject_error rather than the normal payload. Thus, the kernel oopses trying to read the user_key_payload from memory address (int)-ENOKEY = 0x00000000ffffff82. Fortunately the payload data is stored inline, so it shouldn't be possible to abuse this as an arbitrary memory read primitive... Reproducer: keyctl new_session keyctl request2 user desc '' @s keyctl read $(keyctl show \| awk '/user: desc/ {print $1}') It causes a crash like the following: BUG: unable to handle kernel paging request at 00000000ffffff92 IP: user_read+0x33/0xa0 PGD 36a54067 P4D 36a54067 PUD 0 Oops: 0000 [#1] SMP CPU: 0 PID: 211 Comm: keyctl Not tainted 4.14.0-rc1 #337 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 task: ffff90aa3b74c3c0 task.stack: ffff9878c0478000 RIP: 0010:user_read+0x33/0xa0 RSP: 0018:ffff9878c047bee8 EFLAGS: 00010246 RAX: 0000000000000001 RBX: ffff90aa3d7da340 RCX: 0000000000000017 RDX: 0000000000000000 RSI: 00000000ffffff82 RDI: ffff90aa3d7da340 RBP: ffff9878c047bf00 R08: 00000024f95da94f R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f58ece69740(0000) GS:ffff90aa3e200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000ffffff92 CR3: 0000000036adc001 CR4: 00000000003606f0 Call Trace: keyctl_read_key+0xac/0xe0 SyS_keyctl+0x99/0x120 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x7f58ec787bb9 RSP: 002b:00007ffc8d401678 EFLAGS: 00000206 ORIG_RAX: 00000000000000fa RAX: ffffffffffffffda RBX: 00007ffc8d402800 RCX: 00007f58ec787bb9 RDX: 0000000000000000 RSI: 00000000174a63ac RDI: 000000000000000b RBP: 0000000000000004 R08: 00007ffc8d402809 R09: 0000000000000020 R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffc8d402800 R13: 00007ffc8d4016e0 R14: 0000000000000000 R15: 0000000000000000 Code: e5 41 55 49 89 f5 41 54 49 89 d4 53 48 89 fb e8 a4 b4 ad ff 85 c0 74 09 80 3d b9 4c 96 00 00 74 43 48 8b b3 20 01 00 00 4d 85 ed <0f> b7 5e 10 74 29 4d 85 e4 74 24 4c 39 e3 4c 89 e2 4c 89 ef 48 RIP: user_read+0x33/0xa0 RSP: ffff9878c047bee8 CR2: 00000000ffffff92 Fixes: 61ea0c0ba904 ("KEYS: Skip key state checks when checking for possession") Cc: <stable@vger.kernel.org> [v3.13+] Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25	KEYS: prevent creating a different user's keyrings	Eric Biggers
	It was possible for an unprivileged user to create the user and user session keyrings for another user. For example: sudo -u '#3000' sh -c 'keyctl add keyring _uid.4000 "" @u keyctl add keyring _uid_ses.4000 "" @u sleep 15' & sleep 1 sudo -u '#4000' keyctl describe @u sudo -u '#4000' keyctl describe @us This is problematic because these "fake" keyrings won't have the right permissions. In particular, the user who created them first will own them and will have full access to them via the possessor permissions, which can be used to compromise the security of a user's keys: -4: alswrv-----v------------ 3000 0 keyring: _uid.4000 -5: alswrv-----v------------ 3000 0 keyring: _uid_ses.4000 Fix it by marking user and user session keyrings with a flag KEY_FLAG_UID_KEYRING. Then, when searching for a user or user session keyring by name, skip all keyrings that don't have the flag set. Fixes: 69664cf16af4 ("keys: don't generate user and user session keyrings unless they're accessed") Cc: <stable@vger.kernel.org> [v2.6.26+] Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25	KEYS: fix writing past end of user-supplied buffer in keyring_read()	Eric Biggers
	Userspace can call keyctl_read() on a keyring to get the list of IDs of keys in the keyring. But if the user-supplied buffer is too small, the kernel would write the full list anyway --- which will corrupt whatever userspace memory happened to be past the end of the buffer. Fix it by only filling the space that is available. Fixes: b2a4df200d57 ("KEYS: Expand the capacity of a keyring") Cc: <stable@vger.kernel.org> [v3.13+] Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25	KEYS: fix key refcount leak in keyctl_read_key()	Eric Biggers
	In keyctl_read_key(), if key_permission() were to return an error code other than EACCES, we would leak a the reference to the key. This can't actually happen currently because key_permission() can only return an error code other than EACCES if security_key_permission() does, only SELinux and Smack implement that hook, and neither can return an error code other than EACCES. But it should still be fixed, as it is a bug waiting to happen. Fixes: 29db91906340 ("[PATCH] Keys: Add LSM hooks for key management [try #3]") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25	KEYS: fix key refcount leak in keyctl_assume_authority()	Eric Biggers
	In keyctl_assume_authority(), if keyctl_change_reqkey_auth() were to fail, we would leak the reference to the 'authkey'. Currently this can only happen if prepare_creds() fails to allocate memory. But it still should be fixed, as it is a more severe bug waiting to happen. This patch also moves the read of 'authkey->serial' to before the reference to the authkey is dropped. Doing the read after dropping the reference is very fragile because it assumes we still hold another reference to the key. (Which we do, in current->cred->request_key_auth, but there's no reason not to write it in the "obviously correct" way.) Fixes: d84f4f992cbd ("CRED: Inaugurate COW credentials") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25	KEYS: don't revoke uninstantiated key in request_key_auth_new()	Eric Biggers
	If key_instantiate_and_link() were to fail (which fortunately isn't possible currently), the call to key_revoke(authkey) would crash with a NULL pointer dereference in request_key_auth_revoke() because the key has not yet been instantiated. Fix this by removing the call to key_revoke(). key_put() is sufficient, as it's not possible for an uninstantiated authkey to have been used for anything yet. Fixes: b5f545c880a2 ("[PATCH] keys: Permit running process to instantiate keys") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25	KEYS: fix cred refcount leak in request_key_auth_new()	Eric Biggers
	In request_key_auth_new(), if key_alloc() or key_instantiate_and_link() were to fail, we would leak a reference to the 'struct cred'. Currently this can only happen if key_alloc() fails to allocate memory. But it still should be fixed, as it is a more severe bug waiting to happen. Fix it by cleaning things up to use a helper function which frees a 'struct request_key_auth' correctly. Fixes: d84f4f992cbd ("CRED: Inaugurate COW credentials") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25	perf evsel: Fix attr.exclude_kernel setting for default cycles:p	Arnaldo Carvalho de Melo
	Yet another fix for probing the max attr.precise_ip setting: it is not enough settting attr.exclude_kernel for !root users, as they _can_ profile the kernel if the kernel.perf_event_paranoid sysctl is set to -1, so check that as well. Testing it: As non root: $ sysctl kernel.perf_event_paranoid kernel.perf_event_paranoid = 2 $ perf record sleep 1 $ perf evlist -v cycles:uppp: ..., exclude_kernel: 1, ... precise_ip: 3, ... Now as non-root, but with kernel.perf_event_paranoid set set to the most permissive value, -1: $ sysctl kernel.perf_event_paranoid kernel.perf_event_paranoid = -1 $ perf record sleep 1 $ perf evlist -v cycles:ppp: ..., exclude_kernel: 0, ... precise_ip: 3, ... $ I.e. non-root, default kernel.perf_event_paranoid: :uppp modifier = not allowed to sample the kernel, non-root, most permissible kernel.perf_event_paranoid: :ppp = allowed to sample the kernel. In both cases, use the highest available precision: attr.precise_ip = 3. Reported-and-Tested-by: Ingo Molnar <mingo@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Fixes: d37a36979077 ("perf evsel: Fix attr.exclude_kernel setting for default cycles:p") Link: http://lkml.kernel.org/n/tip-nj2qkf75xsd6pw6hhjzfqqdx@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-09-25	tools include: Sync kernel ABI headers with tooling headers	Ingo Molnar
	Time for a sync with ABI/uapi headers with the upcoming v4.14 kernel. None of the ABI changes require any source code level changes to our existing in-kernel tooling code: - tools/arch/s390/include/uapi/asm/kvm.h: New KVM_S390_VM_TOD_EXT ABI, not used by in-kernel tooling. - tools/arch/x86/include/asm/cpufeatures.h: tools/arch/x86/include/asm/disabled-features.h: New PCID, SME and VGIF x86 CPU feature bits defined. - tools/include/asm-generic/hugetlb_encode.h: tools/include/uapi/asm-generic/mman-common.h: tools/include/uapi/linux/mman.h: Two new madvise() flags, plus a hugetlb system call mmap flags restructuring/extension changes. - tools/include/uapi/drm/drm.h: tools/include/uapi/drm/i915_drm.h: New drm_syncobj_create flags definitions, new drm_syncobj_wait and drm_syncobj_array ABIs. DRM_I915_PERF_* calls and a new I915_PARAM_HAS_EXEC_FENCE_ARRAY ABI for the Intel driver. - tools/include/uapi/linux/bpf.h: New bpf_sock fields (::mark and ::priority), new XDP_REDIRECT action, new kvm_ppc_smmu_info fields (::data_keys, instr_keys) Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Milian Wolff <milian.wolff@kdab.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Taeung Song <treeze.taeung@gmail.com> Cc: Wang Nan <wangnan0@huawei.com> Cc: Yao Jin <yao.jin@linux.intel.com> Link: http://lkml.kernel.org/r/20170913073823.lxmi4c7ejqlfabjx@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-09-25	perf tools: Get all of tools/{arch,include}/ in the MANIFEST	Arnaldo Carvalho de Melo
	Now that I'm switching the container builds from using a local volume pointing to the kernel repository with the perf sources, instead getting a detached tarball to be able to use a container cluster, some places broke because I forgot to put some of the required files in tools/perf/MANIFEST, namely some bitsperlong.h files. So, to fix it do the same as for tools/build/ and pack the whole tools/arch/ directory. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-wmenpjfjsobwdnfde30qqncj@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-09-25	arch: change default endian for microblaze	Babu Moger
	Fix the default for microblaze. Michal Simek mentioned default for microblaze should be CPU_LITTLE_ENDIAN. Fixes : commit 206d3642d8ee ("arch/microblaze: add choice for endianness and update Makefile") Signed-off-by: Babu Moger <babu.moger@oracle.com> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: Michal Simek <michal.simek@xilinx.com>
2017-09-25	microblaze: Cocci spatch "vma_pages"	Thomas Meyer
	Use vma_pages function on vma object instead of explicit computation. Found by coccinelle spatch "api/vma_pages.cocci" Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Michal Simek <michal.simek@xilinx.com>
2017-09-25	microblaze: Add missing kvm_para.h to Kbuild	Michal Simek
	Running make allmodconfig;make is throwing compilation error: CC kernel/watchdog.o In file included from ./include/linux/kvm_para.h:4:0, from kernel/watchdog.c:29: ./include/uapi/linux/kvm_para.h:32:26: fatal error: asm/kvm_para.h: No such file or directory #include <asm/kvm_para.h> ^ compilation terminated. make[1]: * [kernel/watchdog.o] Error 1 make: * [kernel/watchdog.o] Error 2 Reported-by: Michal Hocko <mhocko@kernel.org> Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Michal Simek <michal.simek@xilinx.com> Fixes: 83f0124ad81e87b ("microblaze: remove asm-generic wrapper headers") Reviewed-by: Tobias Klauser <tklauser@distanz.ch> Tested-by: Michal Hocko <mhocko@suse.com>
2017-09-25	x86/timers: Move simple_udelay_calibration() past kvmclock_init()	Boris Ostrovsky
	simple_udelay_calibration() relies on x86_platform's calibration ops. For KVM these ops are set late in setup_arch() and so simple_udelay_calibration() ends up using native version. Besides being possibly incorrect, this significantly increases kernel boot time. For example, on my laptop executing start_kernel() by a guest takes ~10 times more than when KVM's ops are used. Since early_xdbc_setup_hardware() relies on calibration having been performed move it too. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: baolu.lu@linux.intel.com Link: https://lkml.kernel.org/r/20170911185111.20636-1-boris.ostrovsky@oracle.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-09-25	x86/timers: Make recalibrate_cpu_khz() void	Dou Liyang
	recalibrate_cpu_khz() is called from powernow K7 and Pentium 4/Xeon CPU freq driver. It recalibrates cpu frequency in case of SMP = n and doesn't need to return anything. Mark it void, also remove the #else branch. Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1500003247-17368-2-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25	x86/timers: Move the simple udelay calibration to tsc.h	Dou Liyang
	Commit dd759d93f4dd ("x86/timers: Add simple udelay calibration") adds an static function in x86 boot-time initializations. But, this function is actually related to TSC, so it should be maintained in tsc.c, not in setup.c. Move simple_udelay_calibration() from setup.c to tsc.c and rename it to tsc_early_delay_calibrate for more readability. Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1500003247-17368-1-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25	x86/apic: Use lapic_is_integrated() consistently	Dou Liyang
	lapic_is_integrated() is a wrapper around APIC_INTEGRATED(), but not used consistently. Replace the direct usage of APIC_INTEGRATED() and fixup a hard to read tail comment. No functional change. [ tglx: Made it compile and work .... ] Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: bhe@redhat.com Link: https://lkml.kernel.org/r/1504774161-7137-2-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25	x86/apic: Remove duplicate X86_64 conditional in lapic_is_integrated()	Dou Liyang
	The macro APIC_INTEGRATED(x) is already wrapped by CONFIG_X86_32. So it can be invoked unconditionally. Remove the extra "#ifdef CONFIG_X86_64...". No functional change. Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: bhe@redhat.com Link: https://lkml.kernel.org/r/1504774161-7137-1-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25	x86/apic: Remove init_bsp_APIC()	Dou Liyang
	init_bsp_APIC() which works for the virtual wire mode is used in ISA irq initialization at boot time. With the new APIC interrupt delivery mode scheme, which initializes the APIC before the first interrupt is expected, init_bsp_APIC() is not longer required and can be removed. Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: yinghai@kernel.org Cc: bhe@redhat.com Link: https://lkml.kernel.org/r/1505293975-26005-13-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25	x86/apic: Initialize interrupt mode after timer init	Dou Liyang
	A cold or warm boot through BIOS sets the APIC in default interrupt delivery mode. A dump-capture kernel will not go through a BIOS reset and leave the interrupt delivery mode in the state which was active on the crashed kernel, but the dump kernel startup code assumes default delivery mode which can result in interrupt delivery/handling to fail. To solve this problem, it's required to set up the final interrupt delivery mode as soon as possible. As IOAPIC setup needs the timer initialized for verifying the timer interrupt delivery mode, the earliest point is right after timer setup in late_time_init(). That results in the following init order: 1) Set up the legacy timer, if applicable on the platform 2) Set up APIC/IOAPIC which includes the verification of the legacy timer interrupt delivery. 3) TSC calibration 4) Local APIC timer setup Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: yinghai@kernel.org Cc: bhe@redhat.com Link: https://lkml.kernel.org/r/1505293975-26005-12-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25	x86/init: Add intr_mode_init to x86_init_ops	Dou Liyang
	X86 and XEN initialize interrupt delivery mode in different way. To avoid conditionals, add a new x86_init_ops function which defaults to the standard function and can be overridden by the early XEN platform code. [ tglx: Folded the XEN part which was a separate patch to preserve bisectability ] Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: yinghai@kernel.org Cc: bhe@redhat.com Link: https://lkml.kernel.org/r/1505293975-26005-10-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25	x86/ioapic: Refactor the delay logic in timer_irq_works()	Dou Liyang
	timer_irq_works() is used to detects the timer IRQs. It calls mdelay(10) to delay ten ticks and check whether the timer IRQ work or not. mdelay() depends on the loops_per_jiffy which is set up in calibrate_delay(), but the delay calibration depends on a working timer interrupt, which causes a chicken and egg problem. The correct solution is to set up the interrupt mode and making sure that the timer interrupt is delivered correctly before invoking calibrate_delay(). That means that mdelay() cannot be used in timer_irq_works(). Provide helper functions to make a rough delay estimate which is good enough to prove that the timer interrupt is working. Either use TSC or a simple delay loop and assume that 4GHz is the maximum CPU frequency to base the delay calculation on. Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: yinghai@kernel.org Cc: bhe@redhat.com Link: https://lkml.kernel.org/r/1505293975-26005-9-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25	x86/apic: Unify interrupt mode setup for UP system	Dou Liyang
	In UniProcessor kernel with UP_LATE_INIT=y, the interrupt delivery mode is initialized in up_late_init(). Use the new unified apic_intr_mode_init() function and remove APIC_init_uniprocessor(). Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: yinghai@kernel.org Cc: bhe@redhat.com Link: https://lkml.kernel.org/r/1505293975-26005-8-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25	x86/apic: Mark the apic_intr_mode extern for sanity check cleanup	Dou Liyang
	Calling native_smp_prepare_cpus() to prepare for SMP bootup, does some sanity checking, enables APIC mode and disables SMP feature. Now, APIC mode setup has been unified to apic_intr_mode_init(), some sanity checks are redundant and need to be cleanup. Mark the apic_intr_mode extern to refine the switch and remove the redundant sanity check. Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: yinghai@kernel.org Cc: bhe@redhat.com Link: https://lkml.kernel.org/r/1505293975-26005-7-git-send-email-douly.fnst@cn.fujitsu.com