summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-03-16afs: security: Replace rcu_assign_pointer() with RCU_INIT_POINTER()Andreea-Cristina Bernat
The use of "rcu_assign_pointer()" is NULLing out the pointer. According to RCU_INIT_POINTER()'s block comment: "1. This use of RCU_INIT_POINTER() is NULLing out the pointer" it is better to use it instead of rcu_assign_pointer() because it has a smaller overhead. The following Coccinelle semantic patch was used: @@ @@ - rcu_assign_pointer + RCU_INIT_POINTER (..., NULL) Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-03-16afs: inode: Replace rcu_assign_pointer() with RCU_INIT_POINTER()Andreea-Cristina Bernat
The use of "rcu_assign_pointer()" is NULLing out the pointer. According to RCU_INIT_POINTER()'s block comment: "1. This use of RCU_INIT_POINTER() is NULLing out the pointer" it is better to use it instead of rcu_assign_pointer() because it has a smaller overhead. The following Coccinelle semantic patch was used: @@ @@ - rcu_assign_pointer + RCU_INIT_POINTER (..., NULL) Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-03-16afs: Distinguish mountpoints from symlinks by file mode aloneDavid Howells
In AFS, mountpoints appear as symlinks with mode 0644 and normal symlinks have mode 0777, so use this to distinguish them rather than reading the content and parsing it. In the case of a mountpoint, the symlink body is a formatted string indicating the location of the target volume. Note that with this, kAFS no longer 'pre-fetches' the contents of symlinks, so afs_readpage() may fail with an access-denial because when the VFS calls d_automount(), it wraps the call in an credentials override that sets the initial creds - thereby preventing access to the caller's keyrings and the authentication keys held therein. To this end, a patch reverting that change to the VFS is required also. Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-03-16afs: Flush outstanding writes when an fd is closedDavid Howells
Flush outstanding writes in afs when an fd is closed. This is what NFS and CIFS do. Reported-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-03-16afs: Handle a short write to an AFS pageDavid Howells
Handle the situation where afs_write_begin() is told to expect that a full-page write will be made, but this doesn't happen (EFAULT, CTRL-C, etc.), and so afs_write_end() sees a partial write took place. Currently, no attempt is to deal with the discrepency. Fix this by loading the gap from the server. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: David Howells <dhowells@redhat.com>
2017-03-16afs: Kill struct afs_read::pg_offsetDavid Howells
Kill struct afs_read::pg_offset as nothing uses it. It's unnecessary as pos can be masked off. Signed-off-by: David Howells <dhowells@redhat.com>
2017-03-16afs: Handle better the server returning excess or short dataDavid Howells
When an AFS server is given an FS.FetchData{,64} request to read data from a file, it is permitted by the protocol to return more or less than was requested. kafs currently relies on the latter behaviour in readpage{,s} to handle a partial page at the end of the file (we just ask for a whole page and clear space beyond the short read). However, we don't handle all cases. Add: (1) Handle excess data by discarding it rather than aborting. Note that we use a common static buffer to discard into so that the decryption algorithm advances the PCBC state. (2) Handle a short read that affects more than just the last page. Note that if a read comes up unexpectedly short of long, it's possible that the server's copy of the file changed - in which case the data version number will have been incremented and the callback will have been broken - in which case all the pages currently attached to the inode will be zapped anyway at some point. Signed-off-by: David Howells <dhowells@redhat.com>
2017-03-16afs: Deal with an empty callback arrayMarc Dionne
Servers may send a callback array that is the same size as the FID array, or an empty array. If the callback count is 0, the code would attempt to read (fid_count * 12) bytes of data, which would fail and result in an unmarshalling error. This would lead to stale data for remotely modified files or directories. Store the callback array size in the internal afs_call structure and use that to determine the amount of data to read. Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
2017-03-16afs: Adjust mode bits processingMarc Dionne
Mode bits for an afs file should not be enforced in the usual way. For files, the absence of user bits can restrict file access with respect to what is granted by the server. These bits apply regardless of the owner or the current uid; the rest of the mode bits (group, other) are ignored. Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-03-16afs: Populate group ID from vnode statusMarc Dionne
The group was hard coded to GLOBAL_ROOT_GID; use the group ID that was received from the server. Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2017-03-16afs: Fix page overput in afs_fill_page()David Howells
afs_fill_page() loads the page it wants to fill into the afs_read request without incrementing its refcount - but then calls afs_put_read() to clean up afterwards, which then releases a ref on the page. Fix this by getting a ref on the page before calling afs_vnode_fetch_data(). This causes sync after a write to hang in afs_writepages_region() because find_get_pages_tag() gets confused and doesn't return. Fixes: 196ee9cd2d04 ("afs: Make afs_fs_fetch_data() take a list of pages") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Marc Dionne <marc.dionne@auristor.com>
2017-03-16afs: Fix missing put_page()David Howells
In afs_writepages_region(), inside the loop where we find dirty pages to deal with, one of the if-statements is missing a put_page(). Signed-off-by: David Howells <dhowells@redhat.com>
2017-03-16mmc: ushc: fix NULL-deref at probeJohan Hovold
Make sure to check the number of endpoints to avoid dereferencing a NULL-pointer should a malicious device lack endpoints. Fixes: 53f3a9e26ed5 ("mmc: USB SD Host Controller (USHC) driver") Cc: stable <stable@vger.kernel.org> # 2.6.37 Cc: David Vrabel <david.vrabel@csr.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-03-16mmc: sdhci-of-at91: Support external regulatorsRomain Izard
The SDHCI controller in the SAMA5D2 chip requires a valid voltage set in the power control register, otherwise commands will fail with a timeout error. When using the regulator framework to specify the regulator used by the mmc device, the voltage is not configured, and it is not possible to use the connected device. Implement a custom 'set_power' function for this specific hardware, that configures the voltage in the register in all cases. Signed-off-by: Romain Izard <romain.izard.pro@gmail.com> Acked-by: Ludovic Desroches <ludovic.desroches@microchip.com> Cc: stable@vger.kernel.org #4.9+ Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-03-16mmc: core: mmc_blk_rw_cmd_err - remove unused variableWinkler, Tomas
Fix compilation warning: drivers/mmc/core/block.c:1563:24: warning: variable ‘mq_rq’ set but not used [-Wunused-but-set-variable] struct mmc_queue_req *mq_rq; Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-03-16drm/amdgpu: fix the clearing wb sizeHuang Rui
The clearing wb size should be the one that it is assigned. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-03-16drm/amdgpu: reinstate oland workaround for sclkAlex Deucher
Higher sclks seem to be unstable on some boards. bug: https://bugs.freedesktop.org/show_bug.cgi?id=100222 Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2017-03-16drm/radeon: reinstate oland workaround for sclkAlex Deucher
Higher sclks seem to be unstable on some boards. bug: https://bugs.freedesktop.org/show_bug.cgi?id=100222 Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2017-03-16perf/core: Better explain the inherit magicPeter Zijlstra
While going through the event inheritance code Oleg got confused. Add some comments to better explain the silent dissapearance of orphaned events. So what happens is that at perf_event_release_kernel() time; when an event looses its connection to userspace (and ceases to exist from the user's perspective) we can still have an arbitrary amount of inherited copies of the event. We want to synchronously find and remove all these child events. Since that requires a bit of lock juggling, there is the possibility that concurrent clone()s will create new child events. Therefore we first mark the parent event as DEAD, which marks all the extant child events as orphaned. We then avoid copying orphaned events; in order to avoid getting more of them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: fweisbec@gmail.com Link: http://lkml.kernel.org/r/20170316125823.289567442@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16perf/core: Simplify perf_event_free_task()Peter Zijlstra
We have ctx->event_list that contains all events; no need to repeatedly iterate the group lists to find them all. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: fweisbec@gmail.com Link: http://lkml.kernel.org/r/20170316125823.239678244@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16perf/core: Fix event inheritance on fork()Peter Zijlstra
While hunting for clues to a use-after-free, Oleg spotted that perf_event_init_context() can loose an error value with the result that fork() can succeed even though we did not fully inherit the perf event context. Spotted-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: oleg@redhat.com Cc: stable@vger.kernel.org Fixes: 889ff0150661 ("perf/core: Split context's event group list into pinned and non-pinned lists") Link: http://lkml.kernel.org/r/20170316125823.190342547@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16perf/core: Fix use-after-free in perf_release()Peter Zijlstra
Dmitry reported syzcaller tripped a use-after-free in perf_release(). After much puzzlement Oleg spotted the below scenario: Task1 Task2 fork() perf_event_init_task() /* ... */ goto bad_fork_$foo; /* ... */ perf_event_free_task() mutex_lock(ctx->lock) perf_free_event(B) perf_event_release_kernel(A) mutex_lock(A->child_mutex) list_for_each_entry(child, ...) { /* child == B */ ctx = B->ctx; get_ctx(ctx); mutex_unlock(A->child_mutex); mutex_lock(A->child_mutex) list_del_init(B->child_list) mutex_unlock(A->child_mutex) /* ... */ mutex_unlock(ctx->lock); put_ctx() /* >0 */ free_task(); mutex_lock(ctx->lock); mutex_lock(A->child_mutex); /* ... */ mutex_unlock(A->child_mutex); mutex_unlock(ctx->lock) put_ctx() /* 0 */ ctx->task && !TOMBSTONE put_task_struct() /* UAF */ This patch closes the hole by making perf_event_free_task() destroy the task <-> ctx relation such that perf_event_release_kernel() will no longer observe the now dead task. Spotted-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: fweisbec@gmail.com Cc: oleg@redhat.com Cc: stable@vger.kernel.org Fixes: c6e5b73242d2 ("perf: Synchronously clean up child events") Link: http://lkml.kernel.org/r/20170314155949.GE32474@worktop Link: http://lkml.kernel.org/r/20170316125823.140295131@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16mmc: mediatek: Fixed bug where clock frequency could be set wrongyong mao
This patch can fix two issues: Issue 1: In previous code, div may be overflow when setting clock frequency as f_min. We can use DIV_ROUND_UP to fix this boundary related issue. Issue 2: In previous code, we can not set the correct clock frequency when div equals 0xff. Signed-off-by: Yong Mao <yong.mao@mediatek.com> Signed-off-by: Chaotian Jing <chaotian.jing@mediatek.com> Reviewed-by: Daniel Kurtz <djkurtz@chromium.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-03-16powerpc: Wire up statx() syscallChandan Rajendra
Test runs on a ppc64 BE guest succeeded. linux/samples/statx/test-statx program was executed on the following file types, 1. Regular file 2. Directory 3. device file 4. symlink 5. Named pipe The test run also included invoking test-statx with the runtime options provided in the main() function of test-statx.c Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-16hwrng: geode - Revert managed API changesPrarit Bhargava
After commit e9afc746299d ("hwrng: geode - Use linux/io.h instead of asm/io.h") the geode-rng driver uses devres with pci_dev->dev to keep track of resources, but does not actually register a PCI driver. This results in the following issues: 1. The driver leaks memory because the driver does not attach to a device. The driver only uses the PCI device as a reference. devm_*() functions will release resources on driver detach, which the geode-rng driver will never do. As a result, 2. The driver cannot be reloaded because there is always a use of the ioport and region after the first load of the driver. Revert the changes made by e9afc746299d ("hwrng: geode - Use linux/io.h instead of asm/io.h"). Cc: <stable@vger.kernel.org> Signed-off-by: Prarit Bhargava <prarit@redhat.com> Fixes: 6e9b5e76882c ("hwrng: geode - Migrate to managed API") Cc: Matt Mackall <mpm@selenic.com> Cc: Corentin LABBE <clabbe.montjoie@gmail.com> Cc: PrasannaKumar Muralidharan <prasannatsmkumar@gmail.com> Cc: Wei Yongjun <weiyongjun1@huawei.com> Cc: linux-crypto@vger.kernel.org Cc: linux-geode@lists.infradead.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-03-16hwrng: amd - Revert managed API changesPrarit Bhargava
After commit 31b2a73c9c5f ("hwrng: amd - Migrate to managed API"), the amd-rng driver uses devres with pci_dev->dev to keep track of resources, but does not actually register a PCI driver. This results in the following issues: 1. The message WARNING: CPU: 2 PID: 621 at drivers/base/dd.c:349 driver_probe_device+0x38c is output when the i2c_amd756 driver loads and attempts to register a PCI driver. The PCI & device subsystems assume that no resources have been registered for the device, and the WARN_ON() triggers since amd-rng has already do so. 2. The driver leaks memory because the driver does not attach to a device. The driver only uses the PCI device as a reference. devm_*() functions will release resources on driver detach, which the amd-rng driver will never do. As a result, 3. The driver cannot be reloaded because there is always a use of the ioport and region after the first load of the driver. Revert the changes made by 31b2a73c9c5f ("hwrng: amd - Migrate to managed API"). Cc: <stable@vger.kernel.org> Signed-off-by: Prarit Bhargava <prarit@redhat.com> Fixes: 31b2a73c9c5f ("hwrng: amd - Migrate to managed API"). Cc: Matt Mackall <mpm@selenic.com> Cc: Corentin LABBE <clabbe.montjoie@gmail.com> Cc: PrasannaKumar Muralidharan <prasannatsmkumar@gmail.com> Cc: Wei Yongjun <weiyongjun1@huawei.com> Cc: linux-crypto@vger.kernel.org Cc: linux-geode@lists.infradead.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-03-16crypto: ccp - Assign DMA commands to the channel's CCPGary R Hook
The CCP driver generally uses a round-robin approach when assigning operations to available CCPs. For the DMA engine, however, the DMA mappings of the SGs are associated with a specific CCP. When an IOMMU is enabled, the IOMMU is programmed based on this specific device. If the DMA operations are not performed by that specific CCP then addressing errors and I/O page faults will occur. Update the CCP driver to allow a specific CCP device to be requested for an operation and use this in the DMA engine support. Cc: <stable@vger.kernel.org> # 4.9.x- Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-03-16nl80211: fix dumpit error path RTNL deadlocksJohannes Berg
Sowmini pointed out Dmitry's RTNL deadlock report to me, and it turns out to be perfectly accurate - there are various error paths that miss unlock of the RTNL. To fix those, change the locking a bit to not be conditional in all those nl80211_prepare_*_dump() functions, but make those require the RTNL to start with, and fix the buggy error paths. This also let me use sparse (by appropriately overriding the rtnl_lock/rtnl_unlock functions) to validate the changes. Cc: stable@vger.kernel.org Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-16sched/deadline: Use deadline instead of period when calculating overflowSteven Rostedt (VMware)
I was testing Daniel's changes with his test case, and tweaked it a little. Instead of having the runtime equal to the deadline, I increased the deadline ten fold. Daniel's test case had: attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */ attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */ attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */ To make it more interesting, I changed it to: attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */ attr.sched_deadline = 20 * 1000 * 1000; /* 20 ms */ attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */ The results were rather surprising. The behavior that Daniel's patch was fixing came back. The task started using much more than .1% of the CPU. More like 20%. Looking into this I found that it was due to the dl_entity_overflow() constantly returning true. That's because it uses the relative period against relative runtime vs the absolute deadline against absolute runtime. runtime / (deadline - t) > dl_runtime / dl_period There's even a comment mentioning this, and saying that when relative deadline equals relative period, that the equation is the same as using deadline instead of period. That comment is backwards! What we really want is: runtime / (deadline - t) > dl_runtime / dl_deadline We care about if the runtime can make its deadline, not its period. And then we can say "when the deadline equals the period, the equation is the same as using dl_period instead of dl_deadline". After correcting this, now when the task gets enqueued, it can throttle correctly, and Daniel's fix to the throttling of sleeping deadline tasks works even when the runtime and deadline are not the same. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16sched/deadline: Throttle a constrained deadline task activated after the ↵Daniel Bristot de Oliveira
deadline During the activation, CBS checks if it can reuse the current task's runtime and period. If the deadline of the task is in the past, CBS cannot use the runtime, and so it replenishes the task. This rule works fine for implicit deadline tasks (deadline == period), and the CBS was designed for implicit deadline tasks. However, a task with constrained deadline (deadine < period) might be awakened after the deadline, but before the next period. In this case, replenishing the task would allow it to run for runtime / deadline. As in this case deadline < period, CBS enables a task to run for more than the runtime / period. In a very loaded system, this can cause a domino effect, making other tasks miss their deadlines. To avoid this problem, in the activation of a constrained deadline task after the deadline but before the next period, throttle the task and set the replenishing timer to the begin of the next period, unless it is boosted. Reproducer: --------------- %< --------------- int main (int argc, char **argv) { int ret; int flags = 0; unsigned long l = 0; struct timespec ts; struct sched_attr attr; memset(&attr, 0, sizeof(attr)); attr.size = sizeof(attr); attr.sched_policy = SCHED_DEADLINE; attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */ attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */ attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */ ts.tv_sec = 0; ts.tv_nsec = 2000 * 1000; /* 2 ms */ ret = sched_setattr(0, &attr, flags); if (ret < 0) { perror("sched_setattr"); exit(-1); } for(;;) { /* XXX: you may need to adjust the loop */ for (l = 0; l < 150000; l++); /* * The ideia is to go to sleep right before the deadline * and then wake up before the next period to receive * a new replenishment. */ nanosleep(&ts, NULL); } exit(0); } --------------- >% --------------- On my box, this reproducer uses almost 50% of the CPU time, which is obviously wrong for a task with 2/2000 reservation. Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16sched/deadline: Make sure the replenishment timer fires in the next periodDaniel Bristot de Oliveira
Currently, the replenishment timer is set to fire at the deadline of a task. Although that works for implicit deadline tasks because the deadline is equals to the begin of the next period, that is not correct for constrained deadline tasks (deadline < period). For instance: f.c: --------------- %< --------------- int main (void) { for(;;); } --------------- >% --------------- # gcc -o f f.c # trace-cmd record -e sched:sched_switch \ -e syscalls:sys_exit_sched_setattr \ chrt -d --sched-runtime 490000000 \ --sched-deadline 500000000 \ --sched-period 1000000000 0 ./f # trace-cmd report | grep "{pid of ./f}" After setting parameters, the task is replenished and continue running until being throttled: f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0 The task is throttled after running 492318 ms, as expected: f-11295 [003] 13322.606094: sched_switch: f:11295 [-1] R ==> watchdog/3:32 [0] But then, the task is replenished 500719 ms after the first replenishment: <idle>-0 [003] 13322.614495: sched_switch: swapper/3:0 [120] R ==> f:11295 [-1] Running for 490277 ms: f-11295 [003] 13323.104772: sched_switch: f:11295 [-1] R ==> swapper/3:0 [120] Hence, in the first period, the task runs 2 * runtime, and that is a bug. During the first replenishment, the next deadline is set one period away. So the runtime / period starts to be respected. However, as the second replenishment took place in the wrong instant, the next replenishment will also be held in a wrong instant of time. Rather than occurring in the nth period away from the first activation, it is taking place in the (nth period - relative deadline). Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=yNiklas Cassel
We hang if SIGKILL has been sent, but the task is stuck in down_read() (after do_exit()), even though no task is doing down_write() on the rwsem in question: INFO: task libupnp:21868 blocked for more than 120 seconds. libupnp D 0 21868 1 0x08100008 ... Call Trace: __schedule() schedule() __down_read() do_exit() do_group_exit() __wake_up_parent() This bug has already been fixed for CONFIG_RWSEM_XCHGADD_ALGORITHM=y in the following commit: 04cafed7fc19 ("locking/rwsem: Fix down_write_killable()") ... however, this bug also exists for CONFIG_RWSEM_GENERIC_SPINLOCK=y. Signed-off-by: Niklas Cassel <niklas.cassel@axis.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <mhocko@suse.com> Cc: <stable@vger.kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Niklas Cassel <niklass@axis.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: d47996082f52 ("locking/rwsem: Introduce basis for down_write_killable()") Link: http://lkml.kernel.org/r/1487981873-12649-1-git-send-email-niklass@axis.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16sched/loadavg: Use {READ,WRITE}_ONCE() for sample windowMatt Fleming
'calc_load_update' is accessed without any kind of locking and there's a clear assumption in the code that only a single value is read or written. Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid unintentionally seeing multiple values, or having the load/stores split. Technically the loads in calc_global_*() don't require this since those are the only functions that update 'calc_load_update', but I've added the READ_ONCE() for consistency. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accountingMatt Fleming
If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to the pending sample window time on exit, setting the next update not one window into the future, but two. This situation on exiting NO_HZ is described by: this_rq->calc_load_update < jiffies < calc_load_update In this scenario, what we should be doing is: this_rq->calc_load_update = calc_load_update [ next window ] But what we actually do is: this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ] This has the effect of delaying load average updates for potentially up to ~9seconds. This can result in huge spikes in the load average values due to per-cpu uninterruptible task counts being out of sync when accumulated across all CPUs. It's safe to update the per-cpu active count if we wake between sample windows because any load that we left in 'calc_load_idle' will have been zero'd when the idle load was folded in calc_global_load(). This issue is easy to reproduce before, commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking") just by forking short-lived process pipelines built from ps(1) and grep(1) in a loop. I'm unable to reproduce the spikes after that commit, but the bug still seems to be present from code review. Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again") Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16sched/deadline: Add missing update_rq_clock() in dl_task_timer()Wanpeng Li
The following warning can be triggered by hot-unplugging the CPU on which an active SCHED_DEADLINE task is running on: ------------[ cut here ]------------ WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40 rq->clock_update_flags < RQCF_ACT_SKIP CPU: 7 PID: 0 Comm: swapper/7 Tainted: G B 4.11.0-rc1+ #24 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016 Call Trace: <IRQ> dump_stack+0x85/0xc4 __warn+0x172/0x1b0 warn_slowpath_fmt+0xb4/0xf0 ? __warn+0x1b0/0x1b0 ? debug_check_no_locks_freed+0x2c0/0x2c0 ? cpudl_set+0x3d/0x2b0 replenish_dl_entity+0x71e/0xc40 enqueue_task_dl+0x2ea/0x12e0 ? dl_task_timer+0x777/0x990 ? __hrtimer_run_queues+0x270/0xa50 dl_task_timer+0x316/0x990 ? enqueue_task_dl+0x12e0/0x12e0 ? enqueue_task_dl+0x12e0/0x12e0 __hrtimer_run_queues+0x270/0xa50 ? hrtimer_cancel+0x20/0x20 ? hrtimer_interrupt+0x119/0x600 hrtimer_interrupt+0x19c/0x600 ? trace_hardirqs_off+0xd/0x10 local_apic_timer_interrupt+0x74/0xe0 smp_apic_timer_interrupt+0x76/0xa0 apic_timer_interrupt+0x93/0xa0 The DL task will be migrated to a suitable later deadline rq once the DL timer fires and currnet rq is offline. The rq clock of the new rq should be updated. This patch fixes it by updating the rq clock after holding the new rq's rq lock. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16x86/mpx: Make unnecessarily global function staticTobias Klauser
Make the function get_user_bd_entry() static as it is not used outside of arch/x86/mm/mpx.c This fixes a sparse warning. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16MAINTAINERS: update for mwifiex driver maintainersAmitkumar Karwar
Ganapathi & Xinming are starting to take a more active role in the mwifiex driver maintainership here onwards on account of organizational changes. CC: Xinming Hu <huxm@marvell.com> CC: Ganapathi Bhat <gbhat@marvell.com> Signed-off-by: Amitkumar Karwar <akarwar@marvell.com> Signed-off-by: Nishant Sarmukadam <nishants@marvell.com> Signed-off-by: Cathy Luo <cluo@marvell.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2017-03-16mwifiex: uninit wakeup info when removing deviceBrian Norris
We manually init wakeup info, but we don't detach it on device removal. This means that if we (for example) rmmod + modprobe the driver, the device framework might return -EEXIST the second time, and we'll complain in the logs: [ 839.311881] mwifiex_pcie 0000:01:00.0: fail to init wakeup for mwifiex AFAICT, there's no other negative effect. But we can fix this by disabling wakeup on remove, similar to what a few other drivers do (e.g., the power supply framework). This code (and bug) has existed on SDIO for a while, but it got moved around and enabled for PCIe with commit 853402a00823 ("mwifiex: Enable WoWLAN for both sdio and pcie"). Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2017-03-16mwifiex: set adapter->dev before starting to use mwifiex_dbg()Brian Norris
The mwifiex_dbg() log handler utilizes the struct device in adapter->dev. Without it, it decides not to print anything. As of commit 2e02b5814217 ("mwifiex: Allow mwifiex early access to device structure"), we started assigning that pointer only after we finished mwifiex_register() -- this effectively neuters any mwifiex_dbg() logging done before this point. Let's move the device assignment into mwifiex_register(). Fixes: 2e02b5814217 ("mwifiex: Allow mwifiex early access to device structure") Cc: Rajat Jain <rajatja@google.com> Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2017-03-16mwifiex: pcie: don't leak DMA buffers when removingBrian Norris
When PCIe FLR support was added, much of the remove/release code for PCIe was migrated to ->down_dev(), but ->down_dev() is never called for device removal. Let's refactor the cleanup to be done in both cases. Also, drop the comments above mwifiex_cleanup_pcie(), because they were clearly wrong, and it's better to have clear and obvious code than to detail the code steps in comments anyway. Fixes: 4c5dae59d2e9 ("mwifiex: add PCIe function level reset support") Cc: <stable@vger.kernel.org> Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2017-03-16drm/i915: Do .init_clock_gating() earlier to avoid it clobbering watermarksVille Syrjälä
Currently ILK-BDW explicitly disable LP1+ watermarks from their .init_clock_gating() hooks. Unfortunately that hook gets called way too late since by that time we've already initialized all the watermark state tracking which then gets out of sync with the hardware state. We may eventually want to consider killing off the explicit LP1+ disable from .init_clock_gating(). In the meantime however, we can avoid the problem by reordering the init sequence such that intel_modeset_init_hw()->intel_init_clock_gating() gets called prior to the hardware state takeover. I suppose prior to the two stage watermark programming we were magically saved by something that forced the watermarks to be reprogrammed fully after .init_clock_gating() got called. But now that no longer happens. Note that the diff might look a bit odd as it kills off one call of intel_update_cdclk(), but that's fine because intel_modeset_init_hw() does the exact same thing. Previously we just did it twice. Actually even this new init sequence is pretty bogus as .init_clock_gating() really should be called before any gem hardware init since it can configure various clock gating workarounds and whatnot that affect the GT side as well. Also intel_modeset_init() really should get split up into better defined init stages. Another "fun" detail is that intel_modeset_gem_init() is where RPS/RC6 gets configured. Why that is done from the display code is beyond me. I've decided to leave all this be for now, and just try to fix the init sequence enough for watermarks to work. Cc: stable@vger.kernel.org Cc: Gabriele Mazzotta <gabriele.mzt@gmail.com> Cc: David Purton <dcpurton@marshwiggle.net> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reported-by: Gabriele Mazzotta <gabriele.mzt@gmail.com> Reported-by: David Purton <dcpurton@marshwiggle.net> Tested-by: Gabriele Mazzotta <gabriele.mzt@gmail.com> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96645 Fixes: ed4a6a7ca853 ("drm/i915: Add two-stage ILK-style watermark programming (v11)") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170220140443.30891-1-ville.syrjala@linux.intel.com Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: http://patchwork.freedesktop.org/patch/msgid/20170315143158.31780-1-ville.syrjala@linux.intel.com (cherry picked from commit 5be6e33400992d3450e1c8234a5af353e1560580) Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2017-03-16iwlwifi: mvm: cleanup pending frames in DQA modeSara Sharon
When a station is asleep, the fw will set it as "asleep". All queues that are used only by one station will be stopped by the fw. In pre-DQA mode this was relevant for aggregation queues. However, in DQA mode a queue is owned by one station only, so all queues will be stopped. As a result, we don't expect to get filtered frames back to mac80211 and don't have to maintain the entire pending_frames state logic, the same way as we do in aggregations. The correct behavior is to align DQA behavior with the aggregation queue behaviour pre-DQA: - Don't count pending frames. - Let mac80211 know we have frames in these queues so that it can properly handle trigger frames. When a trigger frame is received, mac80211 tells the driver to send frames from the queues using release_buffered_frames. The driver will tell the fw to let frames out even if the station is asleep. This is done by iwl_mvm_sta_modify_sleep_tx_count. Reported-and-tested-by: Jens Axboe <axboe@kernel.dk> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2017-03-16Merge branch 'drm-fixes-4.11' of git://people.freedesktop.org/~agd5f/linux ↵Dave Airlie
into drm-fixes A few amd fixes. * 'drm-fixes-4.11' of git://people.freedesktop.org/~agd5f/linux: drm/amd/amdgpu: Fix debugfs reg read/write address width drm/amdgpu/si: add dpm quirk for Oland drm/radeon/si: add dpm quirk for Oland drm: amd: remove broken include path drm/amd/powerplay: fix copy error in smu7_clockpoweragting.c drm/amdgpu: fix parser init error path to avoid crash in parser fini drm/amd/amdgpu: Disable GFX_PG on Carrizo until compute issues solved
2017-03-15Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Four small fixes for this cycle: - followup fix from Neil for a fix that went in before -rc2, ensuring that we always see the full per-task bio_list. - fix for blk-mq-sched from me that ensures that we retain similar direct-to-issue behavior on running the queue. - fix from Sagi fixing a potential NULL pointer dereference in blk-mq on spurious CPU unplug. - a memory leak fix in writeback from Tahsin, fixing a case where device removal of a mounted device can leak a struct wb_writeback_work" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly() writeback: fix memory leak in wb_queue_work() blk-mq: Fix tagset reinit in the presence of cpu hot-unplug blk: Ensure users for current->bio_list can see the full list.
2017-03-16cpufreq: Fix and clean up show_cpuinfo_cur_freq()Rafael J. Wysocki
There is a missing newline in show_cpuinfo_cur_freq(), so add it, but while at it clean that function up somewhat too. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: All applicable <stable@vger.kernel.org>
2017-03-15net: properly release sk_frag.pageEric Dumazet
I mistakenly added the code to release sk->sk_frag in sk_common_release() instead of sk_destruct() TCP sockets using sk->sk_allocation == GFP_ATOMIC do no call sk_common_release() at close time, thus leaking one (order-3) page. iSCSI is using such sockets. Fixes: 5640f7685831 ("net: use a per task frag allocator") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15amd-xgbe: Fix jumbo MTU processing on newer hardwareLendacky, Thomas
Newer hardware does not provide a cumulative payload length when multiple descriptors are needed to handle the data. Once the MTU increases beyond the size that can be handled by a single descriptor, the SKB does not get built properly by the driver. The driver will now calculate the size of the data buffers used by the hardware. The first buffer of the first descriptor is for packet headers or packet headers and data when the headers can't be split. Subsequent descriptors in a multi-descriptor chain will not use the first buffer. The second buffer is used by all the descriptors in the chain for payload data. Based on whether the driver is processing the first, intermediate, or last descriptor it can calculate the buffer usage and build the SKB properly. Tested and verified on both old and new hardware. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15net: bcmgenet: Do not suspend PHY if Wake-on-LAN is enabledFlorian Fainelli
Suspending the PHY would be putting it in a low power state where it may no longer allow us to do Wake-on-LAN. Fixes: cc013fb48898 ("net: bcmgenet: correctly suspend and resume PHY device") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15MAINTAINERS: remove MACVLAN and VLAN entriesPablo Neira
macvlan.c file seems to be both in VLAN and MACVLAN DRIVER, so remove the MACVLAN DRIVER since this is redundant. I propose with this patch to remove the VLAN (802.1Q) entry so this just falls into the NETWORKING [GENERAL]. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, a rather large batch of fixes targeted to nf_tables, conntrack and bridge netfilter. More specifically, they are: 1) Don't track fragmented packets if the socket option IP_NODEFRAG is set. From Florian Westphal. 2) SCTP protocol tracker assumes that ICMP error messages contain the checksum field, what results in packet drops. From Ying Xue. 3) Fix inconsistent handling of AH traffic from nf_tables. 4) Fix new bitmap set representation with big endian. Fix mismatches in nf_tables due to incorrect big endian handling too. Both patches from Liping Zhang. 5) Bridge netfilter doesn't honor maximum fragment size field, cap to largest fragment seen. From Florian Westphal. 6) Fake conntrack entry needs to be aligned to 8 bytes since the 3 LSB bits are now used to store the ctinfo. From Steven Rostedt. 7) Fix element comments with the bitmap set type. Revert the flush field in the nft_set_iter structure, not required anymore after fixing up element comments. 8) Missing error on invalid conntrack direction from nft_ct, also from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>