2020-05-26io_uring: call statx directlyBijan Mottahedeh
Calling statx directly both simplifies the interface and avoids potential incompatibilities between sync and async invocations. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26statx: allow system call to be invoked from io_uringBijan Mottahedeh
This is a preparatory patch to allow io_uring to invoke statx directly. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26io_uring: add io_statx structureBijan Mottahedeh
Separate statx data from open in io_kiocb. No functional changes. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26Merge branch 'net-phy-mscc-miim-reduce-waiting-time-between-MDIO-transactions'David S. Miller
Antoine Tenart says: ==================== net: phy: mscc-miim: reduce waiting time between MDIO transactions This series aims at reducing the waiting time between MDIO transactions when using the MSCC MIIM MDIO controller. I'm not sure we need patch 4/4 and we could reasonably drop it from the series. I'm including the patch as it could help ensure the system is functional with a non-optimal configuration. We needed to improve the driver's performance: when using a PHY that requires lots of register accesses (such as the VSC85xx family), the delays would add up and end up being quite large, causing issues such as slow PHY initialization and problems when using timestamping operations (this feature will be sent to the mailing lists quite soon). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26net: phy: mscc-miim: read poll when high resolution timers are disabledAntoine Tenart
The driver uses a read polling mechanism to check the status of the MDIO bus, to know if it is ready to accept the next command. This polling mechanism uses usleep_range() under the hood between reads, which is fine as long as high resolution timers are enabled. Otherwise the delays end up being much longer than expected. This patch fixes this by using udelay() under the hood when CONFIG_HIGH_RES_TIMERS isn't enabled. This increases CPU usage. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26net: phy: mscc-miim: improve waiting logicAntoine Tenart
The MSCC MIIM MDIO driver uses a waiting logic to wait for the MDIO bus to be ready to accept the next command. It does so by polling the BUSY status bit, which indicates the MDIO bus has completed all pending operations. This can take time, and the controller supports writing the next command as soon as there are no pending commands (which happens while the MDIO bus is busy completing its current command). This patch implements this improved logic by adding a helper to poll the PENDING status bit, and by adjusting where we should wait for the bus to be not busy or not pending. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26net: phy: mscc-miim: remove redundant timeout checkAntoine Tenart
readl_poll_timeout already returns -ETIMEDOUT if the condition isn't satisfied, so there's no need to check the condition again after calling it. Remove the redundant timeout check. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
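A minimal sketch of what such a status-poll helper boils down to, assuming placeholder register offsets, bit names and a trimmed-down private struct (none of these are necessarily the driver's real definitions):

    #include <linux/bits.h>
    #include <linux/iopoll.h>
    #include <linux/phy.h>

    struct mscc_miim_priv {                 /* hypothetical private data */
            void __iomem *regs;
    };

    #define MIIM_STATUS             0x0     /* placeholder register offset */
    #define MIIM_STATUS_BUSY        BIT(3)  /* placeholder bit */

    static int miim_wait_ready(struct mii_bus *bus)
    {
            struct mscc_miim_priv *miim = bus->priv;
            u32 val;

            /* readl_poll_timeout() already reports a timeout as -ETIMEDOUT,
             * so its return value is passed through with no extra check. */
            return readl_poll_timeout(miim->regs + MIIM_STATUS, val,
                                      !(val & MIIM_STATUS_BUSY),
                                      50, 10000);
    }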
2020-05-26net: phy: mscc-miim: use more reasonable delaysAntoine Tenart
The MSCC MIIM MDIO driver uses delays to read poll a status register. I ran multiple tests on an Ocelot PCS120 platform which led me to reduce those delays. The delay between polls, during which the polling function is allowed to sleep, is reduced from 100us to 50us, which in almost all cases is a good value for the poll to succeed at the first retry. The overall timeout is also lowered, as the prior value was really way too high; 10000us is large enough. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26net: mdiobus: add clause 45 mdiobus accessorsRussell King
There is a recurring pattern throughout some of the PHY code converting a devad and regnum to our packed clause 45 representation. Rather than having this scattered around the code, let's put a common translation function in mdio.h, and provide some register accessors. Convert the phylib core, phylink, bcm87xx and cortina to use these. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
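As a rough sketch of the packed representation being centralised here (MII_ADDR_C45 is the existing clause-45 marker bit from include/linux/mdio.h; the helper and wrapper names below are illustrative, not the exact ones added by the patch):

    #include <linux/mdio.h>
    #include <linux/phy.h>

    /* Pack a clause-45 device address and register number into the single
     * 32-bit "regnum" understood by mdiobus_read()/mdiobus_write(). */
    static inline u32 my_c45_addr(int devad, u16 regnum)
    {
            return MII_ADDR_C45 | (devad << 16) | regnum;
    }

    static inline int my_c45_read(struct mii_bus *bus, int addr,
                                  int devad, u16 regnum)
    {
            return mdiobus_read(bus, addr, my_c45_addr(devad, regnum));
    }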
2020-05-26Merge branch 'flow-mpls'David S. Miller
Guillaume Nault says: ==================== flow_dissector, cls_flower: Add support for multiple MPLS Label Stack Entries Currently, the flow dissector and the Flower classifier can only handle the first entry of an MPLS label stack. This patch series generalises the code to allow parsing and matching the Label Stack Entries that follow. Patch 1 extends the flow dissector to parse MPLS LSEs until the Bottom Of Stack bit is reached. The number of parsed LSEs is capped at FLOW_DIS_MPLS_MAX (arbitrarily set to 7). Flower and the NFP driver are updated to take into account the new layout of struct flow_dissector_key_mpls. Patch 2 extends Flower. It defines new netlink attributes, which are independent of the previous MPLS ones. Mixing the old and the new attributes in the same filter is not allowed. For backward compatibility, the old attributes are used when dumping filters that don't require the new ones. Changes since v2: * Fix compilation with the new MLX5 bareudp tunnel code. Changes since v1: * Fix compilation of NFP driver (kbuild test robot). * Fix sparse warning with entropy label (kbuild test robot). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26cls_flower: Support filtering on multiple MPLS Label Stack EntriesGuillaume Nault
With struct flow_dissector_key_mpls now recording the first FLOW_DIS_MPLS_MAX labels, we can extend Flower to filter on any of these LSEs independently. In order to avoid creating new netlink attributes for every possible depth, let's define a new TCA_FLOWER_KEY_MPLS_OPTS nested attribute that contains the list of LSEs to match. Each LSE is represented by another attribute, TCA_FLOWER_KEY_MPLS_OPTS_LSE, which then contains the attributes representing the depth and the MPLS fields to match at this depth (label, TTL, etc.). For each MPLS field, the mask is always set to all-ones, as this is what the original API did. We could allow user-configurable masks in the future if there is demand for more flexibility. The new API also allows specifying only an LSE depth. In that case, Flower only verifies that the MPLS label stack depth is greater than or equal to the provided depth (that is, an LSE exists at this depth). Filters that only match on one (or more) fields of the first LSE are dumped using the old netlink attributes, to avoid confusing user space programs that don't understand the new API. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
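As a hedged illustration of the nesting described above (not the actual Flower code: policy validation and error handling are omitted, and only the two attribute names quoted in the message are taken from the patch):

    #include <net/netlink.h>
    #include <linux/pkt_cls.h>

    /* Walk the nested TCA_FLOWER_KEY_MPLS_OPTS attribute: each child is a
     * TCA_FLOWER_KEY_MPLS_OPTS_LSE, which in turn nests the depth and the
     * MPLS fields (label, TC, BOS, TTL) to match at that depth. */
    static void walk_mpls_opts(const struct nlattr *opts)
    {
            struct nlattr *lse;
            int rem;

            nla_for_each_nested(lse, opts, rem) {
                    if (nla_type(lse) != TCA_FLOWER_KEY_MPLS_OPTS_LSE)
                            continue;
                    /* parse the per-LSE attributes here */
            }
    }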
2020-05-26flow_dissector: Parse multiple MPLS Label Stack EntriesGuillaume Nault
The current MPLS dissector only parses the first MPLS Label Stack Entry (the second LSE can be parsed too, but only to set a key_id). This patch adds the possibility to parse several LSEs by making __skb_flow_dissect_mpls() return FLOW_DISSECT_RET_PROTO_AGAIN as long as the Bottom Of Stack bit hasn't been seen, up to a maximum of FLOW_DIS_MPLS_MAX entries. FLOW_DIS_MPLS_MAX is arbitrarily set to 7. This should be enough for many practical purposes, without wasting too much space. To record the parsed values, flow_dissector_key_mpls is modified to store an array of stack entries, instead of just the values of the first one. A bit field, "used_lses", is also added to keep track of the LSEs that have been set. The objective is to avoid defining a new FLOW_DISSECTOR_KEY_MPLS_XX for each level of the MPLS stack. TC flower is adapted for the new struct flow_dissector_key_mpls layout. Matching on several MPLS Label Stack Entries will be added in the next patch. The NFP and MLX5 drivers are also adapted: nfp_flower_compile_mac() and mlx5's parse_tunnel() now verify that the rule only uses the first LSE and fail if it doesn't. Finally, the behaviour of the FLOW_DISSECTOR_KEY_MPLS_ENTROPY key is slightly modified. Instead of recording the first Entropy Label, it now records the last one. This shouldn't have any consequences since there doesn't seem to be any user of FLOW_DISSECTOR_KEY_MPLS_ENTROPY in the tree. We'd probably be better off hashing all parsed MPLS labels (excluding reserved labels) instead anyway. That'd give better entropy and would probably also simplify the code. But that's not the purpose of this patch, so I'm keeping that as a possible future improvement. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
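The reworked key layout described above looks roughly like this (a sketch; include/net/flow_dissector.h has the authoritative definition):

    #include <linux/types.h>

    #define FLOW_DIS_MPLS_MAX 7

    /* One parsed MPLS Label Stack Entry. */
    struct flow_dissector_mpls_lse {
            u32     mpls_ttl:8,
                    mpls_bos:1,
                    mpls_tc:3,
                    mpls_label:20;
    };

    struct flow_dissector_key_mpls {
            /* All parsed LSEs, outermost first. */
            struct flow_dissector_mpls_lse ls[FLOW_DIS_MPLS_MAX];
            /* Bit N set => the LSE at depth N+1 has been parsed. */
            u8 used_lses;
    };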
2020-05-26Merge tag 'batadv-next-for-davem-20200526' of git://git.open-mesh.org/linux-mergeDavid S. Miller
Simon Wunderlich says: ==================== This cleanup patchset includes the following patches: - Fix revert dynamic lockdep key changes for batman-adv, by Sven Eckelmann - use rcu_replace_pointer() where appropriate, by Antonio Quartulli - Revert "disable ethtool link speed detection when auto negotiation off", by Sven Eckelmann ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26Merge branch 'tipc-add-some-improvements'David S. Miller
Tuong Lien says: ==================== tipc: add some improvements This series adds some improvements to TIPC. The first patch improves TIPC broadcast performance with the 'Gap ACK blocks' mechanism, similar to what was done for unicast before, while the others add support for tracing & statistics for broadcast links, and an alternative way to carry broadcast retransmissions via unicast, which might be useful in some cases. Besides, with the last patch, the Nagle algorithm can now automatically 'adjust' itself depending on the specific network conditions a stream connection runs in. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26tipc: add test for Nagle algorithm effectivenessTuong Lien
When streaming in Nagle mode, we try to bundle as many small messages from the user as possible if there is one outstanding buffer, i.e. one not yet ACK-ed by the receiving side, which helps boost the overall throughput. So, the algorithm's effectiveness really depends on when the Nagle ACK comes and on the specific network latency (RTT), compared to the user's message sending rate. In a bad case, where the user's sending rate is low or the network latency is small, there will not be many bundles, so making a Nagle ACK or waiting for it is not meaningful. For example: a user sends its messages every 100ms and the RTT is 50ms; then for each message we require one Nagle ACK, but only a single user message is sent each time, without any bundling. In a better case, even if we have a few bundles (e.g. the RTT = 300ms), but the user now sends medium-sized messages, there will not be any difference at all; that is, 3 x 1000-byte data messages, if bundled, will still result in 3 bundles with MTU = 1500. When Nagle is ineffective, the delay in user message sending is clearly wasted compared to sending directly. Besides, adding Nagle ACKs consumes some processor load on both the sending and receiving sides. This commit adds a test on the effectiveness of the Nagle algorithm for an individual connection in the network on which it actually runs. Particularly, upon receipt of a Nagle ACK we will compare the number of bundles in the backlog queue to the number of user messages which would be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle mode will be kept for further message sending. Otherwise, we will leave Nagle and put a 'penalty' on the connection, so it will have to spend more 'one-way' messages before being able to re-enter Nagle. In addition, the 'ack-required' bit is only set when really needed, so that the number of Nagle ACKs is reduced during Nagle mode. Testing with a benchmark showed that with the patch there was not much difference in throughput for small messages, since the tool continuously sends messages without a break, so Nagle still takes effect. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
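In rough pseudo-C, the decision taken on receipt of a Nagle ACK is along these lines (all struct and field names here are hypothetical placeholders, not the real tipc_sock fields):

    #include <linux/types.h>

    struct my_stream_sock {                 /* placeholder, not tipc_sock */
            bool nagle_active;
            unsigned int penalty;           /* extra one-way msgs required */
    };

    #define NAGLE_MIN_RATIO 2

    /* bundles: number of bundles in the backlog queue;
     * msgs: user messages that would have been sent directly without Nagle. */
    static void on_nagle_ack(struct my_stream_sock *sk,
                             unsigned int bundles, unsigned int msgs)
    {
            if (bundles && msgs / bundles >= NAGLE_MIN_RATIO) {
                    sk->nagle_active = true;        /* bundling pays off */
            } else {
                    sk->nagle_active = false;       /* send directly */
                    sk->penalty++;                  /* harder to re-enter */
            }
    }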
2020-05-26tipc: add support for broadcast rcv stats dumpingTuong Lien
This commit enables dumping the statistics of a broadcast-receiver link like the traditional 'broadcast-link' one (which is for the broadcast-sender). The link dumping can be triggered via netlink (e.g. the iproute2/tipc tool) using the link flag 'TIPC_NLA_LINK_BROADCAST' as the indicator. The name of a broadcast-receiver link of a specific peer will be in the format: 'broadcast-link:<peer-id>'. For example:

    Link <broadcast-link:1001002>
      Window:50 packets
      RX packets:7841 fragments:2408/440 bundles:0/0
      TX packets:0 fragments:0/0 bundles:0/0
      RX naks:0 defs:124 dups:0
      TX naks:21 acks:0 retrans:0
      Congestion link:0  Send queue max:0 avg:0

In addition, the broadcast-receiver link statistics can be reset in the usual way via netlink by specifying that link name in the command. Note: 'tipc_link_name_ext()' is removed because the link name can now be retrieved simply via 'l->name'. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26tipc: enable broadcast retrans via unicastTuong Lien
In some environments, broadcast traffic is suppressed at high rates (i.e. a kind of bandwidth limit setting). When it is applied, TIPC broadcast can still run successfully. However, when it comes to a high load, some packets will be dropped first and TIPC tries to retransmit them, but the packet retransmission is intentionally broadcast too, making things worse and not helping at all. This commit enables broadcast retransmission via unicast, which only retransmits packets to the specific peer that has really reported a gap, i.e. not broadcasting to all nodes in the cluster. This prevents the retransmissions from being suppressed, and also reduces some overhead on the other peers due to duplicates, finally improving the overall TIPC broadcast performance. Note: the functionality can be turned on/off via the sysctl file:

    echo 1 > /proc/sys/net/tipc/bc_retruni
    echo 0 > /proc/sys/net/tipc/bc_retruni

Default is '0', i.e. the broadcast retransmission still works as usual. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26tipc: add back link trace eventsTuong Lien
In the previous commit ("tipc: add Gap ACK blocks support for broadcast link"), we have removed the following link trace events due to the code changes: - tipc_link_bc_ack - tipc_link_retrans This commit adds them back along with some minor changes to adapt to the new code. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26tipc: introduce Gap ACK blocks for broadcast linkTuong Lien
As achieved through commit 9195948fbf34 ("tipc: improve TIPC throughput by Gap ACK blocks"), we apply the same mechanism for the broadcast link as well. The 'Gap ACK blocks' data field in a 'PROTOCOL/STATE_MSG' will consist of two parts, built for both the broadcast and unicast types:

     31                       16 15                        0
    +-------------+-------------+-------------+-------------+
    |  bgack_cnt  |  ugack_cnt  |            len            |
    +-------------+-------------+-------------+-------------+  -
    |            gap            |            ack            |   |
    +-------------+-------------+-------------+-------------+    > bc gacks
    :                           :                           :   |
    +-------------+-------------+-------------+-------------+  -
    |            gap            |            ack            |   |
    +-------------+-------------+-------------+-------------+    > uc gacks
    :                           :                           :   |
    +-------------+-------------+-------------+-------------+  -

which is "automatically" backward-compatible. We also increase the max number of Gap ACK blocks to 128, allowing up to 64 blocks per type (total buffer size = 516 bytes). Besides, the 'tipc_link_advance_transmq()' function is refactored so that it is now applicable for both the unicast and broadcast cases, so some old functions can be removed and the code is optimized. With the patch, TIPC broadcast is more robust against packet loss, disorder, latency, etc. in the underlying network, and its performance is boosted significantly. For example, an experiment with a 5% packet loss rate gives:

    $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
    real 0m 42.46s
    user 0m 1.16s
    sys 0m 17.67s

Without the patch:

    $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
    real 8m 27.94s
    user 0m 0.55s
    sys 0m 2.38s

Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26qed: Add EDPM mode type for user-fw compatibilityYuval Basson
In older FW versions the completion flag was treated as the ack flag in EDPM messages. Expose the FW option of setting which mode the QP is in by adding a flag to the qedr <-> qed API. The flag is added for backward compatibility with libqedr. This flag will be set by qedr after determining whether libqedr is using the updated version. Fixes: f10939403352 ("qed: Add support for QP verbs") Signed-off-by: Yuval Basson <yuval.bason@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26tcp: tcp_v4_err() icmp skb is named icmp_skbEric Dumazet
I missed the fact that tcp_v4_err() differs from tcp_v6_err(). After commit 4d1a2d9ec1c1 ("Rename skb to icmp_skb in tcp_v4_err()") the skb argument has been renamed to icmp_skb in only one of the two functions. I will reconcile these functions in a future patch to avoid this kind of confusion. Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26block/floppy: fix contended case in floppy_queue_rq()Jiri Kosina
Since the switch of the floppy driver to blk-mq, the contended (fdc_busy) case in floppy_queue_rq() is not handled correctly. In case we reach floppy_queue_rq() with fdc_busy set (i.e. with the floppy locked due to another request still being in flight), we put the request on the list of requests and return BLK_STS_OK to the block core, without actually scheduling delayed work / doing further processing of the request. This means that processing of this request is postponed until another request comes in and passes uncontended. Which in some cases might actually never happen, and we keep waiting indefinitely. The simple testcase is

    for i in `seq 1 2000`; do echo -en $i '\r'; blkid --info /dev/fd0 2> /dev/null; done

run in qemu. That reliably causes blkid to eventually hang indefinitely in __floppy_read_block_0() waiting for completion, as the BIO callback never happens, and no further IO is ever submitted on the (non-existent) floppy device. This was observed reliably on a qemu-emulated device. Fix that by not queuing the request in the contended case, and returning BLK_STS_RESOURCE instead, so that the blk core handles the request rescheduling and lets it pass properly, non-contended, later. Fixes: a9f38e1dec107a ("floppy: convert to blk-mq") Cc: stable@vger.kernel.org Tested-by: Libor Pechacek <lpechacek@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
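In blk-mq terms the fix boils down to the standard pattern sketched below (simplified and with placeholder names; not the exact floppy diff):

    #include <linux/blk-mq.h>
    #include <linux/bitops.h>

    static unsigned long my_fdc_busy;       /* stand-in for fdc_busy */

    static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                    const struct blk_mq_queue_data *bd)
    {
            /* Controller already busy: tell the block core to retry later
             * instead of queueing the request privately and returning OK. */
            if (test_and_set_bit(0, &my_fdc_busy))
                    return BLK_STS_RESOURCE;

            blk_mq_start_request(bd->rq);
            /* ... kick off the actual floppy work here ... */
            return BLK_STS_OK;
    }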
2020-05-26drm/amdgpu: fix device attribute node create failed with multi gpuKevin Wang
The original design used the variable "attr->states" to save the node states supported by the current GPU device. But with multiple GPU devices, when the second GPU device is probed, the driver checks the attribute node states of the previous GPU device to decide whether to create the attribute node. This causes attribute node creation to fail for the other GPU devices.

1. add member attr_list into amdgpu_device to link supported device attribute nodes.
2. add new structure "struct amdgpu_device_attr_entry{}" to track device attribute state.
3. drop member "states" from amdgpu_device_attr.

v2:
1. move "attr_list" into amdgpu_pm and rename it to "pm_attr_list".
2. refine the parameters of the create & remove device node functions.

fix: drm/amdgpu: optimize amdgpu device attribute code Signed-off-by: Kevin Wang <kevin1.wang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-26ACPI/IORT: Remove the unused __get_pci_rid()Zenghui Yu
Since commit bc8648d49a95 ("ACPI/IORT: Handle PCI aliases properly for IOMMUs"), __get_pci_rid() has become actually unused and can be removed. Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Hanjun Guo <guohanjun@huawei.com> Link: https://lore.kernel.org/r/20200509093430.1983-1-yuzenghui@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
2020-05-26io_uring: get rid of manual punting in io_closePavel Begunkov
io_close() was punting async manually to skip grabbing files. Use REQ_F_NO_FILE_TABLE instead, and pass it through the generic path with -EAGAIN. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26io_uring: separate DRAIN flushing into a cold pathPavel Begunkov
io_commit_cqring() assembly doesn't look good with extra code handling drained requests. IOSQE_IO_DRAIN is slow and its use in a hot path is discouraged, so try to minimise its impact by putting it into a helper and doing a fast check. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26io_uring: don't re-read sqe->off in timeout_prep()Pavel Begunkov
SQEs are user writable, don't read sqe->off twice in io_timeout_prep() Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26io_uring: simplify io_timeout lockingPavel Begunkov
Move spin_lock_irq() earlier to have only 1 call site of it in io_timeout(). It makes the flow easier. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26io_uring: fix flush req->refs underflowPavel Begunkov
In io_uring_cancel_files(), after refcount_sub_and_test() leaves 0 req->refs, it calls io_put_req(), which would also put a ref. Call io_free_req() instead. Cc: stable@vger.kernel.org Fixes: 2ca10259b418 ("io_uring: prune request from overflow list on flush") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26exec: Always set cap_ambient in cap_bprm_set_credsEric W. Biederman
An invariant of cap_bprm_set_creds is that every field in the new cred structure that cap_bprm_set_creds might set needs to be set every time, to ensure the fields do not get a stale value. The field cap_ambient is not set every time cap_bprm_set_creds is called, which means that if there is a suid or sgid script with an interpreter that has neither the suid nor the sgid bits set, the interpreter should be able to accept ambient credentials. Unfortunately, because cap_ambient is not reset to its original value, the interpreter cannot accept ambient credentials. Given that the ambient capability set is expected to be controlled by the caller, I don't think this is particularly serious. But it is definitely worth fixing so the code works correctly. I have tested to verify that my reading of the code is correct and that the interpreter of an sgid script can receive ambient capabilities with this change and cannot receive ambient capabilities without it. Cc: stable@vger.kernel.org Cc: Andy Lutomirski <luto@kernel.org> Fixes: 58319057b784 ("capabilities: ambient capabilities") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-26rcu: Provide rcu_irq_exit_check_preempt()Thomas Gleixner
Provide a debug check which can be invoked from exception return to kernel mode before an attempt is made to schedule. Warn if RCU is not ready for this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20200521202117.089709607@linutronix.de
2020-05-26rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()Paul E. McKenney
There will likely be exception handlers that can sleep, which rules out the usual approach of invoking rcu_nmi_enter() on entry and also rcu_nmi_exit() on all exit paths. However, the alternative approach of just not calling anything can prevent RCU from coaxing quiescent states from nohz_full CPUs that are looping in the kernel: RCU must instead IPI them explicitly. It would be better to enable the scheduler tick on such CPUs to interact with RCU in a lighter-weight manner, and this enabling is one of the things that rcu_nmi_enter() currently does. What is needed is something that helps RCU coax quiescent states while not preventing subsequent sleeps. This commit therefore splits out the nohz_full scheduler-tick enabling from the rest of the rcu_nmi_enter() logic into a new function named rcu_irq_enter_check_tick(). [ tglx: Renamed the function and made it a nop when context tracking is off ] [ mingo: Fixed a CONFIG_NO_HZ_FULL assumption, harmonized and fixed all the comment blocks and cleaned up rcu_nmi_enter()/exit() definitions. ] Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200521202116.996113173@linutronix.de
2020-05-26spi: spi-fsl-lpspi: Fix runtime PM imbalance on errorDinghao Liu
pm_runtime_get_sync() increments the runtime PM usage counter even when it returns an error code. Thus a pairing decrement is needed on the error handling path to keep the counter balanced. Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Link: https://lore.kernel.org/r/20200523133859.5625-1-dinghao.liu@zju.edu.cn Signed-off-by: Mark Brown <broonie@kernel.org>
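The usual shape of this kind of fix is the following (a generic sketch of the pattern, not the exact spi-fsl-lpspi hunk):

    #include <linux/pm_runtime.h>

    static int my_resume_and_touch_hw(struct device *dev)
    {
            int ret;

            ret = pm_runtime_get_sync(dev);
            if (ret < 0) {
                    /* get_sync() bumped the usage counter even on failure,
                     * so drop it again to keep the counter balanced. */
                    pm_runtime_put_noidle(dev);
                    return ret;
            }

            /* ... access the hardware ... */

            pm_runtime_put(dev);
            return 0;
    }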
2020-05-26spi: Remove note about transfer limit for spi_write_then_read()Mark Brown
Originally spi_write_then_read() used a fixed statically allocated buffer which limited the maximum message size it could handle. This restriction was removed a while ago so that we could dynamically allocate a buffer if required, but the kerneldoc was not updated to reflect this; do so. Reported-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20200525133120.57273-1-broonie@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
2020-05-26sched/fair: Don't NUMA balance for kthreadsJens Axboe
Stefano reported a crash when using SQPOLL with io_uring:

    BUG: kernel NULL pointer dereference, address: 00000000000003b0
    CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
    RIP: 0010:task_numa_work+0x4f/0x2c0
    Call Trace:
     task_work_run+0x68/0xa0
     io_sq_thread+0x252/0x3d0
     kthread+0xf9/0x130
     ret_from_fork+0x35/0x40

which is task_numa_work() oopsing on current->mm being NULL. The task work is queued by task_tick_numa(), which checks if current->mm is NULL at the time of the call. But this state isn't necessarily persistent, if the kthread is using use_mm() to temporarily adopt the mm of a task. Change the task_tick_numa() check to exclude kernel threads in general, as it doesn't make sense to attempt to balance for kthreads anyway. Reported-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dk
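The resulting check in task_tick_numa() amounts to something along these lines (an illustrative sketch wrapped in a helper; the exact upstream condition may differ slightly):

    #include <linux/sched.h>
    #include <linux/task_work.h>

    static bool numa_tick_allowed(struct task_struct *curr,
                                  struct callback_head *work)
    {
            /* Exclude kernel threads (PF_KTHREAD) explicitly: a kthread may
             * temporarily adopt a user mm via use_mm(), so testing
             * curr->mm alone is not a reliable indicator. */
            return curr->mm &&
                   !(curr->flags & (PF_EXITING | PF_KTHREAD)) &&
                   work->next == work;     /* callback not already queued */
    }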
2020-05-26x86/io_apic: Remove unused function mp_init_irq_at_boot()YueHaibing
There are no callers in-tree anymore since ef9e56d894ea ("x86/ioapic: Remove obsolete post hotplug update") so remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200508140808.49428-1-yuehaibing@huawei.com
2020-05-26x86/syscalls: Revert "x86/syscalls: Make __X32_SYSCALL_BIT be unsigned long"Andy Lutomirski
Revert 45e29d119e99 ("x86/syscalls: Make __X32_SYSCALL_BIT be unsigned long") and add a comment to discourage someone else from making the same mistake again. It turns out that some user code fails to compile if __X32_SYSCALL_BIT is unsigned long. See, for example [1] below. [ bp: Massage and do the same thing in the respective tools/ header. ] Fixes: 45e29d119e99 ("x86/syscalls: Make __X32_SYSCALL_BIT be unsigned long") Reported-by: Thorsten Glaser <t.glaser@tarent.de> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@kernel.org Link: [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=954294 Link: https://lkml.kernel.org/r/92e55442b744a5951fdc9cfee10badd0a5f7f828.1588983892.git.luto@kernel.org
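After the revert the definition is again a plain int constant, effectively the following (a sketch; the UL suffix is precisely what broke some user-space builds):

    /* x32 syscall marker bit, back to a plain int constant (bit 30). */
    #define __X32_SYSCALL_BIT       0x40000000

    /* Example use: deriving the x32 variant of a syscall number. */
    #define MY_X32_SYSCALL_NR(nr)   ((nr) | __X32_SYSCALL_BIT)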
2020-05-26spi: pxa2xx: Fix runtime PM ref imbalance on probe errorLukas Wunner
The PXA2xx SPI driver releases a runtime PM ref in the probe error path even though it hasn't acquired a ref earlier. Apparently commit e2b714afee32 ("spi: pxa2xx: Disable runtime PM if controller registration fails") sought to copy-paste the invocation of pm_runtime_disable() from pxa2xx_spi_remove(), but erroneously copied the call to pm_runtime_put_noidle() as well. Drop it. Fixes: e2b714afee32 ("spi: pxa2xx: Disable runtime PM if controller registration fails") Signed-off-by: Lukas Wunner <lukas@wunner.de> Reviewed-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: stable@vger.kernel.org # v4.17+ Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com> Link: https://lore.kernel.org/r/58b2ac6942ca1f91aaeeafe512144bc5343e1d84.1590408496.git.lukas@wunner.de Signed-off-by: Mark Brown <broonie@kernel.org>
2020-05-26spi: pxa2xx: Fix controller unregister orderLukas Wunner
The PXA2xx SPI driver uses devm_spi_register_controller() on bind. As a consequence, on unbind, __device_release_driver() first invokes pxa2xx_spi_remove() before unregistering the SPI controller via devres_release_all(). This order is incorrect: pxa2xx_spi_remove() disables the chip, rendering the SPI bus inaccessible even though the SPI controller is still registered. When the SPI controller is subsequently unregistered, it unbinds all its slave devices. Because their drivers cannot access the SPI bus, e.g. to quiesce interrupts, the slave devices may be left in an improper state. As a rule, devm_spi_register_controller() must not be used if the ->remove() hook performs teardown steps which shall be performed after unregistering the controller and specifically after unbinding of slaves. Fix by reverting to the non-devm variant of spi_register_controller(). An alternative approach would be to use device-managed functions for all steps in pxa2xx_spi_remove(), e.g. by calling devm_add_action_or_reset() on probe. However that approach would add more LoC to the driver and it wouldn't lend itself as well to backporting to stable. The improper use of devm_spi_register_controller() was introduced in 2013 by commit a807fcd090d6 ("spi: pxa2xx: use devm_spi_register_master()"), but all earlier versions of the driver going back to 2006 were likewise broken because they invoked spi_unregister_master() at the end of pxa2xx_spi_remove(), rather than at the beginning. Fixes: e0c9905e87ac ("[PATCH] SPI: add PXA2xx SSP SPI Driver") Signed-off-by: Lukas Wunner <lukas@wunner.de> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: stable@vger.kernel.org # v2.6.17+ Cc: Tsuchiya Yuto <kitakar@gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206403#c1 Link: https://lore.kernel.org/r/834c446b1cf3284d2660f1bee1ebe3e737cd02a9.1590408496.git.lukas@wunner.de Signed-off-by: Mark Brown <broonie@kernel.org>
2020-05-26spi: dw: Fix controller unregister orderLukas Wunner
The Designware SPI driver uses devm_spi_register_controller() on bind. As a consequence, on unbind, __device_release_driver() first invokes dw_spi_remove_host() before unregistering the SPI controller via devres_release_all(). This order is incorrect: dw_spi_remove_host() shuts down the chip, rendering the SPI bus inaccessible even though the SPI controller is still registered. When the SPI controller is subsequently unregistered, it unbinds all its slave devices. Because their drivers cannot access the SPI bus, e.g. to quiesce interrupts, the slave devices may be left in an improper state. As a rule, devm_spi_register_controller() must not be used if the ->remove() hook performs teardown steps which shall be performed after unregistering the controller and specifically after unbinding of slaves. Fix by reverting to the non-devm variant of spi_register_controller(). An alternative approach would be to use device-managed functions for all steps in dw_spi_remove_host(), e.g. by calling devm_add_action_or_reset() on probe. However that approach would add more LoC to the driver and it wouldn't lend itself as well to backporting to stable. Fixes: 04f421e7b0b1 ("spi: dw: use managed resources") Signed-off-by: Lukas Wunner <lukas@wunner.de> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: stable@vger.kernel.org # v3.14+ Cc: Baruch Siach <baruch@tkos.co.il> Link: https://lore.kernel.org/r/3fff8cb8ae44a9893840d0688be15bb88c090a14.1590408496.git.lukas@wunner.de Signed-off-by: Mark Brown <broonie@kernel.org>
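In generic terms the ordering both SPI fixes restore looks like this (a sketch with placeholder driver names; not the actual pxa2xx/dw diffs):

    #include <linux/platform_device.h>
    #include <linux/spi/spi.h>

    struct my_spi_priv { int dummy; };      /* placeholder private data */

    static int my_spi_probe(struct platform_device *pdev)
    {
            struct spi_controller *ctlr;

            ctlr = spi_alloc_master(&pdev->dev, sizeof(struct my_spi_priv));
            if (!ctlr)
                    return -ENOMEM;
            platform_set_drvdata(pdev, ctlr);
            /* ... set up the controller and the hardware ... */

            /* Non-devm variant, so ->remove() controls the ordering. */
            return spi_register_controller(ctlr);
    }

    static int my_spi_remove(struct platform_device *pdev)
    {
            struct spi_controller *ctlr = platform_get_drvdata(pdev);

            /* Unregister first: slave drivers can still reach the bus
             * while they are being unbound. */
            spi_unregister_controller(ctlr);

            /* ... only now quiesce and power down the hardware ... */
            return 0;
    }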
2020-05-26ARM: 8980/1: Allow either FLATMEM or SPARSEMEM on the multiplatform buildGregory Fong
ARMv7 chips with LPAE can often benefit from SPARSEMEM, as portions of system memory can be located deep in the 36-bit address space. Allow FLATMEM or SPARSEMEM to be selectable at compile time; FLATMEM remains the default. This is based on Kevin's "[PATCH 3/3] ARM: Allow either FLATMEM or SPARSEMEM on the multi-v7 build" from [1] and shamelessly rips off his commit message text above. As Arnd pointed out at [2] there doesn't seem to be any reason to tie this specifically to ARMv7, so this has been changed to apply to all multiplatform kernels. The addition of this option does not change the defaults and a build with any defconfig will behave the same way as previously. The only effect this change has is to enable user to change "Memory model" selection in interactive kernel configuration (menuconfig, xconfig etc). [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-September/286837.html [2] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-October/298950.html [ rppt: added ARCH_SELECT_MEMORY_MODEL and updated the changelog ] Cc: Kevin Cernekee <cernekee@gmail.com> Tested-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com> Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Mike Rapoport <mike.rapoport@gmail.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
2020-05-26ARM: 8979/1: Remove redundant ARCH_SPARSEMEM_DEFAULT settingKevin Cernekee
If ARCH_SPARSEMEM_ENABLE=y and ARCH_{FLATMEM,DISCONTIGMEM}_ENABLE=n, then the logic in mm/Kconfig already makes CONFIG_SPARSEMEM the only choice. This is true for all of the existing ARM users of ARCH_SPARSEMEM_ENABLE. Forcing ARCH_SPARSEMEM_DEFAULT=y if ARCH_SPARSEMEM_ENABLE=y prevents us from ever defaulting to FLATMEM, so we should remove this setting. Link: https://lkml.org/lkml/2015/6/4/757 Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Tested-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com> Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Mike Rapoport <mike.rapoport@gmail.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
2020-05-26ARM: 8978/1: mm: make act_mm() respect THREAD_SIZELinus Walleij
Recent work with KASan exposed the following hard-coded bitmask in arch/arm/mm/proc-macros.S:

    bic rd, sp, #8128
    bic rd, rd, #63

This forms the bitmask 0x1FFF, which coincides with (PAGE_SIZE << THREAD_SIZE_ORDER) - 1, i.e. this code was assuming that THREAD_SIZE is always 8K (8192). As KASan was increasing THREAD_SIZE_ORDER to 2, I ran into this bug. Fix it with this little one-liner suggested by Ard:

    bic rd, sp, #(THREAD_SIZE - 1) & ~63

where THREAD_SIZE is defined using THREAD_SIZE_ORDER. We also have to include <linux/const.h> since THREAD_SIZE expands to use the _AC() macro. Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Florian Fainelli <f.fainelli@gmail.com> Suggested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
2020-05-26Merge tag 'efi-arm-no-relocate-for-rmk' of git://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux into miscRussell King
Simplify EFI handover to decompressor: The EFI stub in the ARM kernel runs in the context of the firmware, which means it usually runs with the caches and MMU on. Currently, we relocate the zImage so it appears in the first 128 MiB, disable the MMU and caches and invoke the decompressor via its ordinary entry point. However, since we can pass the base of DRAM directly, there is no need to relocate the zImage, which also means there is no need to disable and re-enable the caches and create new page tables etc. This also allows systems whose DRAM start address is not a round multiple of 128 MB to decompress the kernel proper to the base of memory, ensuring that all memory is usable at runtime.
2020-05-26PM: runtime: clk: Fix clk_pm_runtime_get() error pathRafael J. Wysocki
clk_pm_runtime_get() assumes that the PM-runtime usage counter will be dropped by pm_runtime_get_sync() on errors, which is not the case, so PM-runtime references to devices acquired by the former are leaked on errors returned by the latter. Fix this by modifying clk_pm_runtime_get() to drop the reference if pm_runtime_get_sync() returns an error. Fixes: 9a34b45397e5 ("clk: Add support for runtime PM") Cc: 4.15+ <stable@vger.kernel.org> # 4.15+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2020-05-26cpuidle: Convert Qualcomm SPM driver to a generic CPUidle driverStephan Gerhold
The Qualcomm SPM cpuidle driver seems to be the last driver still using the generic ARM CPUidle infrastructure. Converting it actually allows us to simplify the driver, and we end up being able to remove more lines than adding new ones: - We can parse the CPUidle states in the device tree directly with dt_idle_states (and don't need to duplicate that functionality into the spm driver). - Each "saw" device managed by the SPM driver now directly registers its own cpuidle driver, removing the need for any global (per cpu) state. The device tree binding is the same, so the driver stays compatible with all old device trees. Signed-off-by: Stephan Gerhold <stephan@gerhold.net> Reviewed-by: Lina Iyer <ilina@codeaurora.org> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-05-26powerpc/64s: Fix restore of NV GPRs after facility unavailable exceptionMichael Ellerman
Commit 702f09805222 ("powerpc/64s/exception: Remove lite interrupt return") changed the interrupt return path to not restore non-volatile registers by default, and explicitly restore them in paths where it is required. But it missed that the facility unavailable exception can sometimes modify user registers, ie. when it does emulation of move from DSCR. This is seen as a failure of the dscr_sysfs_thread_test: test: dscr_sysfs_thread_test [cpu 0] User DSCR should be 1 but is 0 failure: dscr_sysfs_thread_test So restore non-volatile GPRs after facility unavailable exceptions. Currently the hypervisor facility unavailable exception is also wired up to call facility_unavailable_exception(). In practice we should never take a hypervisor facility unavailable exception for the DSCR. On older bare metal systems we set HFSCR_DSCR unconditionally in __init_HFSCR, or on newer systems it should be enabled via the "data-stream-control-register" device tree CPU feature. Even if it's not, since commit f3c99f97a3cd ("KVM: PPC: Book3S HV: Don't access HFSCR, LPIDR or LPCR when running nested"), the KVM code has unconditionally set HFSCR_DSCR when running guests. So we should only get a hypervisor facility unavailable for the DSCR if skiboot has disabled the "data-stream-control-register" feature, and we are somehow in guest context but not via KVM. Given all that, it should be unnecessary to add a restore of non-volatile GPRs after the hypervisor facility exception, because we never expect to hit that path. But equally we may as well add the restore, because we never expect to hit that path, and if we ever did, at least we would correctly restore the registers to their post emulation state. In future we can split the non-HV and HV facility unavailable handling so that there is no emulation in the HV handler, and then remove the restore for the HV case. Fixes: 702f09805222 ("powerpc/64s/exception: Remove lite interrupt return") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200526061808.2472279-1-mpe@ellerman.id.au
2020-05-26batman-adv: Revert "disable ethtool link speed detection when auto negotiation off"Sven Eckelmann
The commit 8c46fcd78308 ("batman-adv: disable ethtool link speed detection when auto negotiation off") disabled the usage of ethtool's link_ksetting when auto negotiation was enabled, due to invalid values when used with tun/tap virtual net_devices. According to the patch, automatic measurements should be used for these kinds of interfaces. But there are major flaws with this argumentation: * automatic measurements are not implemented * auto negotiation has nothing to do with the validity of the retrieved values The first point has to be fixed by a longer patch series. The "validity" part of the second point must be addressed in the same patch series by dropping the usage of ethtool's link_ksetting (thus always doing automatic measurements over ethernet). Drop the patch again to have more default values for various net_device types/configurations. The user can still overwrite them using the batadv_hardif's BATADV_ATTR_THROUGHPUT_OVERRIDE. Reported-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-05-26ALSA: usb-audio: mixer: volume quirk for ESS Technology Asus USB DACChris Chiu
The Asus USB DAC is a USB type-C audio dongle for connecting a headset or headphones. The volume minimum value of -23040, which is 0xa600 in hexadecimal, together with the resolution value of 1, indicates this is likely an endianness issue caused by a firmware bug. Add a volume quirk to fix the volume control problem. This also fixes these warnings:

    Warning! Unlikely big volume range (=23040), cval->res is probably wrong.
    [5] FU [Headset Capture Volume] ch = 1, val = -23040/0/1
    Warning! Unlikely big volume range (=23040), cval->res is probably wrong.
    [7] FU [Headset Playback Volume] ch = 1, val = -23040/0/1

Signed-off-by: Chris Chiu <chiu@endlessm.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20200526062613.55401-1-chiu@endlessm.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2020-05-26ALSA: hda/realtek - Add a model for Thinkpad T570 without DAC workaroundTakashi Iwai
We fixed the regression of the speaker volume for some Thinkpad models (e.g. T570) with commit 54947cd64c1b ("ALSA: hda/realtek - Fix speaker output regression on Thinkpad T570"). Essentially it fixes the DAC / pin pairing via a static table. It was confirmed and merged to stable kernels later. Now, interestingly, we got another regression report for the very same model (T570) about a similar problem, and the commit above was the culprit. That is, for some reason, there are devices that prefer DAC1, and other devices that prefer DAC2! Unfortunately those have the same ID and we have no idea what can differentiate them, so in this patch a new fixup model "tpt470-dock-fix" is provided, so that users with such a machine can apply it manually. When the model=tpt470-dock-fix option is passed to the snd-hda-intel module, it avoids the fixed DAC pairing and DAC1 is assigned to the speaker like in earlier kernel versions. Fixes: 54947cd64c1b ("ALSA: hda/realtek - Fix speaker output regression on Thinkpad T570") BugLink: https://apibugzilla.suse.com/show_bug.cgi?id=1172017 Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20200526062406.9799-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>