summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-01-23Merge branch 'Fixes-for-SONIC-ethernet-driver'David S. Miller
Finn Thain says: ==================== Fixes for SONIC ethernet driver Various SONIC driver problems have become apparent over the years, including tx watchdog timeouts, lost packets and duplicated packets. The problems are mostly caused by bugs in buffer handling, locking and (re-)initialization code. This patch series resolves these problems. This series has been tested on National Semiconductor hardware (macsonic), qemu-system-m68k (macsonic) and qemu-system-mips64el (jazzsonic). The emulated dp8393x device used in QEMU also has bugs. I have fixed the bugs that I know of in a series of patches at, https://github.com/fthain/qemu/commits/sonic Changed since v1: - Minor revisions as described in commit logs. - Deferred net-next patches. Changed since v2: - Minor revisions as described in commit logs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net/sonic: Prevent tx watchdog timeoutFinn Thain
Section 5.5.3.2 of the datasheet says, If FIFO Underrun, Byte Count Mismatch, Excessive Collision, or Excessive Deferral (if enabled) errors occur, transmission ceases. In this situation, the chip asserts a TXER interrupt rather than TXDN. But the handler for the TXDN is the only way that the transmit queue gets restarted. Hence, an aborted transmission can result in a watchdog timeout. This problem can be reproduced on congested link, as that can result in excessive transmitter collisions. Another way to reproduce this is with a FIFO Underrun, which may be caused by DMA latency. In event of a TXER interrupt, prevent a watchdog timeout by restarting transmission. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net/sonic: Fix CAM initializationFinn Thain
Section 4.3.1 of the datasheet says, This bit [TXP] must not be set if a Load CAM operation is in progress (LCAM is set). The SONIC will lock up if both bits are set simultaneously. Testing has shown that the driver sometimes attempts to set LCAM while TXP is set. Avoid this by waiting for command completion before and after giving the LCAM command. After issuing the Load CAM command, poll for !SONIC_CR_LCAM rather than SONIC_INT_LCD, because the SONIC_CR_TXP bit can't be used until !SONIC_CR_LCAM. When in reset mode, take the opportunity to reset the CAM Enable register. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net/sonic: Fix command register usageFinn Thain
There are several issues relating to command register usage during chip initialization. Firstly, the SONIC sometimes comes out of software reset with the Start Timer bit set. This gets logged as, macsonic macsonic eth0: sonic_init: status=24, i=101 Avoid this by giving the Stop Timer command earlier than later. Secondly, the loop that waits for the Read RRA command to complete has the break condition inverted. That's why the for loop iterates until its termination condition. Call the helper for this instead. Finally, give the Receiver Enable command after clearing interrupts, not before, to avoid the possibility of losing an interrupt. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net/sonic: Quiesce SONIC before re-initializing descriptor memoryFinn Thain
Make sure the SONIC's DMA engine is idle before altering the transmit and receive descriptors. Add a helper for this as it will be needed again. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net/sonic: Fix receive buffer replenishmentFinn Thain
As soon as the driver is finished with a receive buffer it allocs a new one and overwrites the corresponding RRA entry with a new buffer pointer. Problem is, the buffer pointer is split across two word-sized registers. It can't be updated in one atomic store. So this operation races with the chip while it stores received packets and advances its RRP register. This could result in memory corruption by a DMA write. Avoid this problem by adding buffers only at the location given by the RWP register, in accordance with the National Semiconductor datasheet. Re-factor this code into separate functions to calculate a RRA pointer and to update the RWP. Fixes: efcce839360f ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net/sonic: Improve receive descriptor status flag checkFinn Thain
After sonic_tx_timeout() calls sonic_init(), it can happen that sonic_rx() will subsequently encounter a receive descriptor with no flags set. Remove the comment that says that this can't happen. When giving a receive descriptor to the SONIC, clear the descriptor status field. That way, any rx descriptor with flags set can only be a newly received packet. Don't process a descriptor without the LPKT bit set. The buffer is still in use by the SONIC. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net/sonic: Avoid needless receive descriptor EOL flag updatesFinn Thain
The while loop in sonic_rx() traverses the rx descriptor ring. It stops when it reaches a descriptor that the SONIC has not used. Each iteration advances the EOL flag so the SONIC can keep using more descriptors. Therefore, the while loop has no definite termination condition. The algorithm described in the National Semiconductor literature is quite different. It consumes descriptors up to the one with its EOL flag set (which will also have its "in use" flag set). All freed descriptors are then returned to the ring at once, by adjusting the EOL flags (and link pointers). Adopt the algorithm from datasheet as it's simpler, terminates quickly and avoids a lot of pointless descriptor EOL flag changes. Fixes: efcce839360f ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net/sonic: Fix receive buffer handlingFinn Thain
The SONIC can sometimes advance its rx buffer pointer (RRP register) without advancing its rx descriptor pointer (CRDA register). As a result the index of the current rx descriptor may not equal that of the current rx buffer. The driver mistakenly assumes that they are always equal. This assumption leads to incorrect packet lengths and possible packet duplication. Avoid this by calling a new function to locate the buffer corresponding to a given descriptor. Fixes: efcce839360f ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net/sonic: Fix interface error stats collectionFinn Thain
The tx_aborted_errors statistic should count packets flagged with EXD, EXC, FU, or BCM bits because those bits denote an aborted transmission. That corresponds to the bitmask 0x0446, not 0x0642. Use macros for these constants to avoid mistakes. Better to leave out FIFO Underruns (FU) as there's a separate counter for that purpose. Don't lump all these errors in with the general tx_errors counter as that's used for tx timeout events. On the rx side, don't count RDE and RBAE interrupts as dropped packets. These interrupts don't indicate a lost packet, just a lack of resources. When a lack of resources results in a lost packet, this gets reported in the rx_missed_errors counter (along with RFO events). Don't double-count rx_frame_errors and rx_crc_errors. Don't use the general rx_errors counter for events that already have special counters. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net/sonic: Use MMIO accessorsFinn Thain
The driver accesses descriptor memory which is simultaneously accessed by the chip, so the compiler must not be allowed to re-order CPU accesses. sonic_buf_get() used 'volatile' to prevent that. sonic_buf_put() should have done so too but was overlooked. Fixes: efcce839360f ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net/sonic: Clear interrupt flags immediatelyFinn Thain
The chip can change a packet's descriptor status flags at any time. However, an active interrupt flag gets cleared rather late. This allows a race condition that could theoretically lose an interrupt. Fix this by clearing asserted interrupt flags immediately. Fixes: efcce839360f ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net/sonic: Add mutual exclusion for accessing shared stateFinn Thain
The netif_stop_queue() call in sonic_send_packet() races with the netif_wake_queue() call in sonic_interrupt(). This causes issues like "NETDEV WATCHDOG: eth0 (macsonic): transmit queue 0 timed out". Fix this by disabling interrupts when accessing tx_skb[] and next_tx. Update a comment to clarify the synchronization properties. Fixes: efcce839360f ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net: fsl/fman: rename IF_MODE_XGMII to IF_MODE_10GMadalin Bucur
As the only 10G PHY interface type defined at the moment the code was developed was XGMII, although the PHY interface mode used was not XGMII, XGMII was used in the code to denote 10G. This patch renames the 10G interface mode to remove the ambiguity. Signed-off-by: Madalin Bucur <madalin.bucur@oss.nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23Merge branch 'net-fsl-fman-address-erratum-A011043'David S. Miller
Madalin Bucur says: ==================== net: fsl/fman: address erratum A011043 This addresses a HW erratum on some QorIQ DPAA devices. MDIO reads to internal PCS registers may result in having the MDIO_CFG[MDIO_RD_ER] bit set, even when there is no error and read data (MDIO_DATA[MDIO_DATA]) is correct. Software may get false read error when reading internal PCS registers through MDIO. As a workaround, all internal MDIO accesses should ignore the MDIO_CFG[MDIO_RD_ER] bit. When the issue was present, one could see such errors during boot: mdio_bus ffe4e5000: Error while reading PHY0 reg at 3.3 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23net/fsl: treat fsl,erratum-a011043Madalin Bucur
When fsl,erratum-a011043 is set, adjust for erratum A011043: MDIO reads to internal PCS registers may result in having the MDIO_CFG[MDIO_RD_ER] bit set, even when there is no error and read data (MDIO_DATA[MDIO_DATA]) is correct. Software may get false read error when reading internal PCS registers through MDIO. As a workaround, all internal MDIO accesses should ignore the MDIO_CFG[MDIO_RD_ER] bit. Signed-off-by: Madalin Bucur <madalin.bucur@oss.nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23powerpc/fsl/dts: add fsl,erratum-a011043Madalin Bucur
Add fsl,erratum-a011043 to internal MDIO buses. Software may get false read error when reading internal PCS registers through MDIO. As a workaround, all internal MDIO accesses should ignore the MDIO_CFG[MDIO_RD_ER] bit. Signed-off-by: Madalin Bucur <madalin.bucur@oss.nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23dt-bindings: net: add fsl,erratum-a011043Madalin Bucur
Add an entry for erratum A011043: the MDIO_CFG[MDIO_RD_ER] bit may be falsely set when reading internal PCS registers. MDIO reads to internal PCS registers may result in having the MDIO_CFG[MDIO_RD_ER] bit set, even when there is no error and read data (MDIO_DATA[MDIO_DATA]) is correct. Software may get false read error when reading internal PCS registers through MDIO. As a workaround, all internal MDIO accesses should ignore the MDIO_CFG[MDIO_RD_ER] bit. Signed-off-by: Madalin Bucur <madalin.bucur@oss.nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23qlcnic: Fix CPU soft lockup while collecting firmware dumpManish Chopra
Driver while collecting firmware dump takes longer time to collect/process some of the firmware dump entries/memories. Bigger capture masks makes it worse as it results in larger amount of data being collected and results in CPU soft lockup. Place cond_resched() in some of the driver flows that are expectedly time consuming to relinquish the CPU to avoid CPU soft lockup panic. Signed-off-by: Shahed Shaikh <shshaikh@marvell.com> Tested-by: Yonggen Xu <Yonggen.Xu@dell.com> Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23Merge tag 'xarray-5.5' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds
Pull XArray fixes from Matthew Wilcox: "Primarily bugfixes, mostly around handling index wrap-around correctly. A couple of doc fixes and adding missing APIs. I had an oops live on stage at linux.conf.au this year, and it turned out to be a bug in xas_find() which I can't prove isn't triggerable in the current codebase. Then in looking for the bug, I spotted two more bugs. The bots have had a few days to chew on this with no problems reported, and it passes the test-suite (which now has more tests to make sure these problems don't come back)" * tag 'xarray-5.5' of git://git.infradead.org/users/willy/linux-dax: XArray: Add xa_for_each_range XArray: Fix xas_find returning too many entries XArray: Fix xa_find_after with multi-index entries XArray: Fix infinite loop with entry at ULONG_MAX XArray: Add wrappers for nested spinlocks XArray: Improve documentation of search marks XArray: Fix xas_pause at ULONG_MAX
2020-01-23Merge tag 'trace-v5.5-rc6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Various tracing fixes: - Fix a function comparison warning for a xen trace event macro - Fix a double perf_event linking to a trace_uprobe_filter for multiple events - Fix suspicious RCU warnings in trace event code for using list_for_each_entry_rcu() when the "_rcu" portion wasn't needed. - Fix a bug in the histogram code when using the same variable - Fix a NULL pointer dereference when tracefs lockdown enabled and calling trace_set_default_clock() - A fix to a bug found with the double perf_event linking patch" * tag 'trace-v5.5-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/uprobe: Fix to make trace_uprobe_filter alignment safe tracing: Do not set trace clock if tracefs lockdown is in effect tracing: Fix histogram code when expression has same var as value tracing: trigger: Replace unneeded RCU-list traversals tracing/uprobe: Fix double perf_event linking on multiprobe uprobe tracing: xen: Ordered comparison of function pointers
2020-01-23Merge tag 'ceph-for-5.5-rc8' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fix from Ilya Dryomov: "A fix for a potential use-after-free from Jeff, marked for stable" * tag 'ceph-for-5.5-rc8' of https://github.com/ceph/ceph-client: ceph: hold extra reference to r_parent over life of request
2020-01-23Merge tag 'pm-5.5-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "Prevent the kernel from crashing during resume from hibernation if free pages contain leftover data from the restore kernel and init_on_free is set (Alexander Potapenko)" * tag 'pm-5.5-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: hibernate: fix crashes with init_on_free=1
2020-01-23Merge tag 'pci-v5.5-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fix from Bjorn Helgaas: "Mark ATS as broken on AMD Navi14 GPU rev 0xc5 (Alex Deucher)" * tag 'pci-v5.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI: Mark AMD Navi14 GPU rev 0xc5 ATS as broken
2020-01-23partitions/ldm: fix spelling mistake "to" -> "too"Colin Ian King
There is a spelling mistake in a ldm_error message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: reap from tail of c->btree_cache in bch_mca_scan()Coly Li
When shrink btree node cache from c->btree_cache in bch_mca_scan(), no matter the selected node is reaped or not, it will be rotated from the head to the tail of c->btree_cache list. But in bcache journal code, when flushing the btree nodes with oldest journal entry, btree nodes are iterated and slected from the tail of c->btree_cache list in btree_flush_write(). The list_rotate_left() in bch_mca_scan() will make btree_flush_write() iterate more nodes in c->btree_list in reverse order. This patch just reaps the selected btree node cache, and not move it from the head to the tail of c->btree_cache list. Then bch_mca_scan() will not mess up c->btree_cache list to btree_flush_write(). Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()Coly Li
In order to skip the most recently freed btree node cahce, currently in bch_mca_scan() the first 3 caches in c->btree_cache_freeable list are skipped when shrinking bcache node caches in bch_mca_scan(). The related code in bch_mca_scan() is, 737 list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) { 738 if (nr <= 0) 739 goto out; 740 741 if (++i > 3 && 742 !mca_reap(b, 0, false)) { lines free cache memory 746 } 747 nr--; 748 } The problem is, if virtual memory code calls bch_mca_scan() and the calculated 'nr' is 1 or 2, then in the above loop, nothing will be shunk. In such case, if slub/slab manager calls bch_mca_scan() for many times with small scan number, it does not help to shrink cache memory and just wasts CPU cycles. This patch just selects btree node caches from tail of the c->btree_cache_freeable list, then the newly freed host cache can still be allocated by mca_alloc(), and at least 1 node can be shunk. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: remove member accessed from struct btreeColy Li
The member 'accessed' of struct btree is used in bch_mca_scan() when shrinking btree node caches. The original idea is, if b->accessed is set, clean it and look at next btree node cache from c->btree_cache list, and only shrink the caches whose b->accessed is cleaned. Then only cold btree node cache will be shrunk. But when I/O pressure is high, it is very probably that b->accessed of a btree node cache will be set again in bch_btree_node_get() before bch_mca_scan() selects it again. Then there is no chance for bch_mca_scan() to shrink enough memory back to slub or slab system. This patch removes member accessed from struct btree, then once a btree node ache is selected, it will be immediately shunk. By this change, bch_mca_scan() may release btree node cahce more efficiently. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: print written and keys in trace_bcache_btree_writeGuoju Fang
It's useful to dump written block and keys on btree write, this patch add them into trace_bcache_btree_write. Signed-off-by: Guoju Fang <fangguoju@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: avoid unnecessary btree nodes flushing in btree_flush_write()Coly Li
the commit 91be66e1318f ("bcache: performance improvement for btree_flush_write()") was an effort to flushing btree node with oldest btree node faster in following methods, - Only iterate dirty btree nodes in c->btree_cache, avoid scanning a lot of clean btree nodes. - Take c->btree_cache as a LRU-like list, aggressively flushing all dirty nodes from tail of c->btree_cache util the btree node with oldest journal entry is flushed. This is to reduce the time of holding c->bucket_lock. Guoju Fang and Shuang Li reported that they observe unexptected extra write I/Os on cache device after applying the above patch. Guoju Fang provideed more detailed diagnose information that the aggressive btree nodes flushing may cause 10x more btree nodes to flush in his workload. He points out when system memory is large enough to hold all btree nodes in memory, c->btree_cache is not a LRU-like list any more. Then the btree node with oldest journal entry is very probably not- close to the tail of c->btree_cache list. In such situation much more dirty btree nodes will be aggressively flushed before the target node is flushed. When slow SATA SSD is used as cache device, such over- aggressive flushing behavior will cause performance regression. After spending a lot of time on debug and diagnose, I find the real condition is more complicated, aggressive flushing dirty btree nodes from tail of c->btree_cache list is not a good solution. - When all btree nodes are cached in memory, c->btree_cache is not a LRU-like list, the btree nodes with oldest journal entry won't be close to the tail of the list. - There can be hundreds dirty btree nodes reference the oldest journal entry, before flushing all the nodes the oldest journal entry cannot be reclaimed. When the above two conditions mixed together, a simply flushing from tail of c->btree_cache list is really NOT a good idea. Fortunately there is still chance to make btree_flush_write() work better. Here is how this patch avoids unnecessary btree nodes flushing, - Only acquire c->journal.lock when getting oldest journal entry of fifo c->journal.pin. In rested locations check the journal entries locklessly, so their values can be changed on other cores in parallel. - In loop list_for_each_entry_safe_reverse(), checking latest front point of fifo c->journal.pin. If it is different from the original point which we get with locking c->journal.lock, it means the oldest journal entry is reclaim on other cores. At this moment, all selected dirty nodes recorded in array btree_nodes[] are all flushed and clean on other CPU cores, it is unncessary to iterate c->btree_cache any longer. Just quit the list_for_each_entry_safe_reverse() loop and the following for-loop will skip all the selected clean nodes. - Find a proper time to quit the list_for_each_entry_safe_reverse() loop. Check the refcount value of orignial fifo front point, if the value is larger than selected node number of btree_nodes[], it means more matching btree nodes should be scanned. Otherwise it means no more matching btee nodes in rest of c->btree_cache list, the loop can be quit. If the original oldest journal entry is reclaimed and fifo front point is updated, the refcount of original fifo front point will be 0, then the loop will be quit too. - Not hold c->bucket_lock too long time. c->bucket_lock is also required for space allocation for cached data, hold it for too long time will block regular I/O requests. When iterating list c->btree_cache, even there are a lot of maching btree nodes, in order to not holding c->bucket_lock for too long time, only BTREE_FLUSH_NR nodes are selected and to flush in following for-loop. With this patch, only btree nodes referencing oldest journal entry are flushed to cache device, no aggressive flushing for unnecessary btree node any more. And in order to avoid blocking regluar I/O requests, each time when btree_flush_write() called, at most only BTREE_FLUSH_NR btree nodes are selected to flush, even there are more maching btree nodes in list c->btree_cache. At last, one more thing to explain: Why it is safe to read front point of c->journal.pin without holding c->journal.lock inside the list_for_each_entry_safe_reverse() loop ? Here is my answer: When reading the front point of fifo c->journal.pin, we don't need to know the exact value of front point, we just want to check whether the value is different from the original front point (which is accurate value because we get it while c->jouranl.lock is held). For such purpose, it works as expected without holding c->journal.lock. Even the front point is changed on other CPU core and not updated to local core, and current iterating btree node has identical journal entry local as original fetched fifo front point, it is still safe. Because after holding mutex b->write_lock (with memory barrier) this btree node can be found as clean and skipped, the loop will quite latter when iterate on next node of list c->btree_cache. Fixes: 91be66e1318f ("bcache: performance improvement for btree_flush_write()") Reported-by: Guoju Fang <fangguoju@gmail.com> Reported-by: Shuang Li <psymon@bonuscloud.io> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: add code comments for state->pool in __btree_sort()Coly Li
To explain the pages allocated from mempool state->pool can be swapped in __btree_sort(), because state->pool is a page pool, which allocates pages by alloc_pages() indeed. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23lib: crc64: include <linux/crc64.h> for 'crc64_be'Ben Dooks (Codethink)
The crc64_be() is declared in <linux/crc64.h> so include this where the symbol is defined to avoid the following warning: lib/crc64.c:43:12: warning: symbol 'crc64_be' was not declared. Should it be static? Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: use read_cache_page_gfp to read the superblockChristoph Hellwig
Avoid a pointless dependency on buffer heads in bcache by simply open coding reading a single page. Also add a SB_OFFSET define for the byte offset of the superblock instead of using magic numbers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: store a pointer to the on-disk sb in the cache and cached_dev structuresChristoph Hellwig
This allows to properly build the superblock bio including the offset in the page using the normal bio helpers. This fixes writing the superblock for page sizes larger than 4k where the sb write bio would need an offset in the bio_vec. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: return a pointer to the on-disk sb from read_superChristoph Hellwig
Returning the properly typed actual data structure insteaf of the containing struct page will save the callers some work going forward. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: transfer the sb_page reference to register_{bdev,cache}Christoph Hellwig
Avoid an extra reference count roundtrip by transferring the sb_page ownership to the lower level register helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: fix use-after-free in register_bcache()Coly Li
The patch "bcache: rework error unwinding in register_bcache" introduces a use-after-free regression in register_bcache(). Here are current code, 2510 out_free_path: 2511 kfree(path); 2512 out_module_put: 2513 module_put(THIS_MODULE); 2514 out: 2515 pr_info("error %s: %s", path, err); 2516 return ret; If some error happens and the above code path is executed, at line 2511 path is released, but referenced at line 2515. Then KASAN reports a use- after-free error message. This patch changes line 2515 in the following way to fix the problem, 2515 pr_info("error %s: %s", path?path:"", err); Signed-off-by: Coly Li <colyli@suse.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: properly initialize 'path' and 'err' in register_bcache()Coly Li
Patch "bcache: rework error unwinding in register_bcache" from Christoph Hellwig changes the local variables 'path' and 'err' in undefined initial state. If the code in register_bcache() jumps to label 'out:' or 'out_module_put:' by goto, these two variables might be reference with undefined value by the following line, out_module_put: module_put(THIS_MODULE); out: pr_info("error %s: %s", path, err); return ret; Therefore this patch initializes these two local variables properly in register_bcache() to avoid such issue. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: rework error unwinding in register_bcacheChristoph Hellwig
Split the successful and error return path, and use one goto label for each resource to unwind. This also fixes some small errors like leaking the module reference count in the reboot case (which seems entirely harmless) or printing the wrong warning messages for early failures. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: use a separate data structure for the on-disk super blockChristoph Hellwig
Split out an on-disk version struct cache_sb with the proper endianness annotations. This fixes a fair chunk of sparse warnings, but there are some left due to the way the checksum is defined. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23bcache: cached_dev_free needs to put the sb pageLiang Chen
Same as cache device, the buffer page needs to be put while freeing cached_dev. Otherwise a page would be leaked every time a cached_dev is stopped. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23readdir: make user_access_begin() use the real access rangeLinus Torvalds
In commit 9f79b78ef744 ("Convert filldir[64]() from __put_user() to unsafe_put_user()") I changed filldir to not do individual __put_user() accesses, but instead use unsafe_put_user() surrounded by the proper user_access_begin/end() pair. That make them enormously faster on modern x86, where the STAC/CLAC games make individual user accesses fairly heavy-weight. However, the user_access_begin() range was not really the exact right one, since filldir() has the unfortunate problem that it needs to not only fill out the new directory entry, it also needs to fix up the previous one to contain the proper file offset. It's unfortunate, but the "d_off" field in "struct dirent" is _not_ the file offset of the directory entry itself - it's the offset of the next one. So we end up backfilling the offset in the previous entry as we walk along. But since x86 didn't really care about the exact range, and used to be the only architecture that did anything fancy in user_access_begin() to begin with, the filldir[64]() changes did something lazy, and even commented on it: /* * Note! This range-checks 'previous' (which may be NULL). * The real range was checked in getdents */ if (!user_access_begin(dirent, sizeof(*dirent))) goto efault; and it all worked fine. But now 32-bit ppc is starting to also implement user_access_begin(), and the fact that we faked the range to only be the (possibly not even valid) previous directory entry becomes a problem, because ppc32 will actually be using the range that is passed in for more than just "check that it's user space". This is a complete rewrite of Christophe's original patch. By saving off the record length of the previous entry instead of a pointer to it in the filldir data structures, we can simplify the range check and the writing of the previous entry d_off field. No need for any conditionals in the user accesses themselves, although we retain the conditional EINTR checking for the "was this the first directory entry" signal handling latency logic. Fixes: 9f79b78ef744 ("Convert filldir[64]() from __put_user() to unsafe_put_user()") Link: https://lore.kernel.org/lkml/a02d3426f93f7eb04960a4d9140902d278cab0bb.1579697910.git.christophe.leroy@c-s.fr/ Link: https://lore.kernel.org/lkml/408c90c4068b00ea8f1c41cca45b84ec23d4946b.1579783936.git.christophe.leroy@c-s.fr/ Reported-and-tested-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-23readdir: be more conservative with directory entry namesLinus Torvalds
Commit 8a23eb804ca4 ("Make filldir[64]() verify the directory entry filename is valid") added some minimal validity checks on the directory entries passed to filldir[64](). But they really were pretty minimal. This fleshes out at least the name length check: we used to disallow zero-length names, but really, negative lengths or oevr-long names aren't ok either. Both could happen if there is some filesystem corruption going on. Now, most filesystems tend to use just an "unsigned char" or similar for the length of a directory entry name, so even with a corrupt filesystem you should never see anything odd like that. But since we then use the name length to create the directory entry record length, let's make sure it actually is half-way sensible. Note how POSIX states that the size of a path component is limited by NAME_MAX, but we actually use PATH_MAX for the check here. That's because while NAME_MAX is generally the correct maximum name length (it's 255, for the same old "name length is usually just a byte on disk"), there's nothing in the VFS layer that really cares. So the real limitation at a VFS layer is the total pathname length you can pass as a filename: PATH_MAX. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-23Btrfs: make deduplication with range including the last block workFilipe Manana
Since btrfs was migrated to use the generic VFS helpers for clone and deduplication, it stopped allowing for the last block of a file to be deduplicated when the source file size is not sector size aligned (when eof is somewhere in the middle of the last block). There are two reasons for that: 1) The generic code always rounds down, to a multiple of the block size, the range's length for deduplications. This means we end up never deduplicating the last block when the eof is not block size aligned, even for the safe case where the destination range's end offset matches the destination file's size. That rounding down operation is done at generic_remap_check_len(); 2) Because of that, the btrfs specific code does not expect anymore any non-aligned range length's for deduplication and therefore does not work if such nona-aligned length is given. This patch addresses that second part, and it depends on a patch that fixes generic_remap_check_len(), in the VFS, which was submitted ealier and has the following subject: "fs: allow deduplication of eof block into the end of the destination file" These two patches address reports from users that started seeing lower deduplication rates due to the last block never being deduplicated when the file size is not aligned to the filesystem's block size. Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-23fs: allow deduplication of eof block into the end of the destination fileFilipe Manana
We always round down, to a multiple of the filesystem's block size, the length to deduplicate at generic_remap_check_len(). However this is only needed if an attempt to deduplicate the last block into the middle of the destination file is requested, since that leads into a corruption if the length of the source file is not block size aligned. When an attempt to deduplicate the last block into the end of the destination file is requested, we should allow it because it is safe to do it - there's no stale data exposure and we are prepared to compare the data ranges for a length not aligned to the block (or page) size - in fact we even do the data compare before adjusting the deduplication length. After btrfs was updated to use the generic helpers from VFS (by commit 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")) we started to have user reports of deduplication not reflinking the last block anymore, and whence users getting lower deduplication scores. The main use case is deduplication of entire files that have a size not aligned to the block size of the filesystem. We already allow cloning the last block to the end (and beyond) of the destination file, so allow for deduplication as well. Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-23drm/amdgpu: remove the experimental flag for renoirAlex Deucher
Should work properly with the latest sbios on 5.5 and newer kernels. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-23btrfs: free block groups after free'ing fs treesJosef Bacik
Sometimes when running generic/475 we would trip the WARN_ON(cache->reserved) check when free'ing the block groups on umount. This is because sometimes we don't commit the transaction because of IO errors and thus do not cleanup the tree logs until at umount time. These blocks are still reserved until they are cleaned up, but they aren't cleaned up until _after_ we do the free block groups work. Fix this by moving the free after free'ing the fs roots, that way all of the tree logs are cleaned up and we have a properly cleaned fs. A bunch of loops of generic/475 confirmed this fixes the problem. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-23btrfs: Fix split-brain handling when changing FSID to metadata uuidNikolay Borisov
Current code doesn't correctly handle the situation which arises when a file system that has METADATA_UUID_INCOMPAT flag set and has its FSID changed to the one in metadata uuid. This causes the incompat flag to disappear. In case of a power failure we could end up in a situation where part of the disks in a multi-disk filesystem are correctly reverted to METADATA_UUID_INCOMPAT flag unset state, while others have METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS. This patch corrects the behavior required to handle the case where a disk of the second type is scanned first, creating the necessary btrfs_fs_devices. Subsequently, when a disk which has already completed the transition is scanned it should overwrite the data in btrfs_fs_devices. Reported-by: Su Yue <Damenly_Su@gmx.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-23btrfs: Handle another split brain scenario with metadata uuid featureNikolay Borisov
There is one more cases which isn't handled by the original metadata uuid work. Namely, when a filesystem has METADATA_UUID incompat bit and the user decides to change the FSID to the original one e.g. have metadata_uuid and fsid match. In case of power failure while this operation is in progress we could end up in a situation where some of the disks have the incompat bit removed and the other half have both METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags. This patch handles the case where a disk that has successfully changed its FSID such that it equals METADATA_UUID is scanned first. Subsequently when a disk with both METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned find_fsid_changed won't be able to find an appropriate btrfs_fs_devices. This is done by extending find_fsid_changed to correctly find btrfs_fs_devices whose metadata_uuid/fsid are the same and they match the metadata_uuid of the currently scanned device. Fixes: cc5de4e70256 ("btrfs: Handle final split-brain possibility during fsid change") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reported-by: Su Yue <Damenly_Su@gmx.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-23btrfs: Factor out metadata_uuid code from find_fsid.Su Yue
find_fsid became rather hairy with the introduction of metadata uuid changing feature. Alleviate this by factoring out the metadata uuid specific code in a dedicated function which deals with finding correct fsid for a device with changed uuid. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Su Yue <Damenly_Su@gmx.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>