Age  Commit message  Author
2017-02-20  rbd: use rbd_obj_bytes() more  (Ilya Dryomov)
Returning u64 doesn't make sense: max header->obj_order is 25 and ceph_file_layout::object_size is u32.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
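For illustration, a helper of this shape returns u32 naturally - a sketch only, assuming obj_order holds the log2 of the object size as the message implies:

	/* Sketch: with obj_order capped at 25, 1U << obj_order fits in u32 */
	static u32 rbd_obj_bytes(struct rbd_image_header *header)
	{
		return 1U << header->obj_order;
	}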
2017-02-20  rbd: remove now unused rbd_obj_request_wait() and helpers  (Ilya Dryomov)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20  rbd: switch rbd_obj_method_sync() to ceph_osdc_call()  (Ilya Dryomov)
As explained in the previous commit, the rbd_obj_request machinery (and rbd_osd_req_create() in particular) shouldn't be used for working with metadata objects. Switch to the recently added ceph_osdc_call(). It assumes single pages for the outbound and inbound buffers, but that's OK - none of the callers need more than that. These pages need to be allocated (the messenger is in dire need of a proper iterator interface!), but we are swapping them for the pages[] and pagelist allocations in the existing code.

Kill the class_name argument - all rbd methods are under "rbd".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20  libceph: pass reply buffer length through ceph_osdc_call()  (Ilya Dryomov)
To spare some callers the check "this reply fits into a page, but does it fit into my buffer?", osd_req_op_cls_response_data_pages() needs to know how big the buffer is.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20  rbd: do away with obj_request in rbd_obj_read_sync()  (Ilya Dryomov)
rbd_obj_request machinery is completely unnecessary here; all that's being done is fetching a metadata object - no striping, cloning, etc. More importantly, rbd_osd_req_create() grabs pool id from layout and that is becoming a data pool id.

Kill offset argument - all metadata objects are small and read in full.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20  rbd: initialize rbd_dev->header_oloc early  (Ilya Dryomov)
No reason to delay it until image_id is known. This will be required by some rbd_obj_method_sync() callers, after rbd_obj_method_sync() is changed to take oloc.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20  rbd: kill rbd_image_header::{crypt_type,comp_type}  (Ilya Dryomov)
Image format 1 is deprecated and format 2 doesn't have these. Also, __rbd_dev_create() takes care of zeroing (or otherwise initializing) format 2 specific fields.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20  rbd: use kstrndup() in rbd_header_from_disk()  (Ilya Dryomov)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
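For illustration, a sketch of the pattern, assuming the on-disk prefix is a fixed-size, possibly non-NUL-terminated char array; kstrndup() bounds the copy and always NUL-terminates:

	/* Field names are assumptions for this sketch */
	header->object_prefix = kstrndup(ondisk->object_prefix,
					 sizeof(ondisk->object_prefix),
					 GFP_KERNEL);
	if (!header->object_prefix)
		return -ENOMEM;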
2017-02-20  libceph: bump CEPH_PG_MAX_SIZE to 32  (Ilya Dryomov)
... to accommodate potentially very wide EC pools. This increases the size of a typical rbd ceph_osd_request by ~12% (from 1040 to 1168 bytes), but I'd rather go future proof here.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-20  libceph: don't go through with the mapping if the PG is too wide  (Ilya Dryomov)
With EC overwrites maturing, the kernel client will be getting exposed to potentially very wide EC pools. While "min(pi->size, X)" works fine when the cluster is stable and happy, truncating OSD sets interferes with resend logic (ceph_is_new_interval(), etc). Abort the mapping if the pool is too wide, assigning the request to the homeless session.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-20  crush: merge working data and scratch  (Ilya Dryomov)
Much like Arlo Guthrie, I decided that one big pile is better than two little piles.

Reflects ceph.git commit 95c2df6c7e0b22d2ea9d91db500cf8b9441c73ba.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  crush: remove mutable part of CRUSH map  (Ilya Dryomov)
Then add it to the working state. It would be very nice if we didn't have to take a lock to calculate a crush placement. By moving the permutation array into the working data, we can treat the CRUSH map as immutable.

Reflects ceph.git commit cbcd039651c0569551cb90d26ce27e1432671f2a.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  libceph: add osdmap_set_crush() helper  (Ilya Dryomov)
Simplify osdmap_decode() and osdmap_apply_incremental() a bit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  libceph: remove unneeded stddef.h include  (Stafford Horne)
This was causing a build failure for openrisc when using musl and gcc 5.4.0, since the file is not available in the toolchain. It doesn't seem to be needed, and removing it does not cause any build warnings for me.
Signed-off-by: Stafford Horne <shorne@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  ceph: do a LOOKUP in d_revalidate instead of GETATTR  (Jeff Layton)
In commit c3f4688a08f ("ceph: don't set req->r_locked_dir in ceph_d_revalidate"), we changed the code to do a GETATTR instead of a LOOKUP, as the parent info isn't strictly necessary to revalidate the dentry. What we missed there, though, is that in order to update the lease on the dentry after revalidating it, we _do_ need parent info.

Change ceph_d_revalidate back to doing a LOOKUP instead of a GETATTR so that we can get the parent info in order to update the lease from ceph_fill_trace. Note that we set req->r_parent here, but we cannot set the CEPH_MDS_R_PARENT_LOCKED flag, as we can't guarantee that the parent is actually locked.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
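The gist, as a hedged sketch (the request setup around ceph_mdsc_create_request is an assumption; the op and flag names are from the message):

	req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_LOOKUP, USE_ANY_MDS);
	if (!IS_ERR(req)) {
		req->r_parent = dir;	/* parent info lets ceph_fill_trace
					 * update the dentry lease */
		/* CEPH_MDS_R_PARENT_LOCKED deliberately NOT set: we
		 * can't guarantee the parent's i_rwsem is held here */
	}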
2017-02-20  ceph: call update_dentry_lease even when r_locked_dir is not set  (Jeff Layton)
We don't really require that the parent be locked in order to update the lease on a dentry. Lease info is protected by the d_lock. In the event that the parent is not locked in ceph_fill_trace, and we have both parent and target info, go ahead and update the dentry lease.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  ceph: vet the target and parent inodes before updating dentry lease  (Jeff Layton)
In a later patch, we're going to need to allow ceph_fill_trace to update the dentry's lease when the parent is not locked. This is potentially racy though -- by the time we get around to processing the trace, the parent may have already changed.

Change update_dentry_lease to take a ceph_vino pointer and use that to ensure that the dentry's parent still matches it before updating the lease.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  ceph: don't update_dentry_lease unless we actually got one  (Jeff Layton)
This if block updates the dentry lease even in the case where the MDS didn't grant one.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  ceph: add a new flag to indicate whether parent is locked  (Jeff Layton)
struct ceph_mds_request has an r_locked_dir pointer, which is set to indicate the parent inode and that its i_rwsem is locked. In some critical places, we need to be able to indicate the parent inode to the request handling code, even when its i_rwsem may not be locked.

Most of the code that operates on r_locked_dir doesn't require that the i_rwsem be locked. We only really need it to handle manipulation of the dcache. The rest (filling of the inode, updating dentry leases, etc.) already has its own locking.

Add a new r_req_flags bit that indicates whether the parent is locked when doing the request, and rename the pointer to "r_parent". For now, all the places that set r_parent also set this flag, but that will change in a later patch.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  ceph: convert bools in ceph_mds_request to a new r_req_flags field  (Jeff Layton)
Currently, we have a bunch of bool flags in struct ceph_mds_request. We need more flags, though, and each bool takes (at least) a byte. Those add up over time.

Merge all of the existing bools in this struct into a single unsigned long, and use the set/test/clear_bit macros to manipulate them. These are atomic operations, but that is required here to prevent load/modify/store races. The existing flags are protected by different locks, so we can't rely on them for that purpose.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
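For illustration, the pattern looks like this - a minimal sketch with made-up bit names (the real definitions live in the ceph headers):

	/* Illustrative flag bits; not the actual definitions */
	#define CEPH_MDS_R_ABORTED	1
	#define CEPH_MDS_R_GOT_RESULT	2

	struct ceph_mds_request {
		/* ... */
		unsigned long r_req_flags;	/* replaces several bools */
	};

	/* Atomic bit ops prevent load/modify/store races between flags
	 * that used to be protected by different locks: */
	set_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags);
	if (test_bit(CEPH_MDS_R_GOT_RESULT, &req->r_req_flags))
		clear_bit(CEPH_MDS_R_GOT_RESULT, &req->r_req_flags);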
2017-02-20  ceph: drop session argument to ceph_fill_trace  (Jeff Layton)
Just get it from r_session since that's what's always passed in.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  ceph: remove "Debugging hook" from ceph_fill_trace  (Jeff Layton)
Keeping around commented out code is just asking for it to bitrot and makes viewing the code under cscope more confusing. If we really need this, then we can revert this patch and put it under a Kconfig option.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  ceph: avoid calling ceph_renew_caps() infinitely  (Yan, Zheng)
__ceph_caps_mds_wanted() ignores caps from a stale session, so its return value can stay the same across ceph_renew_caps(). This causes try_get_cap_refs() to keep calling ceph_renew_caps(). The fix is to skip the session-validity check in the try_get_cap_refs() case: if the session is stale, just let the caps requester sleep.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2017-02-20  ceph: make sure flushing inode is in proper session's cap_flushing list  (Yan, Zheng)
When the flushing inode's auth cap changes, we need to move the inode into the new auth cap session's cap_flushing list.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2017-02-20  ceph: update readpages osd request according to size of pages  (Yan, Zheng)
add_to_page_cache_lru() can fail, so the actual number of pages to read can be smaller than the initial size of the osd request. We need to update the osd request size in that case.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-02-20  ceph: fix bogus endianness change in ceph_ioctl_set_layout  (Jeff Layton)
sparse says:

  fs/ceph/ioctl.c:100:28: warning: cast to restricted __le64

preferred_osd is a __s64, so we don't need to do any conversion. Also, just remove the cast in ceph_ioctl_get_layout, as it's not needed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
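Illustratively, the fix is a one-line change of this shape (the variable names are assumptions, not checked against the file):

	/* Before: sparse warns about treating a native __s64 as __le64 */
	l.preferred_osd = le64_to_cpu(nl.preferred_osd);
	/* After: __s64 is already native-endian, copy it directly */
	l.preferred_osd = nl.preferred_osd;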
2017-02-20  libceph: include linux/sched.h into crypto.c directly  (Ilya Dryomov)
Currently crypto.c gets linux/sched.h indirectly through linux/slab.h from linux/kasan.h. Include it directly for memalloc_noio_*() inlines.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  libceph: use BUG() instead of BUG_ON(1)  (Arnd Bergmann)
I ran into this compile warning, which is the result of BUG_ON(1) not always leading to the compiler treating the code path as unreachable:

  include/linux/ceph/osdmap.h: In function 'ceph_can_shift_osds':
  include/linux/ceph/osdmap.h:62:1: error: control reaches end of non-void function [-Werror=return-type]

Using BUG() here avoids the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
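The function in question looks roughly like this (a sketch reconstructed from the warning text, not a verbatim copy of osdmap.h):

	static inline bool ceph_can_shift_osds(struct ceph_pg_pool_info *pool)
	{
		switch (pool->type) {
		case CEPH_POOL_TYPE_REP:
			return true;
		case CEPH_POOL_TYPE_EC:
			return false;
		default:
			BUG();	/* unlike BUG_ON(1), always treated as
				 * unreachable, so no missing-return warning */
		}
	}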
2017-02-20  ceph: avoid updating mds_wanted too frequently  (Yan, Zheng)
User space may open/close a single file frequently. It's not good to send a client caps message to the MDS for each open/close syscall.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2017-02-20  ceph: set io_pages bdi hint  (Andreas Gerstmayr)
This patch sets the io_pages bdi hint based on the rsize mount option.

Without this patch, large buffered reads (request size > max readahead) are processed sequentially in chunks of the readahead size (i.e. read requests are sent out up to the readahead size, then the do_generic_file_read() function waits until the first page is received).

With this patch, read requests are sent out at once up to the size specified in the rsize mount option (default: 64 MB).

Signed-off-by: Andreas Gerstmayr <andreas.gerstmayr@catalysts.cc>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
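The core of the change is presumably a single assignment of this shape (a sketch; the exact location and struct layout in fs/ceph/super.c are assumptions):

	/* let the number of pages in flight follow rsize rather than
	 * the readahead window */
	fsc->backing_dev_info.io_pages =
		(fsc->mount_options->rsize + PAGE_SIZE - 1) >> PAGE_SHIFT;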
2017-02-20  ceph: fix spelling mistake: "enabing" -> "enabling"  (Colin Ian King)
Trivial fix to a spelling mistake in a debug message.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2017-02-20  ceph: cleanup ACCESS_ONCE -> READ_ONCE  (Seraphime Kirkovski)
This removes the uses of ACCESS_ONCE in favor of READ_ONCE.
Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
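The conversion is mechanical; for example (the field shown is illustrative):

	/* Before */
	struct ceph_mds_session *s = ACCESS_ONCE(cap->session);
	/* After: READ_ONCE is the modern, type-checked spelling */
	struct ceph_mds_session *s = READ_ONCE(cap->session);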
2017-02-20  ceph: pass parent inode info to ceph_encode_dentry_release if we have it  (Jeff Layton)
If we have a parent inode reference already, then we don't need to go back up the directory tree to find one.
Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  ceph: fix unsafe dcache access in ceph_encode_dentry_release  (Jeff Layton)
Accessing d_parent requires some sort of locking or it could vanish out from under us. Since we take the d_lock anyway, use that to fetch d_parent and take a reference to it, and then use that reference to call ceph_encode_inode_release.
Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
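A hedged sketch of one way to pin the parent across the call (the real patch may take a dentry reference instead, and the argument list of ceph_encode_inode_release is elided). d_parent cannot change while the child's d_lock is held, because a rename must take both d_locks:

	spin_lock(&dentry->d_lock);
	dir = d_inode_rcu(dentry->d_parent);	/* d_parent is stable here */
	if (dir)
		ihold(dir);			/* pin the parent inode */
	spin_unlock(&dentry->d_lock);

	if (dir) {
		ret = ceph_encode_inode_release(p, dir, /* ... */);
		iput(dir);
	}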
2017-02-20  ceph: pass parent dir ino info to build_dentry_path  (Jeff Layton)
In the event that we have a parent inode reference in the request, we can use that instead of mucking about in the dcache. Pass any parent inode info we have down to build_dentry_path so it can make use of it.
Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  ceph: clean up unsafe d_parent accesses in build_dentry_path  (Jeff Layton)
While we hold a reference to the dentry when build_dentry_path is called, we could end up racing with a rename that changes d_parent. Handle that situation correctly, by using the rcu_read_lock to ensure that the parent dentry and inode stick around long enough to safely check ceph_snap and ceph_ino.
Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
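A sketch of the RCU pattern described above (simplified; the real code also has to handle negative dentries):

	rcu_read_lock();
	parent = READ_ONCE(dentry->d_parent);	/* may race with rename */
	dir = d_inode_rcu(parent);
	if (dir) {
		/* dir can't be freed inside the RCU section, so
		 * ceph_snap()/ceph_ino() are safe to call on it */
		snap = ceph_snap(dir);
		ino = ceph_ino(dir);
	}
	rcu_read_unlock();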
2017-02-20  ceph: clean up unsafe d_parent access in __choose_mds  (Jeff Layton)
__choose_mds exists to pick an MDS to use when issuing a call. Doing that typically involves picking an inode and using the authoritative MDS for it. In most cases, that's pretty straightforward, as we are using an inode to which we hold a reference (usually represented by r_dentry or r_inode in the request).

In the case of a snapshotted directory however, we need to fetch the non-snapped parent, which involves walking back up the parents in the tree. The dentries in the snapshot dir are effectively frozen but the overall parent is _not_, and could vanish if a concurrent rename were to occur.

Clean this code up and take special care to ensure the validity of the entries we're working with. First, try to use the inode in r_locked_dir if one exists. If not and all we have is r_dentry, then we have to walk back up the tree. Use the rcu_read_lock for this so we can ensure that any d_parent we find won't go away, and take extra care to deal with the possibility that the dentries could go negative.

Change get_nonsnap_parent to return an inode, and take a reference to that inode before returning (if any). Change all of the other places where we set "inode" in __choose_mds to also take a reference, and then call iput on that inode before exiting the function.

Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20  Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD  (Paolo Bonzini)
Paul Mackerras writes:

"Please do a pull from my kvm-ppc-next branch to get some fixes which I would like to have in 4.11. There are four small commits there; two are fixes for potential host crashes in the new HPT resizing code, and the other two are changes to printks to make KVM on PPC a little less noisy."
2017-02-20  fork: Fix task_struct alignment  (Peter Zijlstra)
Stupid bug that wrecked the alignment of task_struct and caused WARN()s in the x86 FPU code on some platforms.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e274795ea7b7 ("locking/mutex: Fix mutex handoff")
Link: http://lkml.kernel.org/r/20170218142645.GH6500@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-20  ALSA: usb-audio: Tascam US-16x08 DSP mixer quirk  (Detlef Urban)
Add a mixer quirk for the Tascam US-16x08 USB interface. Even though this is a USB-compliant device, the input channels and DSP functions (EQ/compressor) aren't accessible by default.
Signed-off-by: Detlef Urban <onkel@paraair.de>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-02-20  objtool: Fix CONFIG_STACK_VALIDATION=y warning for out-of-tree modules  (Josh Poimboeuf)
When building a CONFIG_STACK_VALIDATION enabled kernel without the libelf devel package installed, the Makefile prints a warning:

  "Cannot use CONFIG_STACK_VALIDATION, please install libelf-dev, libelf-devel or elfutils-libelf-devel"

But when building an out-of-tree module, the warning doesn't show. Instead it tries to use objtool, and the build fails with:

  /bin/sh: ./tools/objtool/objtool: No such file or directory

Make sure the warning and the disabling of objtool occur in all cases, by moving the CONFIG_STACK_VALIDATION checks outside the 'ifeq ($(KBUILD_EXTMOD),)' block in the Makefile.

Tested-By: Marc MERLIN <marc@merlins.org>
Suggested-by: Jessica Yu <jeyu@redhat.com>
Reported-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Jessica Yu <jeyu@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 3b27a0c85d70 ("objtool: Detect and warn if libelf is missing and don't break the build")
Link: http://lkml.kernel.org/r/b3088ae4a8698143d4851965793c61fec2135b1f.1487182864.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-20  Merge branch 'for-next' into for-linus  (Takashi Iwai)
2017-02-19  md/raid1: fix a use-after-free bug  (Shaohua Li)
Commit fd76863 ("RAID1: a new I/O barrier implementation to remove resync window") introduced a use-after-free bug.
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-19  RAID1: avoid unnecessary spin locks in I/O barrier code  (colyli@suse.de)
When I ran a parallel-read performance test on a md raid1 device built from two NVMe SSDs, I was surprised by very bad throughput: with fio at 64KB block size, 40 sequential-read I/O jobs, and iodepth 128, overall throughput was only 2.7GB/s, around 50% of the ideal performance number. perf reports that lock contention happens in allow_barrier() and wait_barrier():

  - 41.41% fio [kernel.kallsyms] [k] _raw_spin_lock_irqsave
    - _raw_spin_lock_irqsave
      + 89.92% allow_barrier
      + 9.34% __wake_up
  - 37.30% fio [kernel.kallsyms] [k] _raw_spin_lock_irq
    - _raw_spin_lock_irq
      - 100.00% wait_barrier

The reason is that these I/O-barrier-related functions,
 - raise_barrier()
 - lower_barrier()
 - wait_barrier()
 - allow_barrier()
always take conf->resync_lock first, even when there are only regular read I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in the I/O barrier code, holding conf->resync_lock only when it has to. The original idea is from Hannes Reinecke, and Neil Brown provided comments to improve it. I continued to work on it and brought the patch into its current form.

In the new, simpler raid1 I/O barrier implementation, there are two wait-barrier functions:
 - wait_barrier(), which calls _wait_barrier(), is used for regular write I/O. If there is resync I/O happening on the same I/O barrier bucket, or the whole array is frozen, the task will wait until there is no barrier on the same barrier bucket, or the whole array is unfrozen.
 - wait_read_barrier(): since regular read I/O won't interfere with resync I/O (read_balance() will make sure only up-to-date data is read out), it is unnecessary to wait for a barrier in regular read I/Os; waiting is only necessary when the whole array is frozen.

The operations on conf->nr_pending[idx], conf->nr_waiting[idx] and conf->barrier[idx] are very carefully designed in raise_barrier(), lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to avoid unnecessary spin locks in these functions. Once conf->nr_pending[idx] is increased, a resync I/O with the same barrier bucket index has to wait in raise_barrier(). Then in _wait_barrier(), if no barrier is raised in the same barrier bucket index and the array is not frozen, the regular I/O doesn't need to hold conf->resync_lock; it can just increase conf->nr_pending[idx] and return to its caller. wait_read_barrier() is very similar to _wait_barrier(); the only difference is that it only waits when the array is frozen.

For heavy parallel-read I/Os, the lockless I/O barrier code gets rid of almost all the spin lock cost. This patch significantly improves raid1 read performance: in my testing, a raid1 device built from two NVMe SSDs, running fio with 64KB block size, 40 sequential-read I/O jobs, and iodepth 128, sees overall throughput increase from 2.7GB/s to 4.6GB/s (+70%).

Changelog
V4:
 - Change conf->nr_queued[] to atomic_t.
 - Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t))).
V3:
 - Add smp_mb__after_atomic() as Shaohua and Neil suggested.
 - Change conf->nr_queued[] from atomic_t to int.
 - Change conf->array_frozen from atomic_t back to int, and use READ_ONCE(conf->array_frozen) to check its value in _wait_barrier() and wait_read_barrier().
 - In _wait_barrier() and wait_read_barrier(), add a call to wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]), to fix a deadlock between _wait_barrier()/wait_read_barrier() and freeze_array().
V2:
 - Remove a spin_lock/unlock pair in raid1d().
 - Add more code comments to explain why there is no race when checking two atomic_t variables at the same time.
V1:
 - Original RFC patch for comments.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
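The fast path boils down to something like this - a sketch distilled from the description and the changelog, not verbatim kernel code:

	/* _wait_barrier() fast path: no resync_lock unless we must wait */
	atomic_inc(&conf->nr_pending[idx]);
	smp_mb__after_atomic();		/* pairs with raise_barrier() */
	if (!READ_ONCE(conf->array_frozen) &&
	    !atomic_read(&conf->barrier[idx]))
		return;			/* lockless: no resync on bucket */

	/* Slow path: undo, wake any freeze_array() waiter (the deadlock
	 * fix from the V3 changelog), then wait under conf->resync_lock. */
	atomic_dec(&conf->nr_pending[idx]);
	wake_up(&conf->wait_barrier);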
2017-02-19  RAID1: a new I/O barrier implementation to remove resync window  (colyli@suse.de)
Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.") introduced a sliding resync window for the raid1 I/O barrier. The idea limits I/O barriers to happen only inside a sliding resync window; regular I/Os outside this resync window don't need to wait for a barrier any more. On a large raid1 device, it helps a lot to improve parallel-write I/O throughput when there are background resync I/Os performing at the same time.

The idea of the sliding resync window is awesome, but the code complexity is a challenge. The sliding resync window requires several variables to work collectively; this is complex and very hard to get right. Just grep "Fixes: 79ef3a8aa1" in the kernel git log: there are 8 more patches fixing the original resync window patch. And this is not the end; any further related modification may easily introduce more regressions.

Therefore I decided to implement a much simpler raid1 I/O barrier by removing the resync window code; I believe life will be much easier.

The brief idea of the simpler barrier is:
 - Do not maintain a global unique resync window.
 - Use multiple hash buckets to reduce I/O barrier conflicts: a regular I/O only has to wait for a resync I/O when both have the same barrier bucket index, and vice versa.
 - I/O barrier conflicts can be reduced to an acceptable number if there are enough barrier buckets.

Here I explain how the barrier buckets are designed:

- BARRIER_UNIT_SECTOR_SIZE
The whole LBA address space of a raid1 device is divided into multiple barrier units, of size BARRIER_UNIT_SECTOR_SIZE. Bio requests won't cross the border of a barrier unit, which means the maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes. For random I/O, 64MB is large enough for both read and write requests; for sequential I/O, considering that the underlying block layer may merge them into larger requests, 64MB is still good enough. Neil also points out that for resync operation, "we want the resync to move from region to region fairly quickly so that the slowness caused by having to synchronize with the resync is averaged out over a fairly small time frame". For full-speed resync, 64MB should take less than 1 second. When resync is competing with other I/O, it could take a few minutes. Therefore 64MB is a fairly good range for resync.

- BARRIER_BUCKETS_NR
There are BARRIER_BUCKETS_NR buckets in total, defined by:

	#define BARRIER_BUCKETS_NR_BITS	(PAGE_SHIFT - 2)
	#define BARRIER_BUCKETS_NR	(1<<BARRIER_BUCKETS_NR_BITS)

This patch turns the following members of struct r1conf from integers into arrays of integers:

	- int nr_pending;
	- int nr_waiting;
	- int nr_queued;
	- int barrier;
	+ int *nr_pending;
	+ int *nr_waiting;
	+ int *nr_queued;
	+ int *barrier;

The number of array elements is defined as BARRIER_BUCKETS_NR. For a 4KB kernel page size, (PAGE_SHIFT - 2) indicates there are 1024 I/O barrier buckets, and each array of integers occupies a single memory page. 1024 buckets mean that a request smaller than the I/O barrier unit size has a ~0.1% chance of waiting for resync to pause, which is a small enough fraction. Also, requesting a single memory page is friendlier to the kernel page allocator than a larger size.

- The I/O barrier bucket is indexed by the bio's start sector
If multiple I/O requests hit different I/O barrier units, they only need to compete for the I/O barrier with other I/Os that hit the same I/O barrier bucket index. The index of the barrier bucket that a bio should look at is calculated by sector_to_idx(), defined in raid1.h as an inline function:

	static inline int sector_to_idx(sector_t sector)
	{
		return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
				 BARRIER_BUCKETS_NR_BITS);
	}

Here sector is the start sector number of a bio.

- A single bio won't cross the boundary of an I/O barrier unit
If a request crosses the boundary of a barrier unit, it will be split. A bio may be split in raid1_make_request() or raid1_sync_request(), if the sector count returned by align_to_barrier_unit_end() is smaller than the original bio size.

Comparing to the single sliding resync window:
 - Currently resync I/O grows linearly, so regular and resync I/O will conflict within a single barrier unit. The I/O behavior is therefore similar to the single sliding resync window.
 - But a barrier bucket is shared by all barrier units with an identical barrier unit index, so the probability of conflict might be higher than with a single sliding resync window, in the case that writing I/Os always hit barrier units whose bucket indexes are identical to the resync I/Os'. This is a very rare condition in real I/O workloads; I cannot imagine how it could happen in practice.
 - Therefore we can achieve a good enough, low conflict rate with a much simpler barrier algorithm and implementation.

There are a few changes that should be noticed:
 - In raid1d(), I changed the code to decrease conf->nr_queued[idx] in a single loop; it looks like this:

	spin_lock_irqsave(&conf->device_lock, flags);
	conf->nr_queued[idx]--;
	spin_unlock_irqrestore(&conf->device_lock, flags);

   This change generates more spin lock operations, but in the next patch of this set it will be replaced by a single line of code:

	atomic_dec(&conf->nr_queued[idx]);

   so we don't need to worry about the spin lock cost here.
 - Mainline raid1 code split the original raid1_make_request() into raid1_read_request() and raid1_write_request(). If the original bio crosses an I/O barrier unit boundary, it will be split before calling raid1_read_request() or raid1_write_request(); this makes the code logic simpler and clearer.
 - In this patch wait_barrier() is moved from raid1_make_request() to raid1_write_request(). In raid1_read_request(), the original wait_barrier() is replaced by wait_read_barrier(). The difference is that wait_read_barrier() only waits if the array is frozen; using different barrier functions in the different code paths makes the code cleaner and easier to read.

Changelog
V4:
 - Add alloc_r1bio() to remove redundant r1bio memory allocation code.
 - Fix many typos in patch comments.
 - Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
V3:
 - Rebase the patch against the latest upstream kernel code.
 - Many fixes from review comments by Neil:
   - Back to using pointers instead of arrays in struct r1conf.
   - Remove total_barriers from struct r1conf.
   - Add more patch comments to explain how/why the values of BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR were decided.
   - Use get_unqueued_pending() to replace get_all_pendings() and get_all_queued().
   - Increase bucket number from 512 to 1024.
 - Change code comment format, by review from Shaohua.
V2:
 - Use bio_split() to split the original bio if it crosses a barrier unit boundary, to make the code simpler, by suggestion from Shaohua and Neil.
 - Use hash_long() to replace the original linear hash, to avoid a possible conflict between resync I/O and sequential write I/O, by suggestion from Shaohua.
 - Add conf->total_barriers to record barrier depth, which is used to control the number of parallel sync I/O barriers, by suggestion from Shaohua.
 - In the V1 patch, the barrier-bucket-related members of r1conf below were allocated in a memory page. To make the code simpler, the V2 patch moves the memory into struct r1conf, like this:

	- int nr_pending;
	- int nr_waiting;
	- int nr_queued;
	- int barrier;
	+ int nr_pending[BARRIER_BUCKETS_NR];
	+ int nr_waiting[BARRIER_BUCKETS_NR];
	+ int nr_queued[BARRIER_BUCKETS_NR];
	+ int barrier[BARRIER_BUCKETS_NR];

   This change is by suggestion from Shaohua.
 - Remove some irrelevant code comments, by suggestion from Guoqing.
 - Add a missing wait_barrier() before jumping to retry_write, in raid1_make_write_request().
V1:
 - Original RFC patch for comments.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
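Putting the pieces together, the per-bio flow looks roughly like this (a sketch; the bio_set used for the split is an assumption):

	/* Clamp the bio to its barrier unit, splitting if it crosses one */
	sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector,
					    bio_sectors(bio));
	if (sectors < bio_sectors(bio)) {
		struct bio *split = bio_split(bio, sectors, GFP_NOIO,
					      fs_bio_set);
		bio_chain(split, bio);
		generic_make_request(bio);	/* requeue the remainder */
		bio = split;
	}
	/* All sectors of 'bio' now share a single barrier bucket */
	idx = sector_to_idx(bio->bi_iter.bi_sector);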
2017-02-19  sctp: check duplicate node before inserting a new transport  (Xin Long)
sctp has changed to use rhlist for the transport rhashtable since commit 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable"). But rhltable_insert_key doesn't check for a duplicate node when inserting one, unlike rhashtable_lookup_insert_key. It may cause duplicate assocs/transports in the rhashtable, like:

	client (addr A, B)             server (addr X, Y)
	connect to X    INIT (1)     ------------>
	connect to Y    INIT (2)     ------------>
	                INIT_ACK (1) <------------
	                INIT_ACK (2) <------------

After sending INIT (2), one transport is created and hashed into the rhashtable. But when receiving INIT_ACK (1) and processing the address params, another transport is created and hashed into the rhashtable with the same addr Y and EP as the previous transport. This confuses assoc/transport lookup.

This patch fixes it by returning an error if any duplicate node exists before inserting it.

Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable")
Reported-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
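A hedged sketch of the duplicate check (the table, params, key-arg and member names are assumed to match those in net/sctp/input.c):

	struct rhlist_head *list, *pos;
	struct sctp_transport *cur;

	rcu_read_lock();
	list = rhltable_lookup(&sctp_transport_hashtable, &arg,
			       sctp_hash_params);
	/* walk the bucket's chain; bail out if this (EP, addr) exists */
	rhl_for_each_entry_rcu(cur, pos, list, node) {
		if (cur->asoc->ep == t->asoc->ep) {
			rcu_read_unlock();
			return -EEXIST;		/* duplicate: reject */
		}
	}
	rcu_read_unlock();

	return rhltable_insert_key(&sctp_transport_hashtable, &arg,
				   &t->node, sctp_hash_params);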
2017-02-19  of_mdio: Add "broadcom,bcm5241" to the whitelist  (David Daney)
Some Cavium dev boards have firmware which doesn't supply a proper "ethernet-phy-ieee802.3-c22" compatible property. Restore these boards to working order by whitelisting this compatible value.
Signed-off-by: David Daney <david.daney@cavium.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
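The change is presumably a one-entry addition to the whitelist table in drivers/of/of_mdio.c (table name and neighboring entries are assumptions for this sketch):

	/* PHYs that are known-good even without an
	 * "ethernet-phy-ieee802.3-c22" compatible string */
	static const struct of_device_id whitelist_phys[] = {
		{ .compatible = "brcm,40nm-ephy" },
		{ .compatible = "broadcom,bcm5241" },	/* new entry */
		{ .compatible = "marvell,88E1111" },
		{}
	};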
2017-02-19  net: ethernet: stmmac: dwmac-rk: Add RK3328 gmac support  (david.wu)
Add constants and callback functions for the dwmac on RK3328 SoCs. As can be seen, the base structure is the same; only the registers and the bits in them moved slightly.
Signed-off-by: david.wu <david.wu@rock-chips.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-19  sctp: sctp_transport_dst_check should check if transport pmtu is dst mtu  (Xin Long)
Now when sending a packet, sctp_transport_dst_check checks whether the dst is obsolete by calling ipv4/ip6_dst_check. But these return obsolete only when a new cache entry is added; after that, when the cache's pmtu is updated again, nothing triggers an update of the transport's dst/pmtu.

It can be reproduced by reducing a route's pmtu twice. The first time, the client adds a new cache entry, and transport->pathmtu gets updated because sctp_transport_dst_check finds the dst obsolete. But the second time, only the cache's mtu is updated, and the sctp client never sends out any packet, because transport->pmtu has no chance to update.

This patch fixes it by also checking whether the transport pmtu equals the dst mtu in sctp_transport_dst_check, so that transport->pmtu can be updated in time.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
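A hedged sketch of the adjusted check (how the stale dst is dropped is simplified here; the real code may use a dedicated release helper):

	static inline void sctp_transport_dst_check(struct sctp_transport *t)
	{
		/* Refresh the route if the cache is stale OR if the
		 * cached pmtu no longer matches the dst's mtu (the
		 * newly added check). */
		if (t->dst && (!dst_check(t->dst, t->dst_cookie) ||
			       t->pathmtu != dst_mtu(t->dst))) {
			dst_release(t->dst);	/* force a fresh lookup,
						 * which re-reads the pmtu */
			t->dst = NULL;
		}
	}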
2017-02-19  Merge branch 'sctp-rcv-side-stream-reconf-ssn-reset-req-chunk'  (David S. Miller)
Xin Long says:

====================
sctp: add receiver-side procedures for stream reconf ssn reset request chunk

Patches 3/7 and 4/7 implement the receiver-side procedures for the Outgoing and Incoming SSN Reset Request Parameters described in rfc6525 sections 5.2.2 and 5.2.3. Patches 1/7 and 2/7 come ahead of them to define some APIs. Patches 5/7-7/7 add the processing of the reconf chunk event in the rx path.

Note that with this patchset, asoc->reconf_enable has no chance yet to be set, until the patch "sctp: add get and set sockopt for reconf_enable" is applied in the future, as we cannot just enable it while sctp is not yet capable of processing reconf chunks.

v1->v2:
 - re-split the patchset and make sure it has no dead code for review.
 - rename the titles of the commits and improve some changelogs.
 - drop __packed from some structures in patch 1/7.
 - fix some kbuild warnings in patch 3/7 by initializing str_p = NULL.
 - sctp_chunk_lookup_strreset_param changes to return sctp_paramhdr_t * and uses sctp_strreset_tsnreq to access request_seq in patch 3/7.
 - use __u<size> in uapi sctp.h in patch 1/7.
 - do str_list endian conversion when generating stream_reset_event in patch 2/7.
 - remove str_list endian conversion, pass the resp_seq param in network endian to lookup_strreset_param in patch 3/7.
 - move str_list endian conversion out of sctp_make_strreset_req, so that sctp_make_strreset_req can be used more conveniently to process inreq in patch 4/7.
 - remove sctp_merge_reconf_chunk and don't support responses with multiple params in patch 6/7.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>