summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2017-06-20fs: Use RWF_* flags for AIO operationsGoldwyn Rodrigues
aio_rw_flags is introduced in struct iocb (using aio_reserved1) which will carry the RWF_* flags. We cannot use aio_flags because they are not checked for validity which may break existing applications. Note, the only place RWF_HIPRI comes in effect is dio_await_one(). All the rest of the locations, aio code return -EIOCBQUEUED before the checks for RWF_HIPRI. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20fs: Separate out kiocb flags setup based on RWF_* flagsGoldwyn Rodrigues
Also added RWF_SUPPORTED to encompass all flags. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20btrfs: Round down values which are written for total_bytes_sizeNikolay Borisov
We got an internal report about a file system not wanting to mount following 99e3ecfcb9f4 ("Btrfs: add more validation checks for superblock"). BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with fs_devices total_rw_bytes 1000203820544 Subtracting the numbers we get a difference of less than a 4kb. Upon closer inspection it became apparent that mkfs actually rounds down the size of the device to a multiple of sector size. However, the same cannot be said for various functions which modify the total size and are called from btrfs_balance as well as when adding a new device. So this patch ensures that values being saved into on-disk data structures are always rounded down to a multiple of sectorsize. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20btrfs: Manually implement device_total_bytes getter/setterNikolay Borisov
The device->total_bytes member needs to always be rounded down to sectorsize so that it corresponds to the value of super->total_bytes. However, there are multiple places where the setter is fed a value which is not rounded which can cause a fs to be unmountable due to the check introduced in 99e3ecfcb9f4 ("Btrfs: add more validation checks for superblock"). This patch implements the getter/setter manually so that in a later patch I can add necessary code to catch offenders. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20btrfs: obsolete and remove mount option alloc_startDavid Sterba
The mount option alloc_start was used in the past for debugging and stressing the chunk allocator. Not meant to be used by users, so we're not breaking anybody's setup. There was some added complexity handling changes of the value and when it was not same as default. Such code has likely been untested and I think it's better to remove it. This patch kills all use of alloc_start, and by doing that also fixes a bug when alloc_size is set, potentially called from statfs: in btrfs_calc_avail_data_space, traversing the list in RCU, the RCU protection is temporarily dropped so btrfs_account_dev_extents_size can be called and then RCU is locked again! Doing that inside list_for_each_entry_rcu is just asking for trouble, but unlikely to be observed in practice. Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20btrfs: move fs_info::fs_frozen to the flagsDavid Sterba
We can keep the state among the other fs_info flags, there's no reason why fs_frozen would need to be separate. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20btrfs: cleanup duplicate return value in insert_inline_extentDavid Sterba
The pattern when err is used for function exit and ret is used for return values of callees is not used here. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20fs/proc: kcore: use kcore_list type to check for vmalloc/module addressArd Biesheuvel
Instead of passing each start address into is_vmalloc_or_module_addr() to decide whether it falls into either the VMALLOC or the MODULES region, we can simply check the type field of the current kcore_list entry, since it will be set to KCORE_VMALLOC based on exactly the same conditions. As a bonus, when reading the KCORE_TEXT region on architectures that have one, this will avoid using vread() on the region if it happens to intersect with a KCORE_VMALLOC region. This is due the fact that the KCORE_TEXT region is the first one to be added to the kcore region list. Reported-by: Tan Xiaojun <tanxiaojun@huawei.com> Tested-by: Tan Xiaojun <tanxiaojun@huawei.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Laura Abbott <labbott@redhat.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-06-20sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list namingIngo Molnar
So I've noticed a number of instances where it was not obvious from the code whether ->task_list was for a wait-queue head or a wait-queue entry. Furthermore, there's a number of wait-queue users where the lists are not for 'tasks' but other entities (poll tables, etc.), in which case the 'task_list' name is actively confusing. To clear this all up, name the wait-queue head and entry list structure fields unambiguously: struct wait_queue_head::task_list => ::head struct wait_queue_entry::task_list => ::entry For example, this code: rqw->wait.task_list.next != &wait->task_list ... is was pretty unclear (to me) what it's doing, while now it's written this way: rqw->wait.head.next != &wait->entry ... which makes it pretty clear that we are iterating a list until we see the head. Other examples are: list_for_each_entry_safe(pos, next, &x->task_list, task_list) { list_for_each_entry(wq, &fence->wait.task_list, task_list) { ... where it's unclear (to me) what we are iterating, and during review it's hard to tell whether it's trying to walk a wait-queue entry (which would be a bug), while now it's written as: list_for_each_entry_safe(pos, next, &x->head, entry) { list_for_each_entry(wq, &fence->wait.head, entry) { Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into ↵Ingo Molnar
<linux/wait_bit.h> The wait_bit*() types and APIs are mixed into wait.h, but they are a pretty orthogonal extension of wait-queues. Furthermore, only about 50 kernel files use these APIs, while over 1000 use the regular wait-queue functionality. So clean up the main wait.h by moving the wait-bit functionality out of it, into a separate .h and .c file: include/linux/wait_bit.h for types and APIs kernel/sched/wait_bit.c for the implementation Update all header dependencies. This reduces the size of wait.h rather significantly, by about 30%. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20sched/wait: Standardize 'struct wait_bit_queue' wait-queue entry field nameIngo Molnar
Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly name it as a wait-queue entry. Propagate it to a couple of usage sites where the wait-bit-queue internals are exposed. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20sched/wait: Rename wait_queue_t => wait_queue_entry_tIngo Molnar
Rename: wait_queue_t => wait_queue_entry_t 'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue", but in reality it's a queue *entry*. The 'real' queue is the wait queue head, which had to carry the name. Start sorting this out by renaming it to 'wait_queue_entry_t'. This also allows the real structure name 'struct __wait_queue' to lose its double underscore and become 'struct wait_queue_entry', which is the more canonical nomenclature for such data types. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-19cifs: use get_random_u32 for 32-bit lock randomJason A. Donenfeld
Using get_random_u32 here is faster, more fitting of the use case, and just as cryptographically secure. It also has the benefit of providing better randomness at early boot, which is sometimes when this is used. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Steve French <sfrench@samba.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-19xfs: separate function to check if inode shares extentsDarrick J. Wong
Separate the "clear reflink flag" function into one function that checks if the flag is needed, and a second function that checks and clears the flag. The inode scrub code will want to check the necessity of the flag without clearing it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: reflink find shared should take a transactionDarrick J. Wong
Adapt _reflink_find_shared to take an optional transaction pointer. The inode scrubber code will need to decide (within transaction context) if a file has shared blocks. To avoid buffer deadlocks, we must pass the tp through to this function's utility calls. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: check if an inode is cached and allocatedDarrick J. Wong
Check the inode cache for a particular inode number. If it's in the cache, check that it's not currently being reclaimed. If it's not being reclaimed, return zero if the inode is allocated. This function will be used by various scrubbers to decide if the cache is more up to date than the disk in terms of checking if an inode is allocated. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: export _inobt_btrec_to_irec and _ialloc_cluster_alignment for scrubDarrick J. Wong
Create a function to extract an in-core inobt record from a generic btree_rec union so that scrub will be able to check inobt records and check inode block alignment. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: plumb in needed functions for range querying of various btreesDarrick J. Wong
Plumb in the pieces (init_high_key, diff_two_keys) necessary to call query_range on the inode space and block mapping btrees and to extract raw btree records. This will eventually be used by the inobt and bmbt scrubbers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: export various function for the online scrubberDarrick J. Wong
Export various internal functions so that the online scrubber can use them to check the state of metadata. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: always compile the btree inorder check functionsDarrick J. Wong
The btree record and key inorder check functions will be used by the btree scrubber code, so make sure they're always built. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: remove double-underscore integer typesDarrick J. Wong
This is a purely mechanical patch that removes the private __{u,}int{8,16,32,64}_t typedefs in favor of using the system {u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform the transformation and fix the resulting whitespace and indentation errors: s/typedef\t__uint8_t/typedef __uint8_t\t/g s/typedef\t__uint/typedef __uint/g s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g s/__uint8_t\t/__uint8_t\t\t/g s/__uint/uint/g s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g s/__int/int/g /^typedef.*int[0-9]*_t;$/d Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-06-19xfs: optimize _btree_query_allDarrick J. Wong
Don't bother wandering our way through the leaf nodes when the caller issues a query_all; just zoom down the left side of the tree and walk rightwards along level zero. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19btrfs: use GFP_KERNEL in btrfs_init_dev_replace_tgtdevDavid Sterba
The function is called from ioctl context and we don't hold any locks that take part in writeback. Right now it's only fs_info::volume_mutex. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: use GFP_KERNEL in btrfs_calc_avail_data_spaceDavid Sterba
We don't hold any locks here. Inidirectly called from statfs. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: Use btrfs_space_info_used instead of opencoding itNikolay Borisov
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: wait part of the write_dev_flush() can be separated outAnand Jain
Submit and wait parts of write_dev_flush() can be split into two separate functions for better readability. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: remove redundant null bdev counting during flush submissionAnand Jain
There is no extra benefit to count null bdev during the submit loop, as these null devices will be anyway checked during command completion device loop just after the submit loop. We are holding the device_list_mutex, the device->bdev status won't change in between. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: write_dev_flush does not return ENOMEM anymoreAnand Jain
Since commit "btrfs: btrfs_io_bio_alloc never fails, skip error handling" write_dev_flush will not return ENOMEM in the sending part. We do not need to check for it in the callers. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ updated changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19Btrfs: compression must free at least one sector sizeTimofey Titovets
We already skip storing data where compression does not make the result at least one byte less. Let's make the logic better and check that compression frees at least one sector size of bytes, otherwise it's not that useful. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> [ changelog updated ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: sink gfp parameter to btrfs_io_bio_allocDavid Sterba
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we change it back from GFP_KERNEL in scrub. I'd rather save a few stack bytes from not passing the gfp flags in the remaining, more imporatant, contexts and the bio allocating API now looks more consistent. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: add helper to initialize the non-bio part of btrfs_io_bioDavid Sterba
We use btrfs_bioset for bios and ask to allocate the entire size of btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is initialized but the bytes from 0 to offset of 'bio' are left uninitialized. Although we initialize some of the members in our helpers, we should initialize the whole structures. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: document mandatory order of bio in btrfs_io_bioDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19Btrfs: skip checksum verification if IO error occursLiu Bo
Currently dio read also goes to verify checksum if -EIO has been returned, although it usually fails on checksum, it's not necessary at all, we could directly check if there is another copy to read. And with this, the behavior of dio read is now consistent with that of buffered read. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ use bool for uptodate ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19Btrfs: tolerate errors if we have retried successfullyLiu Bo
With raid1 profile, dio read isn't tolerating IO errors if read length is less than the stripe length (64K). Our bio didn't get split in btrfs_submit_direct_hook() if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED) is true and that happens when the read length is less than 64k. In this case, if the underlying device returns error somehow, bio->bi_error has recorded that error. If we could recover the correct data from another copy in profile raid1/10/5/6, with btrfs_subio_endio_read() returning 0, bio would have the correct data in its vector, but bio->bi_error is not updated accordingly so that the following dio_end_io(dio_bio, bio->bi_error) makes directIO think this read has failed. This fixes the problem by setting bio's error to 0 if a good copy has been found. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: pass bytes to btrfs_bio_allocDavid Sterba
Most callers of btrfs_bio_alloc convert from bytes to sectors. Hide that in the helper and simplify the logic in the callsers. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: opencode trivial compressed_bio_alloc, simplify error handlingDavid Sterba
compressed_bio_alloc is now a trivial wrapper around btrfs_bio_alloc, no point keeping it. The error handling can be simplified, as we know btrfs_bio_alloc will never fail. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: remove redundant parameters from btrfs_bio_allocDavid Sterba
All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES. submit_extent_page adds __GFP_HIGH that does not make a difference in our case as it allows access to memory reserves but otherwise does not change the constraints. Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: sink gfp parameter to btrfs_bio_cloneDavid Sterba
All callers pass GFP_NOFS. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: btrfs_io_bio_alloc never fails, skip error handlingDavid Sterba
Update direct callers of btrfs_io_bio_alloc that do error handling, that we can now remove. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: btrfs_bio_clone never fails, skip error handlingDavid Sterba
Update direct callers of btrfs_bio_clone that do error handling, that we can now remove. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: btrfs_bio_alloc never fails, skip error handlingDavid Sterba
Update direct callers of btrfs_bio_alloc that do error handling, that we can now remove. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: bioset allocations will never fail, adapt our helpersDavid Sterba
Christoph pointed out that bio allocations backed by a bioset will never fail. As we always use a bioset for all bio allocations, we can skip the error handling. This patch adjusts our low-level helpers, the cascaded changes to all callers will come next. CC: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: switch to kvmalloc and GFP_KERNEL in lzo/zlib alloc_workspaceDavid Sterba
The compression workspace buffers are larger than a page so we use vmalloc, unconditionally. This is not always necessary as there might be contiguous memory available. Let's use the kvmalloc helpers that will try kmalloc first and fallback to vmalloc. For that they require GFP_KERNEL flags. As we now have the alloc_workspace calls protected by memalloc_nofs in the critical contexts, we can safely use GFP_KERNEL. Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: switch kmallocs to GFP_KERNEL in lzo/zlib alloc_workspaceDavid Sterba
As alloc_workspace is now protected by memalloc_nofs where needed, we can switch the kmalloc to use GFP_KERNEL. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: add memalloc_nofs protections around alloc_workspace callbackDavid Sterba
The workspaces are preallocated at the beginning where we can safely use GFP_KERNEL, but in some cases the find_workspace might reach the allocation again, now in a more restricted context when the bios or pages are being compressed. To avoid potential lockup when alloc_workspace -> vmalloc would silently use the GFP_KERNEL, add the memalloc_nofs helpers around the critical call site. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: adjust includes after vmalloc removalDavid Sterba
As we don't use vmalloc/vzalloc/vfree directly in ctree.c, we can now use the proper header that defines kvmalloc. Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: use GFP_KERNEL in init_ipathDavid Sterba
Now that init_ipath is called either from a safe context or with memalloc_nofs protection, we can switch to GFP_KERNEL allocations in init_path and init_data_container. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: scrub: add memalloc_nofs protection around init_ipathDavid Sterba
init_ipath is called from a safe ioctl context and from scrub when printing an error. The protection is added for three reasons: * init_data_container calls vmalloc and this does not work as expected in the GFP_NOFS context, so this silently does GFP_KERNEL and might deadlock in some cases * keep the context constraint of GFP_NOFS, used by scrub * we want to use GFP_KERNEL unconditionally inside init_ipath or its callees Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: send: use kvmalloc in iterate_dir_itemDavid Sterba
We use a growing buffer for xattrs larger than a page size, at some point vmalloc is unconditionally used for larger buffers. We can still try to avoid it using the kvmalloc helper. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: replace opencoded kvzalloc with the helperDavid Sterba
The logic of kmalloc and vmalloc fallback is opencoded in several places, we can now use the existing helper. Signed-off-by: David Sterba <dsterba@suse.com>