Age | Commit message | Author
2012-05-30 | Btrfs: fix the same inode id problem when doing auto defragment | Miao Xie
Two files in different subvolumes may have the same inode id, so the rb-tree which is used to manage the defragment objects must take the subvolume into account as well. This patch fixes this problem. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
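The ordering the fix needs can be sketched as a comparison on a two-part key. This is a minimal illustration, not the btrfs code: `struct defrag_key` and `compare_defrag_key()` are hypothetical names, and a real rb-tree insert would call such a comparator at each node it visits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical key for one auto-defrag record: the subvolume (root) id
 * plus the inode number, which is only unique within that subvolume. */
struct defrag_key {
	uint64_t root;
	uint64_t ino;
};

/* Order records by subvolume first, then by inode number, so two inodes
 * with the same number in different subvolumes never compare equal.
 * Returns <0, 0 or >0 in the style of memcmp(). */
static int compare_defrag_key(const struct defrag_key *a,
			      const struct defrag_key *b)
{
	if (a->root != b->root)
		return a->root < b->root ? -1 : 1;
	if (a->ino != b->ino)
		return a->ino < b->ino ? -1 : 1;
	return 0;
}
```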
2012-05-30 | Btrfs: fall back to non-inline if we don't have enough space | Josef Bacik
If cow_file_range_inline fails with ENOSPC we abort the transaction, which isn't very nice. This really shouldn't be happening anyway, but there's no sense in making it a horrible error when we can easily just go allocate normal data space for this stuff. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 | Btrfs: fix how we deal with the orphan block rsv | Josef Bacik
Ceph was hitting a race where we would remove an inode from the per-root orphan list before we released the space we had reserved for the inode. We don't actually need a list at all; we just need to make sure the root doesn't try to free up the orphan reserve until after the inodes have released their reservations. So use an atomic counter instead of a list on the root, and only decrement the counter after we've released our reservation. I've tested this as well as several others, and we no longer see the warnings that used to appear while running ceph. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
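A minimal sketch of the counter-based scheme, using C11 atomics as a userspace stand-in for the kernel's atomic_t. The structure and function names are hypothetical, and the actual space-reservation bookkeeping is left out; only the ordering rule (decrement after release, free only at zero) is shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-root state: rather than a list of orphan inodes, keep
 * only a count of inodes that still hold a share of the orphan reserve. */
struct root_orphans {
	atomic_int orphan_inodes;
};

static void orphans_init(struct root_orphans *root)
{
	atomic_init(&root->orphan_inodes, 0);
}

/* An inode becomes an orphan and takes part of the reservation. */
static void orphan_add(struct root_orphans *root)
{
	atomic_fetch_add(&root->orphan_inodes, 1);
}

/* Called only AFTER the inode has released its reservation, so the
 * counter cannot reach zero while space is still held. */
static void orphan_release_done(struct root_orphans *root)
{
	atomic_fetch_sub(&root->orphan_inodes, 1);
}

/* The root may free its orphan block reserve only when no inode is
 * still between "orphaned" and "reservation released". */
static bool root_may_free_orphan_rsv(struct root_orphans *root)
{
	return atomic_load(&root->orphan_inodes) == 0;
}
```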
2012-05-30 | Btrfs: convert the inode bit field to use the actual bit operations | Josef Bacik
While I was working on an orphan problem, Miao pointed out that messing with a bitfield where different ranges are protected by different locks doesn't work out right. It turns out we've been doing this forever: different parts of the bit field are protected either by no lock at all or by different locks, which could cause all sorts of weird problems, including the issue I was hitting. So instead add a runtime_flags field that we use the normal bit operations on, which are all atomic, so we can keep our no/different locking for the different flags, and make force_compress its own field so it can be treated normally. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
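Why plain bit-fields are unsafe here: writing one bit-field member compiles to a read-modify-write of the whole containing word, so an update made under one lock can silently undo a concurrent update to a neighbouring member made under another lock (or none). The sketch below uses C11 atomics as a userspace stand-in for the kernel's set_bit()/clear_bit()/test_bit(); the flag names are illustrative only, not the btrfs definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative flag numbers; not the actual btrfs runtime_flags bits. */
enum { FLAG_IN_DEFRAG = 0, FLAG_HAS_ORPHAN_ITEM = 1 };

/* Each update is one atomic read-modify-write of the whole word, so two
 * threads touching different bits cannot overwrite each other's change. */
static void flag_set(atomic_ulong *flags, int nr)
{
	atomic_fetch_or(flags, 1UL << nr);
}

static void flag_clear(atomic_ulong *flags, int nr)
{
	atomic_fetch_and(flags, ~(1UL << nr));
}

static bool flag_test(atomic_ulong *flags, int nr)
{
	return atomic_load(flags) & (1UL << nr);
}
```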
2012-05-30 | Btrfs: merge contiguous regions when loading free space cache | Josef Bacik
When we write out the free space cache we write out everything that is in our in-memory tree, and then we just walk the pinned extents tree and write anything we see there. The problem with this is that during normal operations the pinned extents will be merged back into the free space tree, and then we can allocate space from the merged areas and commit them to the tree log. If we crash and replay the tree log, we will crash again because the tree log will try to free up space from what looks like 2 separate but contiguous entries, since one entry is from the original free space cache and the other was a pinned extent that was merged back. To fix this we just need to walk the free space tree after we load it and merge contiguous entries back together. This will keep the tree log stuff from breaking and it will make the allocator behave more nicely. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
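The merge step described above amounts to one pass over entries sorted by offset. This sketch assumes a plain sorted array of hypothetical `struct free_range` entries rather than the rb-tree the free space cache actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-space entry covering [offset, offset + bytes). */
struct free_range {
	uint64_t offset;
	uint64_t bytes;
};

/* Merge entries that touch end-to-start. Assumes r[] is sorted by offset
 * and non-overlapping, as entries loaded from the cache would be.
 * Compacts the array in place and returns the new number of entries. */
static size_t merge_contiguous(struct free_range *r, size_t n)
{
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		if (out > 0 &&
		    r[out - 1].offset + r[out - 1].bytes == r[i].offset) {
			/* r[i] starts exactly where the previous entry ends. */
			r[out - 1].bytes += r[i].bytes;
		} else {
			r[out++] = r[i];
		}
	}
	return out;
}
```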
2012-05-30 | Btrfs: do not do balance in readonly mode | Liu Bo
In normal cases, we are not allowed to balance in RO mode. However, when we're using a seeding device and add another device to sprout, things change:
    $ mkfs.btrfs /dev/sdb7
    $ btrfstune -S 1 /dev/sdb7
    $ mount /dev/sdb7 /mnt/btrfs -o ro
    $ btrfs fi bal /mnt/btrfs              -> fails
    $ btrfs dev add /dev/sdb8 /mnt/btrfs
    $ btrfs fi bal /mnt/btrfs              -> works!
This should not be an exception, so add another check on the mount flags. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com>
2012-05-30 | Btrfs: use fastpath in extent state ops as much as possible | Liu Bo
Fully utilize our extent state's new helper functions to use the fast path as much as possible. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com>
2012-05-30 | Btrfs: fix wrong error returned by adding a device | Liu Bo
Reproduce:
    $ mkfs.btrfs /dev/sdb7
    $ mount /dev/sdb7 /mnt/btrfs -o ro
    $ btrfs dev add /dev/sdb8 /mnt/btrfs
    ERROR: error adding the device '/dev/sdb8' - Invalid argument
Since the filesystem is mounted read-only and /dev/sdb7 is not a seeding device, a read-only error is preferred. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com>
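The two read-only fixes in this log ("do not do balance in readonly mode" and "fix wrong error returned by adding a device") both come down to a flag check before the operation starts. A minimal sketch, assuming a hypothetical SKETCH_RDONLY bit standing in for the read-only mount flag and -EROFS as the reported error; the function names are illustrative, not the btrfs ioctl handlers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the read-only mount flag. */
#define SKETCH_RDONLY 0x1UL

/* Balance: refuse outright on a read-only mount, even if a sprouted
 * device would otherwise make the operation look possible. */
static int check_balance(unsigned long mnt_flags)
{
	return (mnt_flags & SKETCH_RDONLY) ? -EROFS : 0;
}

/* Device add: a read-only mount is acceptable only when the filesystem
 * is a seed being sprouted; otherwise report a read-only error rather
 * than -EINVAL. */
static int check_dev_add(unsigned long mnt_flags, bool seeding)
{
	if ((mnt_flags & SKETCH_RDONLY) && !seeding)
		return -EROFS;
	return 0;
}
```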
2012-05-30 | Btrfs: finish ordered extents in their own thread | Josef Bacik
We noticed that ordered extent completion doesn't really rely on having a page and that it could be done independently of ending the writeback on a page. This patch stops us from doing the threaded endio work for normal buffered writes and direct writes, so we can end page writeback as soon as possible (in irq context) and only start threads to do the ordered work when it is actually done. Compression needs some rework to take advantage of this as well, but at the moment it has to do a find_get_page in its endio handler, so it must be done in its own thread. This makes direct writes quite a bit faster. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 | Btrfs: do not check delalloc when updating disk_i_size | Josef Bacik
We are checking delalloc to see if it is ok to update the i_size. There are 2 cases where it stops us from updating:
1) If there is delalloc between our current disk_i_size and this ordered extent
2) If there is delalloc between our current ordered extent and the next ordered extent
These tests are racy, however, since we can set delalloc for these ranges at any time. Also, for the first case, if we notice there is delalloc between disk_i_size and our ordered extent, we will not update disk_i_size and assume that when that delalloc bit gets written out it will update everything properly. However, if we crash before that, we will have file extents outside of our i_size, which is not good, so this test is dangerous as well as racy. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 | Btrfs: avoid buffer overrun in mount option handling | Jim Meyering
There is an off-by-one error: we allocate room for a maximal result string but without room for a trailing NUL. That can lead to returning a transformed string that is not NUL-terminated, and then to a caller reading beyond the end of the malloc'd buffer. Rewrite to s/kzalloc/kmalloc/, remove an unwarranted use of strncpy (the result is guaranteed to fit), remove a dead strlen at the end, and change a few variable names and comments. Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Jim Meyering <meyering@redhat.com>
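The off-by-one is easiest to see in a small helper that builds a new string. `join_options()` is a hypothetical illustration of the sizing rule, not the btrfs mount-option code: the allocation must cover every character of the result plus one byte for the terminator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build "<opts>,<extra>" in a new heap buffer.
 * Forgetting the final "+ 1" for the NUL is the classic off-by-one. */
static char *join_options(const char *opts, const char *extra)
{
	size_t a = strlen(opts), b = strlen(extra);
	char *out = malloc(a + 1 + b + 1);	/* room for ',' and the NUL */

	if (!out)
		return NULL;
	memcpy(out, opts, a);
	out[a] = ',';
	memcpy(out + a + 1, extra, b);
	out[a + 1 + b] = '\0';			/* always terminated */
	return out;
}
```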
2012-05-30 | Btrfs: NUL-terminate path buffer in DEV_INFO ioctl result | Jim Meyering
A device with a name of length BTRFS_DEVICE_PATH_NAME_MAX or longer would not be NUL-terminated in the DEV_INFO ioctl result buffer. Signed-off-by: Jim Meyering <meyering@redhat.com>
2012-05-30 | Btrfs: avoid buffer overrun in btrfs_printk | Jim Meyering
The buffer read-overrun would be triggered by a printk format starting with <N>, where N is a single digit. NUL-terminate after the copy, and use memcpy, not strncpy, since we know the string we're copying fits in the destination buffer and contains no NUL byte. Signed-off-by: Jim Meyering <meyering@redhat.com>
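A sketch of the safe prefix copy, assuming a four-byte level buffer; `split_level()` is a hypothetical helper, not the btrfs_printk() source. With exactly three non-NUL bytes to copy, strncpy(lvl, fmt, 3) would write no terminator, which is why the explicit `lvl[3] = '\0'` matters.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Copy a leading "<N>" log-level marker (N a single digit) out of fmt
 * into lvl[4] and return the remainder of the format string. The
 * short-circuit tests never read past a NUL in a short fmt. */
static const char *split_level(const char *fmt, char lvl[4])
{
	lvl[0] = '\0';
	if (fmt[0] == '<' && isdigit((unsigned char)fmt[1]) && fmt[2] == '>') {
		memcpy(lvl, fmt, 3);	/* 3 known non-NUL bytes */
		lvl[3] = '\0';		/* memcpy does not terminate */
		fmt += 3;
	}
	return fmt;
}
```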
2012-05-30 | Fix minor type issues | Daniel J Blueman
Address some minor type issues identified by the sparse checker. Signed-off-by: Daniel J Blueman <daniel@quora.org>
2012-05-30 | btrfs: allow changing 'thread_pool' size at remount time | Sergei Trofimovich
Changing 'mount -oremount,thread_pool=2 /' had no effect: the maximum number of worker threads is specified in 2 places:
- in 'struct btrfs_fs_info::thread_pool_size'
- in each worker struct: 'struct btrfs_workers::max_workers'
'mount -oremount' updated only 'btrfs_fs_info::thread_pool_size'. Fix it by pushing the new maximum value to all created worker structures as well. Cc: Josef Bacik <josef@redhat.com> Cc: Chris Mason <chris.mason@oracle.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
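A simplified model of the fix: the limit lives in two places, so remount has to update both. The structures below are hypothetical stand-ins carrying only the relevant fields, and for simplicity every pool gets the same new limit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for btrfs_workers and btrfs_fs_info. */
struct workers {
	int max_workers;
};

struct fs_info {
	int thread_pool_size;
	struct workers *pools[4];
	size_t nr_pools;
};

/* On remount, update the fs-wide value AND every already-created pool;
 * updating only thread_pool_size leaves the pools on the old limit. */
static void resize_thread_pool(struct fs_info *fs, int new_size)
{
	fs->thread_pool_size = new_size;
	for (size_t i = 0; i < fs->nr_pools; i++)
		fs->pools[i]->max_workers = new_size;
}
```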
2012-05-30 | Btrfs: do not do filemap_write_and_wait_range in fsync | Josef Bacik
We already call btrfs_wait_ordered_range, which does this for us, so just remove this call so we don't do it twice. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 | Btrfs: remove useless waiting and extra filemap work | Josef Bacik
In btrfs_wait_ordered_range we have been calling filemap_fdatawrite() twice, because compression does strange things, and then waiting. Then we look up ordered extents, and if we find any we always call schedule_timeout(1) once and then loop back around and do it all again. We even check to see if there are delalloc pages in this range and loop again. So this patch gets rid of the multiple filemap_fdatawrite() calls and just does filemap_write_and_wait(). In the case of compression we will still find the ordered extents and start those individually if we need to, so that is ok, but in the normal buffered case we avoid all this weird overhead. As for the schedule_timeout(1), we don't need it. All callers either
1) don't care; they just want to make sure what they just wrote makes it to disk, or
2) are doing the lock()->lookup ordered->unlock->flush thing, in which case they will lock and check for ordered extents _anyway_, so get back to them as quickly as possible.
The delalloc check is simply not needed; it only catches the case where we write to the file again after doing the filemap_write_and_wait(), and if the caller truly cares about that it will take care of everything itself. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 | Btrfs: fix compile warnings in extent_io.c | Josef Bacik
These warnings are bogus since we will always have at least one page in an eb, but to make the compiler happy just set ret = 0 in these two cases. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 | Btrfs: cache no acl on new inodes | Josef Bacik
When running compilebench I noticed we were spending some time looking up ACLs on new inodes, which shouldn't be happening since there were no ACLs. This is because when we init ACLs on an inode after creating it, we don't cache the fact that there are no ACLs if there aren't any. Doing this gives my compilebench runs a small boost. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
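Caching the negative result needs a third state besides "has an ACL" and "has none". The sketch assumes a sentinel in the spirit of the kernel's ACL_NOT_CACHED; `struct inode_sketch`, `get_acl()` and `init_new_inode()` are hypothetical, and the on-disk lookup is only simulated by a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Three cache states: ACL_UNKNOWN means "must look it up", NULL means
 * "known to have no ACL", anything else would be a real ACL pointer. */
struct acl;
#define ACL_UNKNOWN ((struct acl *)(intptr_t)-1)

struct inode_sketch {
	struct acl *cached_acl;
	int lookups;		/* counts simulated on-disk lookups */
};

static struct acl *get_acl(struct inode_sketch *inode)
{
	if (inode->cached_acl != ACL_UNKNOWN)
		return inode->cached_acl;	/* hit, including "no ACL" */
	inode->lookups++;			/* slow path: read xattrs */
	inode->cached_acl = NULL;		/* pretend none was found */
	return NULL;
}

/* At create time we already know there is no ACL, so cache that fact
 * instead of leaving the inode in the "unknown" state. */
static void init_new_inode(struct inode_sketch *inode)
{
	inode->cached_acl = NULL;
	inode->lookups = 0;
}
```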
2012-05-30 | Btrfs: use i_version instead of our own sequence | Josef Bacik
We've been keeping the inode sequence number around in hopes that somebody would use it, but nobody does, and people actually use i_version, which serves the same purpose. So use i_version where we used the in-core inode's sequence number; that way the sequence is updated properly across the board, not just in file write. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 | Btrfs: tree mod log sanity checks in join_transaction | Jan Schmidt
When a fresh transaction begins, the tree mod log must be clean. Users of the tree modification log must ensure they never span across transaction boundaries. We reset the sequence to 0 in this safe situation to make absolutely sure overflow can't happen. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-30 | Btrfs: fs_info variable for join_transaction | Jan Schmidt
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-30 | Btrfs: use the tree modification log for backref resolving | Jan Schmidt
This enables backref resolving on live trees while they are changing. This is a prerequisite for quota groups and just nice to have for everything else. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-30 | Btrfs: add btrfs_search_old_slot | Jan Schmidt
The tree modification log together with the current state of the tree gives a consistent, old version of the tree. btrfs_search_old_slot is used to search through this old version and return old (dummy!) extent buffers. Naturally, this function cannot do any tree modifications. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-30 | Btrfs: add del_ptr and insert_ptr modifications to the tree mod log | Jan Schmidt
Record all relevant modifications to block pointers in the tree mod log so that we can rewind them later on for backref walking. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-30 | Btrfs: put all block modifications into the tree mod log | Jan Schmidt
When running functions that can make changes to the internal trees (e.g. btrfs_search_slot), we check if somebody may be interested in the block we're currently modifying. If so, we record our modification to be able to rewind it later on. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-30 | Btrfs: add tree modification log functions | Jan Schmidt
The tree mod log will log modifications made to fs-tree nodes. Most modifications are done by autobalance of the tree. Such changes are recorded as long as a block entry exists. When released, the log is cleaned. With the tree modification log, it's possible to reconstruct a consistent old state of the tree. This is required to do backref walking on a busy file system. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-30 | Merge branch 'x86/mce' into x86/urgent | Ingo Molnar
Merge in these fixlets. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 | sched: Remove NULL assignment of dattr_cur | Kamalesh Babulal
Remove explicit NULL assignment of static pointer dattr_cur from init_sched_domains(). Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120523091411.GG5005@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 | sched: Remove the last NULL entry from sched_feat_names | Hiroshi Shimamoto
No need to have the last NULL entry. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4FBF29E7.5020805@ct.jp.nec.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 | sched: Make sched_feat_names const | Hiroshi Shimamoto
The strings sched_feat_names are never changed. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4FBF29B2.9030904@ct.jp.nec.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 | sched/rt: Fix SCHED_RR across cgroups | Colin Cross
task_tick_rt() has an optimization to skip rescheduling SCHED_RR tasks if they are the only element on their rq. However, with cgroups a SCHED_RR task could be the only element on its per-cgroup rq but still be competing with other SCHED_RR tasks in its parent's cgroup. In this case, the SCHED_RR task in the child cgroup would never yield at the end of its timeslice. If the child cgroup rt_runtime_us was the same as the parent cgroup rt_runtime_us, the task in the parent cgroup would starve completely. Modify task_tick_rt() to check that the task is the only task on its rq, and that each of the scheduling entities of its ancestors is also the only entity on its rq. Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1337229266-15798-1-git-send-email-ccross@android.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
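A simplified model of the new check, walking from the task's entity up through its group ancestors. The structures are hypothetical, and "alone on its rq" is approximated by a per-queue entity count rather than the kernel's run-list test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical group-scheduling model: each entity sits on one queue and
 * may have a parent entity representing its cgroup one level up. */
struct rq_sketch {
	int nr_entities;
};

struct sched_entity_sketch {
	struct rq_sketch *rq;			/* queue it is on */
	struct sched_entity_sketch *parent;	/* NULL at top level */
};

/* The tick may skip the end-of-slice requeue only if the task is alone
 * on its queue AND every ancestor is alone on its queue; checking only
 * the task's own per-cgroup queue was the bug. */
static bool rr_alone(const struct sched_entity_sketch *se)
{
	for (; se; se = se->parent)
		if (se->rq->nr_entities != 1)
			return false;
	return true;
}
```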
2012-05-30 | sched: Move nr_cpus_allowed out of 'struct sched_rt_entity' | Peter Zijlstra
Since nr_cpus_allowed is used outside of sched/rt.c, and is wanted in more places outside it, move it to a more natural site. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-kr61f02y9brwzkh6x53pdptm@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 | sched: Make sure to not re-read variables after validation | Peter Zijlstra
We could re-read rq->rt_avg after we validated that it was smaller than total, invalidating the check and resulting in an unintended negative value. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
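The pattern is to snapshot the shared value once and reason only about the snapshot. This sketch uses a relaxed C11 atomic load as a userspace stand-in for the kernel's ACCESS_ONCE(); `available_time()` is a hypothetical helper, not the scheduler function itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* rt_avg may be updated concurrently by another CPU. Read it exactly
 * once into a local, then use only that local for both the check and
 * the subtraction, so the validated value is the value that is used. */
static uint64_t available_time(uint64_t total, _Atomic uint64_t *rt_avg)
{
	uint64_t avg = atomic_load_explicit(rt_avg, memory_order_relaxed);

	if (total < avg)
		return 0;		/* never wrap around to "negative" */
	return total - avg;
}
```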
2012-05-30 | sched: Fix SD_OVERLAP | Peter Zijlstra
SD_OVERLAP exists to allow overlapping groups; overlapping groups appear in NUMA topologies that aren't fully connected. The typical result of not fully connected NUMA is that each cpu (or rather node) will have different spans for a particular distance. However, due to how sched domains are traversed (only the first cpu in the mask goes one level up), the next level only cares about the spans of the cpus that went up. Due to this, two things were observed to be broken:
- build_overlap_sched_groups(): since it's possible the cpu we're building the groups for exists in multiple (or all) groups, the selection criteria of the first group didn't ensure there was a cpu for which it was true that cpumask_first(span) == cpu. Thus load-balancing would terminate.
- update_group_power(): assumed that the cpu span of the first group of the domain was covered by all groups of the child domain. The above explains why this isn't true, so deal with it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/1337788843.9783.14.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 | sched: Don't try allocating memory from offline nodes | Peter Zijlstra
Allocators don't appreciate it when you try and allocate memory from offline nodes. Reported-and-tested-by: Tony Luck <tony.luck@intel.com> Reported-and-tested-by: Anton Blanchard <anton@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-epfc1io9whb7o22bcujf31vn@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 | sched/nohz: Fix rq->cpu_load calculations some more | Peter Zijlstra
Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Cc: Venkatesh Pallipadi <venki@google.com> Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 | Merge branches 'iommu/fixes', 'dma-debug', 'arm/omap', 'arm/tegra', 'core' and 'x86/amd' into next | Joerg Roedel
2012-05-30 | Documentation: kernel-parameters.txt Add amd_iommu_dump | Shuah Khan
Add amd_iommu_dump to kernel-parameters.txt Signed-off-by: Shuah Khan <shuahkhan@gmail.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2012-05-30 | ASoC: fsi: bugfix: ensure dma is terminated | Kuninori Morimoto
The FSI DMAEngine has to be reliably stopped at start/stop time. Without this patch, playback will include noise. Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2012-05-30 | ASoC: fsi: bugfix: correct dma area | Kuninori Morimoto
The FSI driver uses dma_sync_single_xxx(), but the DMA area it passed was not correct. This patch fixes it. Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2012-05-30 | [SCSI] fix scsi_wait_scan | James Bottomley
Commit c751085943362143f84346d274e0011419c84202
    Author: Rafael J. Wysocki <rjw@sisk.pl>
    Date:   Sun Apr 12 20:06:56 2009 +0200
        PM/Hibernate: Wait for SCSI devices scan to complete during resume
broke the scsi_wait_scan module in 2.6.30. Apparently Debian still uses it, so fix it and backport to stable before removing it in 3.6. The breakage is caused because the function template in include/scsi/scsi_scan.h is defined to be a nop unless SCSI is built in. That means that in the modular case (which is every distro), the scsi_wait_scan module does a simple async_synchronize_full() instead of waiting for scans. Cc: <stable@vger.kernel.org> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2012-05-30 | ASoC: fsi: bugfix: enable master clock control on DMA stream | Kuninori Morimoto
The DMA stream handler didn't take care of the master clock. This patch fixes it up. Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2012-05-30 | [SCSI] fix async probe regression | Dan Williams
Commit a7a20d1 "[SCSI] sd: limit the scope of the async probe domain" moved sd probe work out of reach of wait_for_device_probe(). Allow it to be synced via scsi_complete_async_scans(). Reported-by: Meelis Roos <mroos@linux.ee> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2012-05-30 | [SCSI] be2iscsi: fix dma free size mismatch regression | Mike Christie
This patch should go into 3.5 fixes. The bug was added in the patches for the 3.5 feature window. As you can see from the patch I made a mistake. During development I switched from passing a struct to the size of the struct, but left the sizeof. This results in us allocating 4 bytes (sizeof(int)) but then calling pci_free_consistent with the size of the struct. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
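The bug class in miniature: once a function receives a size rather than an object, applying sizeof() to the parameter measures the parameter's own type. Both functions and `struct mgmt_req` are hypothetical illustrations, not be2iscsi code.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative request structure; not the be2iscsi definition. */
struct mgmt_req {
	char payload[256];
};

/* Buggy pattern: the caller already passes the size, but the callee
 * still applies sizeof() to it, which yields sizeof(int). */
static size_t alloc_len_buggy(int size)
{
	return sizeof(size);
}

/* Fixed pattern: use the value itself, so the length used to allocate
 * matches the length later passed to the free routine. */
static size_t alloc_len_fixed(int size)
{
	return (size_t)size;
}
```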
2012-05-30 | sched/x86: Use cpu_llc_shared_mask(cpu) for coregroup_mask | Peter Zijlstra
Commit 8e7fbcbc2 ("sched: Remove stale power aware scheduling remnants and dysfunctional knobs") made a boo-boo when removing the power aware scheduling muck from the x86 topology bits. We should unconditionally use the llc_shared mask for multi-core. Reported-and-tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Borislav Petkov <bp@amd64.org> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Link: http://lkml.kernel.org/n/tip-lsksc2kfyeveb13avh327p0d@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 | [SCSI] qla4xxx: Update driver version to 5.02.00-k17 | Vikas Chaudhary
Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2012-05-30 | [SCSI] qla4xxx: Capture minidump for ISP82XX on firmware failure | Tej Parkash
Add support for capturing a dump (minidump), which gives us a snapshot of the firmware/hardware state at the time of firmware failure. Signed-off-by: Tej Parkash <tej.parkash@qlogic.com> Signed-off-by: Shyam Sundar <shyam.sundar@qlogic.com> Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2012-05-30 | [SCSI] qla4xxx: Add change_queue_depth API support | Tej Parkash
change_queue_depth will adjust the device queue depth upon receiving a "SAM_STAT_TASK_SET_FULL" SCSI status from the target. Also add a ql4xqfulltracking command line parameter to enable or disable queue-full tracking. Queue-full tracking can be disabled to ensure that a user-set SCSI device queue depth is not altered. Signed-off-by: Tej Parkash <tej.parkash@qlogic.com> Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2012-05-30 | Merge branch 'linus' into perf/urgent | Ingo Molnar
Merge back Linus's latest branch so that we pick up the uprobes changes. ( I tested this branch locally and while it's one from the middle of the merge window it's a good one to base further work off. ) Signed-off-by: Ingo Molnar <mingo@kernel.org>