Age | Commit message (Collapse) | Author |
|
Pull xfs update from Ben Myers:
"There is plenty going on, including the cleanup of xfssyncd, metadata
verifiers, CRC infrastructure for the log, tracking of inodes with
speculative allocation, a cleanup of xfs_fs_subr.c, fixes for
XFS_IOC_ZERO_RANGE, and important fix related to log replay (only
update the last_sync_lsn when a transaction completes), a fix for
deadlock on AGF buffers, documentation and comment updates, and a few
more cleanups and fixes.
Details:
- remove the xfssyncd mess
- only update the last_sync_lsn when a transaction completes
- zero allocation_args on the kernel stack
- fix AGF/alloc workqueue deadlock
- silence uninitialised f.file warning
- Update inode alloc comments
- Update mount options documentation
- report projid32bit feature in geometry call
- speculative preallocation inode tracking
- fix attr tree double split corruption
- fix broken error handling in xfs_vm_writepage
- drop buffer io reference when a bad bio is built
- add more attribute tree trace points
- growfs infrastructure changes for 3.8
- fs/xfs/xfs_fs_subr.c die die die
- add CRC infrastructure
- add CRC checks to the log
- Remove description of nodelaylog mount option from xfs.txt
- inode allocation should use unmapped buffers
- byte range granularity for XFS_IOC_ZERO_RANGE
- fix direct IO nested transaction deadlock
- fix stray dquot unlock when reclaiming dquots
- fix sparse reported log CRC endian issue"
Fix up trivial conflict in fs/xfs/xfs_fsops.c due to the same patch
having been applied twice (commits eaef854335ce and 1375cb65e87b: "xfs:
growfs: don't read garbage for new secondary superblocks") with later
updates to the affected code in the XFS tree.
* tag 'for-linus-v3.8-rc1' of git://oss.sgi.com/xfs/xfs: (78 commits)
xfs: fix sparse reported log CRC endian issue
xfs: fix stray dquot unlock when reclaiming dquots
xfs: fix direct IO nested transaction deadlock.
xfs: byte range granularity for XFS_IOC_ZERO_RANGE
xfs: inode allocation should use unmapped buffers.
xfs: Remove the description of nodelaylog mount option from xfs.txt
xfs: add CRC checks to the log
xfs: add CRC infrastructure
xfs: convert buffer verifiers to an ops structure.
xfs: connect up write verifiers to new buffers
xfs: add pre-write metadata buffer verifier callbacks
xfs: add buffer pre-write callback
xfs: Add verifiers to dir2 data readahead.
xfs: add xfs_da_node verification
xfs: factor and verify attr leaf reads
xfs: factor dir2 leaf read
xfs: factor out dir2 data block reading
xfs: factor dir2 free block reading
xfs: verify dir2 block format buffers
xfs: factor dir2 block read operations
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set fixes some conditions in which value blocks are invalidated,
and includes two trivial cleanups."
* tag 'dlm-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: fix lvb invalidation conditions
fs/dlm: remove CONFIG_EXPERIMENTAL
dlm: remove unused variable in *dlm_lowcomms_get_buffer()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore fixes from Tony Luck:
"Patch series to allow EFI variable backend to pstore to hold multiple
records."
* tag 'please-pull-pstore_mevent' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
efi_pstore: Add a format check for an existing variable name at erasing time
efi_pstore: Add a format check for an existing variable name at reading time
efi_pstore: Add a sequence counter to a variable name
efi_pstore: Add ctime to argument of erase callback
efi_pstore: Remove a logic erasing entries from a write callback to hold multiple logs
efi_pstore: Add a logic erasing entries to an erase callback
efi_pstore: Check remaining space with QueryVariableInfo() before writing data
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The biggest change affects group scheduling: we now track the runnable
average on a per-task entity basis, allowing a smoother, exponential
decay average based load/weight estimation instead of the previous
binary on-the-runqueue/off-the-runqueue load weight method.
This will inevitably disturb workloads that were in some sort of
borderline balancing state or unstable equilibrium, so an eye has to
be kept on regressions.
For that reason the new load average is only limited to group
scheduling (shares distribution) at the moment (which was also hurting
the most from the prior, crude weight calculation and whose scheduling
quality wins most from this change) - but we plan to extend this to
regular SMP balancing as well in the future, which will simplify and
speed up things a bit.
Other changes involve ongoing preparatory work to extend NOHZ to the
scheduler as well, eventually allowing completely irq-free user-space
execution."
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
Revert "sched/autogroup: Fix crash on reboot when autogroup is disabled"
cputime: Comment cputime's adjusting code
cputime: Consolidate cputime adjustment code
cputime: Rename thread_group_times to thread_group_cputime_adjusted
cputime: Move thread_group_cputime() to sched code
vtime: Warn if irqs aren't disabled on system time accounting APIs
vtime: No need to disable irqs on vtime_account()
vtime: Consolidate a bit the ctx switch code
vtime: Explicitly account pending user time on process tick
vtime: Remove the underscore prefix invasion
sched/autogroup: Fix crash on reboot when autogroup is disabled
cputime: Separate irqtime accounting from generic vtime
cputime: Specialize irq vtime hooks
kvm: Directly account vtime to system on guest switch
vtime: Make vtime_account_system() irqsafe
vtime: Gather vtime declarations to their own header file
sched: Describe CFS load-balancer
sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
sched: Make __update_entity_runnable_avg() fast
sched: Update_cfs_shares at period edge
...
|
|
Merge misc updates from Andrew Morton:
"About half of most of MM. Going very early this time due to
uncertainty over the coreautounifiednumasched things. I'll send the
other half of most of MM tomorrow. The rest of MM awaits a slab merge
from Pekka."
* emailed patches from Andrew Morton: (71 commits)
memory_hotplug: ensure every online node has NORMAL memory
memory_hotplug: handle empty zone when online_movable/online_kernel
mm, memory-hotplug: dynamic configure movable memory and portion memory
drivers/base/node.c: cleanup node_state_attr[]
bootmem: fix wrong call parameter for free_bootmem()
avr32, kconfig: remove HAVE_ARCH_BOOTMEM
mm: cma: remove watermark hacks
mm: cma: skip watermarks check for already isolated blocks in split_free_page()
mm, oom: fix race when specifying a thread as the oom origin
mm, oom: change type of oom_score_adj to short
mm: cleanup register_node()
mm, mempolicy: remove duplicate code
mm/vmscan.c: try_to_freeze() returns boolean
mm: introduce putback_movable_pages()
virtio_balloon: introduce migration primitives to balloon pages
mm: introduce compaction and migration for ballooned pages
mm: introduce a common interface for balloon pages mobility
mm: redefine address_space.assoc_mapping
mm: adjust address_space_operations.migratepage() return code
arch/sparc/kernel/sys_sparc_64.c: s/COLOUR/COLOR/
...
|
|
The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
so this range can be represented by the signed short type with no
functional change. The extra space this frees up in struct signal_struct
will be used for per-thread oom kill flags in the next patch.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Overhaul struct address_space.assoc_mapping renaming it to
address_space.private_data and its type is redefined to void*. By this
approach we consistently name the .private_* elements from struct
address_space as well as allow extended usage for address_space
association with other data structures through ->private_data.
Also, all users of old ->assoc_mapping element are converted to reflect
its new name and type change (->private_data).
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a
guest, thus imposing performance penalties associated with the reduced
number of transparent huge pages that could be used by the guest workload.
This patch-set follows the main idea discussed at 2012 LSFMMS session:
"Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/
to introduce the required changes to the virtio_balloon driver, as well as
the changes to the core compaction & migration bits, in order to make
those subsystems aware of ballooned pages and allow memory balloon pages
become movable within a guest, thus avoiding the aforementioned
fragmentation issue
Following are numbers that prove this patch benefits on allowing
compaction to be more effective at memory ballooned guests.
Results for STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite,
running on a 4gB RAM KVM guest which was ballooning 512mB RAM in 64mB
chunks, at every minute (inflating/deflating), while test was running:
===BEGIN stress-highalloc
STRESS-HIGHALLOC
highalloc-3.7 highalloc-3.7
rc4-clean rc4-patch
Pass 1 55.00 ( 0.00%) 62.00 ( 7.00%)
Pass 2 54.00 ( 0.00%) 62.00 ( 8.00%)
while Rested 75.00 ( 0.00%) 80.00 ( 5.00%)
MMTests Statistics: duration
3.7 3.7
rc4-clean rc4-patch
User 1207.59 1207.46
System 1300.55 1299.61
Elapsed 2273.72 2157.06
MMTests Statistics: vmstat
3.7 3.7
rc4-clean rc4-patch
Page Ins 3581516 2374368
Page Outs 11148692 10410332
Swap Ins 80 47
Swap Outs 3641 476
Direct pages scanned 37978 33826
Kswapd pages scanned 1828245 1342869
Kswapd pages reclaimed 1710236 1304099
Direct pages reclaimed 32207 31005
Kswapd efficiency 93% 97%
Kswapd velocity 804.077 622.546
Direct efficiency 84% 91%
Direct velocity 16.703 15.682
Percentage direct scans 2% 2%
Page writes by reclaim 79252 9704
Page writes file 75611 9228
Page writes anon 3641 476
Page reclaim immediate 16764 11014
Page rescued immediate 0 0
Slabs scanned 2171904 2152448
Direct inode steals 385 2261
Kswapd inode steals 659137 609670
Kswapd skipped wait 1 69
THP fault alloc 546 631
THP collapse alloc 361 339
THP splits 259 263
THP fault fallback 98 50
THP collapse fail 20 17
Compaction stalls 747 499
Compaction success 244 145
Compaction failures 503 354
Compaction pages moved 370888 474837
Compaction move failure 77378 65259
===END stress-highalloc
This patch:
Introduce MIGRATEPAGE_SUCCESS as the default return code for
address_space_operations.migratepage() method and documents the expected
return code for the same method in failure cases.
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Update the hugetlb_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There was some desire in large applications using MAP_HUGETLB or
SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
others. This is useful together with NUMA policy: use 2MB interleaving
on some mappings, but 1GB on local mappings.
This patch extends the IPC/SHM syscall interfaces slightly to allow
specifying the page size.
It borrows some upper bits in the existing flag arguments and allows
encoding the log of the desired page size in addition to the *_HUGETLB
flag. When 0 is specified the default size is used, this makes the
change fully compatible.
Extending the internal hugetlb code to handle this is straight forward.
Instead of a single mount it just keeps an array of them and selects the
right mount based on the specified page size. When no page size is
specified it uses the mount of the default page size.
The change is not visible in /proc/mounts because internal mounts don't
appear there. It also has very little overhead: the additional mounts
just consume a super block, but not more memory when not used.
I also exported the new flags to the user headers (they were previously
under __KERNEL__). Right now only symbols for x86 and some other
architecture for 1GB and 2MB are defined. The interface should already
work for all other architectures though. Only architectures that define
multiple hugetlb sizes actually need it (that is currently x86, tile,
powerpc). However tile and powerpc have user configurable hugetlb
sizes, so it's not easy to add defines. A program on those
architectures would need to query sysfs and use the appropiate log2.
[akpm@linux-foundation.org: cleanups]
[rientjes@google.com: fix build]
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There is no reason to pass the nr_pages_dirtied argument, because
nr_pages_dirtied value from the caller is unused in
balance_dirty_pages_ratelimited_nr().
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Vivek Trivedi <vtrivedi018@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull TTY/Serial merge from Greg Kroah-Hartman:
"Here's the big tty/serial tree set of changes for 3.8-rc1.
Contained in here is a bunch more reworks of the tty port layer from
Jiri and bugfixes from Alan, along with a number of other tty and
serial driver updates by the various driver authors.
Also, Jiri has been coerced^Wconvinced to be the co-maintainer of the
TTY layer, which is much appreciated by me.
All of these have been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
Fixed up some trivial conflicts in the staging tree, due to the fwserial
driver having come in both ways (but fixed up a bit in the serial tree),
and the ioctl handling in the dgrp driver having been done slightly
differently (staging tree got that one right, and removed both
TIOCGSOFTCAR and TIOCSSOFTCAR).
* tag 'tty-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (146 commits)
staging: sb105x: fix potential NULL pointer dereference in mp_chars_in_buffer()
staging/fwserial: Remove superfluous free
staging/fwserial: Use WARN_ONCE when port table is corrupted
staging/fwserial: Destruct embedded tty_port on teardown
staging/fwserial: Fix build breakage when !CONFIG_BUG
staging: fwserial: Add TTY-over-Firewire serial driver
drivers/tty/serial/serial_core.c: clean up HIGH_BITS_OFFSET usage
staging: dgrp: dgrp_tty.c: Audit the return values of get/put_user()
staging: dgrp: dgrp_tty.c: Remove the TIOCSSOFTCAR ioctl handler from dgrp driver
serial: ifx6x60: Add modem power off function in the platform reboot process
serial: mxs-auart: unmap the scatter list before we copy the data
serial: mxs-auart: disable the Receive Timeout Interrupt when DMA is enabled
serial: max310x: Setup missing "can_sleep" field for GPIO
tty/serial: fix ifx6x60.c declaration warning
serial: samsung: add devicetree properties for non-Exynos SoCs
serial: samsung: fix potential soft lockup during uart write
tty: vt: Remove redundant null check before kfree.
tty/8250 Add check for pci_ioremap_bar failure
tty/8250 Add support for Commtech's Fastcom Async-335 and Fastcom Async-PCIe cards
tty/8250 Add XR17D15x devices to the exar_handle_irq override
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg Kroah-Hartman:
"Here's the large driver core updates for 3.8-rc1.
The biggest thing here is the various __dev* marking removals. This
is going to be a pain for the merge with different subsystem trees, I
know, but all of the patches included here have been ACKed by their
various subsystem maintainers, as they wanted them to go through here.
If this is too much of a pain, I can pull all of them out of this tree
and just send you one with the other fixes/updates and then, after
3.8-rc1 is out, do the rest of the removals to ensure we catch them
all, it's up to you. The merges should all be trivial, and Stephen
has been doing them all in linux-next for a few weeks now quite
easily.
Other than the __dev* marking removals, there's nothing major here,
some firmware loading updates and other minor things in the driver
core.
All of these have (much to Stephen's annoyance), been in linux-next
for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
Fixed up trivial conflicts in drivers/gpio/gpio-{em,stmpe}.c due to gpio
update.
* tag 'driver-core-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (93 commits)
modpost.c: Stop checking __dev* section mismatches
init.h: Remove __dev* sections from the kernel
acpi: remove use of __devinit
PCI: Remove __dev* markings
PCI: Always build setup-bus when PCI is enabled
PCI: Move pci_uevent into pci-driver.c
PCI: Remove CONFIG_HOTPLUG ifdefs
unicore32/PCI: Remove CONFIG_HOTPLUG ifdefs
sh/PCI: Remove CONFIG_HOTPLUG ifdefs
powerpc/PCI: Remove CONFIG_HOTPLUG ifdefs
mips/PCI: Remove CONFIG_HOTPLUG ifdefs
microblaze/PCI: Remove CONFIG_HOTPLUG ifdefs
dma: remove use of __devinit
dma: remove use of __devexit_p
firewire: remove use of __devinitdata
firewire: remove use of __devinit
leds: remove use of __devexit
leds: remove use of __devinit
leds: remove use of __devexit_p
mmc: remove use of __devexit
...
|
|
We were mistakenly returning EINTR when we found an outstanding signal.
Instead we should returen ERESTARTSYS and allow the kernel to handle
things the right way.
Patch-from: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
failed
In inotify_ignored_and_remove_idr() the removal of a watch descriptor is skipped
if the allocation of an ignored event failed and we are leaking memory (the
watch descriptor and the mark linked to it).
This patch ensures that the watch descriptor is removed regardless of whether
event creation failed or not.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
Boyd Yang reported a problem for the case that multiple threads of the same
thread group are waiting for a reponse for a permission event.
In this case it is possible that some of the threads are never woken up, even
if the response for the event has been received
(see http://marc.info/?l=linux-kernel&m=131822913806350&w=2).
The reason is that we are currently merging permission events if they belong to
the same thread group. But we are not prepared to wake up more than one waiter
for each event. We do
wait_event(group->fanotify_data.access_waitq, event->response ||
atomic_read(&group->fanotify_data.bypass_perm));
and after that
event->response = 0;
which is the reason that even if we woke up all waiters for the same event
some of them may see event->response being already set 0 again, then go back to
sleep and block forever.
With this patch we avoid that more than one thread is waiting for a response
by not merging permission events for the same thread group any more.
Reported-by: Boyd Yang <boyd.yang@gmail.com>
Signed-off-by: Lino Sanfilippo <LinoSanfilipp@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
inotify is supposed to support async signal notification when information
is available on the inotify fd. This patch moves that support to generic
fsnotify functions so it can be used by all notification mechanisms.
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
On Mon, Aug 01, 2011 at 04:38:22PM -0400, Eric Paris wrote:
>
> I finally built and tested a v3.0 kernel with these patches (I know I'm
> SOOOOOO far behind). Not what I hoped for:
>
> > [ 150.937798] VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds. Have a nice day...
> > [ 150.945290] BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
> > [ 150.946012] IP: [<ffffffff810ffd58>] shmem_free_inode+0x18/0x50
> > [ 150.946012] PGD 2bf9e067 PUD 2bf9f067 PMD 0
> > [ 150.946012] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > [ 150.946012] CPU 0
> > [ 150.946012] Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ext4 jbd2 crc16 joydev ata_piix i2c_piix4 pcspkr uinput ipv6 autofs4 usbhid [last unloaded: scsi_wait_scan]
> > [ 150.946012]
> > [ 150.946012] Pid: 2764, comm: syscall_thrash Not tainted 3.0.0+ #1 Red Hat KVM
> > [ 150.946012] RIP: 0010:[<ffffffff810ffd58>] [<ffffffff810ffd58>] shmem_free_inode+0x18/0x50
> > [ 150.946012] RSP: 0018:ffff88002c2e5df8 EFLAGS: 00010282
> > [ 150.946012] RAX: 000000004e370d9f RBX: 0000000000000000 RCX: ffff88003a029438
> > [ 150.946012] RDX: 0000000033630a5f RSI: 0000000000000000 RDI: ffff88003491c240
> > [ 150.946012] RBP: ffff88002c2e5e08 R08: 0000000000000000 R09: 0000000000000000
> > [ 150.946012] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003a029428
> > [ 150.946012] R13: ffff88003a029428 R14: ffff88003a029428 R15: ffff88003499a610
> > [ 150.946012] FS: 00007f5a05420700(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
> > [ 150.946012] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [ 150.946012] CR2: 0000000000000070 CR3: 000000002a662000 CR4: 00000000000006f0
> > [ 150.946012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 150.946012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [ 150.946012] Process syscall_thrash (pid: 2764, threadinfo ffff88002c2e4000, task ffff88002bfbc760)
> > [ 150.946012] Stack:
> > [ 150.946012] ffff88003a029438 ffff88003a029428 ffff88002c2e5e38 ffffffff81102f76
> > [ 150.946012] ffff88003a029438 ffff88003a029598 ffffffff8160f9c0 ffff88002c221250
> > [ 150.946012] ffff88002c2e5e68 ffffffff8115e9be ffff88002c2e5e68 ffff88003a029438
> > [ 150.946012] Call Trace:
> > [ 150.946012] [<ffffffff81102f76>] shmem_evict_inode+0x76/0x130
> > [ 150.946012] [<ffffffff8115e9be>] evict+0x7e/0x170
> > [ 150.946012] [<ffffffff8115ee40>] iput_final+0xd0/0x190
> > [ 150.946012] [<ffffffff8115ef33>] iput+0x33/0x40
> > [ 150.946012] [<ffffffff81180205>] fsnotify_destroy_mark_locked+0x145/0x160
> > [ 150.946012] [<ffffffff81180316>] fsnotify_destroy_mark+0x36/0x50
> > [ 150.946012] [<ffffffff81181937>] sys_inotify_rm_watch+0x77/0xd0
> > [ 150.946012] [<ffffffff815aca52>] system_call_fastpath+0x16/0x1b
> > [ 150.946012] Code: 67 4a 00 b8 e4 ff ff ff eb aa 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 8b 9f 40 05 00 00
> > [ 150.946012] 83 7b 70 00 74 1c 4c 8d a3 80 00 00 00 4c 89 e7 e8 d2 5d 4a
> > [ 150.946012] RIP [<ffffffff810ffd58>] shmem_free_inode+0x18/0x50
> > [ 150.946012] RSP <ffff88002c2e5df8>
> > [ 150.946012] CR2: 0000000000000070
>
> Looks at aweful lot like the problem from:
> http://www.spinics.net/lists/linux-fsdevel/msg46101.html
>
I tried to reproduce this bug with your test program, but without success.
However, if I understand correctly, this occurs since we dont hold any locks when
we call iput() in mark_destroy(), right?
With the patches you tested, iput() is also not called within any lock, since the
groups mark_mutex is released temporarily before iput() is called. This is, since
the original codes behaviour is similar.
However since we now have a mutex as the biggest lock, we can do what you
suggested (http://www.spinics.net/lists/linux-fsdevel/msg46107.html) and
call iput() with the mutex held to avoid the race.
The patch below implements this. It uses nested locking to avoid deadlock in case
we do the final iput() on an inode which still holds marks and thus would take
the mutex again when calling fsnotify_inode_delete() in destroy_inode().
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
In clear_marks_by_group_flags() the mark list of a group is iterated and the
marks are put on a temporary list.
Since we introduced fsnotify_destroy_mark_locked() we dont need the temp list
any more and are able to remove the marks while the mark list is iterated and
the mark list mutex is held.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
fsnotify_remove_mark()
This patch introduces fsnotify_add_mark_locked() and fsnotify_remove_mark_locked()
which are essentially the same as fsnotify_add_mark() and fsnotify_remove_mark() but
assume that the caller has already taken the groups mark mutex.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
In fsnotify_destroy_mark() dont get the group from the passed mark anymore,
but pass the group itself as an additional parameter to the function.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
Though the process of the ordered extents is a bit different with the delalloc inode
flush, but we can see it as a subset of the delalloc inode flush, so we also handle
them by flush workers.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The process of the ordered operations is similar to the delalloc inode flush, so
we handle them by flush workers.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
This patch introduce a new worker pool named "flush_workers", and if we
want to force all the inode with pending delalloc to the disks, we can
queue those inodes into the work queue of the worker pool, in this way,
those inodes will be flushed by multi-task.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Dave gave me an image of a very full file system that would abort the
transaction because it ran out of space while committing the transaction.
This is because we would think there was plenty of room to create a snapshot
even though the global reserve was not full. This happens because we
calculate the global reserve size before we unpin any space, so after we
unpin the space we allow reservations to occur even though we haven't
reserved all of the space for our global reserve. Fix this by adding to the
global reserve while unpinning in order to make sure we always have enough
space to do our work. With this patch we no longer end up with an aborted
transaction, we return ENOSPC properly to the person trying to create the
snapshot. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
'disk_key' is not used at all.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The argument 'tree_mod_log' is not necessary since all of callers enable it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Since we don't use MOD_LOG_KEY_REMOVE_WHILE_MOVING to add nritems
during rewinding, we should insert a MOD_LOG_KEY_REMOVE operation first.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Key MOD_LOG_KEY_REMOVE_WHILE_MOVING means that we're doing memmove inside
an extent buffer node, and the node's number of items remains unchanged
(unless we are inserting a single pointer, but we have MOD_LOG_KEY_ADD for that).
So we don't need to increase node's number of items during rewinding,
otherwise we may get an node larger than leafsize and cause general protection
errors later.
Here is the details,
- If we do memory move for inserting a single pointer, we need to
add node's nritems by one, and we honor MOD_LOG_KEY_ADD for adding.
- If we do memory move for deleting a single pointer, we need to
decrease node's nritems by one, and we honor MOD_LOG_KEY_REMOVE for
deleting.
- If we do memory move for balance left/right, we need to decrease
node's nritems, and we honor MOD_LOG_KEY_REMOVE for balaning.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
When we find a bitmap free space entry, we may check the previous extent
entry covers the offset or not. But if we find this entry is also a bitmap
entry, we will continue to check the previous entry of the current one by
a while loop. It is unnecessary because it is impossible that the extent
entry which is in front of a bitmap entry can cover the offset of the entry
after that bitmap entry.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Alex reported a problem where we were writing between chunks on a rbd
device. The thing is we do bio_add_page using logical offsets, but the
physical offset may be different. So when we map the bio now check to see
if the bio is still ok with the physical offset, and if it is not split the
bio up and redo the bio_add_page with the physical sector. This fixes the
problem for Alex and doesn't affect performance in the normal case. Thanks,
Reported-and-tested-by: Alex Elder <elder@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The comment is not coincident with the code. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
div_factor{_fine} has been implemented for two times, cleanup it.
And I move them into a independent file named math.h because they are
common math functions.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Replaces the groups mark_lock spinlock with a mutex. Using a mutex instead
of a spinlock results in more flexibility (i.e it allows to sleep while the
lock is held).
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
a mark should be destroyed
This patch adds an extra flag to mark_remove_from_mask() to inform the caller if
the mark should be destroyed.
With this we dont destroy the mark implicitly in the function itself any more
but let the caller handle it.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
Race-free addition and removal of a mark to a groups mark list would be easier
if we could lock the mark list of group before we lock the specific mark.
This patch changes the order used to add/remove marks to/from mark lists from
1. mark->lock
2. group->mark_lock
3. inode->i_lock
to
1. group->mark_lock
2. mark->lock
3. inode->i_lock
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
Get a group ref for each mark that is added to the groups list and release that
ref when the mark is freed in fsnotify_put_mark().
We also use get a group reference for duplicated marks and for private event
data.
Now we dont free a group any more when the number of marks becomes 0 but when
the groups ref count does. Since this will only happen when all marks are removed
from a groups mark list, we dont have to set the groups number of marks to 1 at
group creation.
Beside clearing all marks in fsnotify_destroy_group() we do also flush the
groups event queue. This is since events may hold references to groups (due to
private event data) and we have to put those references first before we get a
chance to put the final ref, which will result in a call to
fsnotify_final_destroy_group().
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
Introduce fsnotify_get_group() which increments the reference counter of a group.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
Currently in fsnotify_put_group() the ref count of a group is decremented and if
it becomes 0 fsnotify_destroy_group() is called. Since a groups ref count is only
at group creation set to 1 and never increased after that a call to fsnotify_put_group()
always results in a call to fsnotify_destroy_group().
With this patch fsnotify_destroy_group() is called directly.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
|
|
If we have a read oplock and set a read lock in it, we can't write to the
locked area - so, filemap_fdatawrite may fail with a no information for a
userspace application even if we request a write to non-locked area. Fix
this by populating the page cache without marking affected pages dirty
after a successful write directly to the server.
Also remove CONFIG_CIFS_SMB2 ifdefs because it's suitable for both CIFS
and SMB2 protocols.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
This should fix a regression that was introduced when the new mount
option parser went in. Also, when the unc= and prefixpath= options
are provided, check their values against the ones we parsed from
the device string. If they differ, then throw a warning that tells
the user that we're using the values from the unc= option for now,
but that that will change in 3.10.
Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
Currently the code takes care to ensure that the prefixpath has a
leading '/' delimiter. What if someone passes us a prefixpath with a
leading '\\' instead? The code doesn't properly handle that currently
AFAICS.
Let's just change the code to skip over any leading delimiter character
when copying the prepath. Then, fix up the users of the prepath option
to prefix it with the correct delimiter when they use it.
Also, there's no need to limit the length of the prefixpath to 1k. If
the server can handle it, why bother forbidding it?
Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
Make sure we free any existing memory allocated for vol->UNC, just in
case someone passes in multiple unc= options.
Get rid of the check for too long a UNC. The check for >300 bytes seems
arbitrary. We later copy this into the tcon->treeName, for instance and
it's a lot shorter than 300 bytes.
Eliminate an extra kmalloc and copy as well. Just set the vol->UNC
directly with the contents of match_strdup.
Establish that the UNC should be stored with '\\' delimiters. Use
convert_delimiter to change it in place in the vol->UNC.
Finally, move the check for a malformed UNC into
cifs_parse_mount_options so we can catch that situation earlier.
Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
The authority fields are supposed to be represented by a single 48-bit
value. It's also supposed to represent the value as hex if it's equal to
or greater than 2^32. This is documented in MS-DTYP, section 2.4.2.1.
Also, fix up the max string length to account for this fix.
Acked-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
If the server sends us a target that looks like an outlier, but
is lower than the existing target, then respect it anyway.
However defer actually updating the generation counter until
we get a target that doesn't look like an outlier.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
if ore_write() fails, we would unlock the pages of pcol, which is now
empty, rather than pcol_copy which owns the pages when ore_write() is
called. this means that no pages will actually be unlocked
(pcol.nr_pages == 0) and the writing process (more accurately, the
syncing process) will hang waiting for a writeback notification that
never comes.
moreover, if ore_write() fails, pcol_free() is called for pcol, whereas
pcol_copy is the object owning the ore_io_state, thus leaking the
ore_io_state.
[Boaz]
I have simplified Idan's original patch a bit, everything else still
holds
Signed-off-by: Idan Kedar <idank@tonian.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
|
|
Most (all) NFS4ERR_BADSLOT errors are due to the client failing to
respect the server's sr_highest_slotid limit. This mainly happens
due to reordered RPC requests.
The way to handle it is simply to drop the slot that we're using,
and retry using the new highest_slotid limits.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
|
|
Jian reported that the following sequence would leave "testfile" with
corrupt data:
# mount localhost:/export /mnt/nfs/ -o vers=3
# echo abc > /mnt/nfs/testfile; echo def >> /export/testfile; echo ghi >> /mnt/nfs/testfile
# cat -v /export/testfile
abc
^@^@^@^@ghi
While there's no locking involved here, the operations are serialized,
so CTO should prevent corruption.
The first write to the file is fine and writes 4 bytes. The file is then
extended on the server. When it's reopened a GETATTR is issued and the
size change is noticed. This causes NFS_INO_INVALID_DATA to be set on
the file. Because the file is opened for write only,
nfs_want_read_modify_write() returns 0 to nfs_write_begin().
nfs_updatepage then calls nfs_write_pageuptodate() to see if it should
extend the nfs_page to cover the whole page. NFS_INO_INVALID_DATA is
still set on the file at that point, but that flag is ignored and
nfs_pageuptodate erroneously extends the write to cover the whole page,
with the write done on the server side filled in with zeroes.
This patch just has that function check for NFS_INO_INVALID_DATA in
addition to NFS_INO_REVAL_PAGECACHE. This fixes the bug, but looking
over the code, I wonder if we might have a similar bug in
nfs_revalidate_size(). The difference between those two flags is very
subtle, so it seems like we ought to be checking for
NFS_INO_INVALID_DATA in most of the places that we look for
NFS_INO_REVAL_PAGECACHE.
I believe this is regression introduced by commit 8d197a568. The code
did check for NFS_INO_INVALID_DATA prior to that patch.
Original bug report is here:
https://bugzilla.redhat.com/show_bug.cgi?id=885743
Cc: <stable@vger.kernel.org> # 3.5+
Reported-by: Jian Li <jiali@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|