summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2013-11-11Btrfs: fix BUG_ON() casued by the reserved space migrationMiao Xie
When we did space balance and snapshot creation at the same time, we might meet the following oops: kernel BUG at fs/btrfs/inode.c:3038! [SNIP] Call Trace: [<ffffffffa0411ec7>] btrfs_orphan_cleanup+0x293/0x407 [btrfs] [<ffffffffa042dc45>] btrfs_mksubvol.isra.28+0x259/0x373 [btrfs] [<ffffffffa042de85>] btrfs_ioctl_snap_create_transid+0x126/0x156 [btrfs] [<ffffffffa042dff1>] btrfs_ioctl_snap_create_v2+0xd0/0x121 [btrfs] [<ffffffffa0430b2c>] btrfs_ioctl+0x414/0x1854 [btrfs] [<ffffffff813b60b7>] ? __do_page_fault+0x305/0x379 [<ffffffff811215a9>] vfs_ioctl+0x1d/0x39 [<ffffffff81121d7c>] do_vfs_ioctl+0x32d/0x3e2 [<ffffffff81057fe7>] ? finish_task_switch+0x80/0xb8 [<ffffffff81121e88>] SyS_ioctl+0x57/0x83 [<ffffffff813b39ff>] ? do_device_not_available+0x12/0x14 [<ffffffff813b99c2>] system_call_fastpath+0x16/0x1b [SNIP] RIP [<ffffffffa040da40>] btrfs_orphan_add+0xc3/0x126 [btrfs] The reason of the problem is that the relocation root creation stole the reserved space, which was reserved for orphan item deletion. There are several ways to fix this problem, one is to increasing the reserved space size of the space balace, and then we can use that space to create the relocation tree for each fs/file trees. But it is hard to calculate the suitable size because we doesn't know how many fs/file trees we need relocate. We fixed this problem by reserving the space for relocation root creation actively since the space it need is very small (one tree block, used for root node copy), then we use that reserved space to create the relocation tree. If we don't reserve space for relocation tree creation, we will use the reserved space of the balance. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11btrfs: remove unused parameter from btrfs_header_fsidRoss Kirk
Remove unused parameter, 'eb'. Unused since introduction in 5f39d397dfbe140a14edecd4e73c34ce23c4f9ee Updated to be rebased against current upstream and correct diff supplied this time! Signed-off-by: Ross Kirk <ross.kirk@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: fix two use-after-free bugs with transaction cleanupJosef Bacik
I was noticing the slab redzone stuff going off every once and a while during transaction aborts. This was caused by two things 1) We would walk the pending snapshots and set their error to -ECANCELED. We don't need to do this, the snapshot stuff waits for a transaction commit and if there is a problem we just free our pending snapshot object and exit. Doing this was causing us to touch the pending snapshot object after the thing had already been freed. 2) We were freeing the transaction manually with wanton disregard for it's use_count reference counter. To fix this I cleaned up the transaction freeing loop to either wait for the transaction commit to finish if it was in the middle of that (since it will be cleaned and freed up there) or to do the cleanup oursevles. I also moved the global "kill all things dirty everywhere" stuff outside of the transaction cleanup loop since that only needs to be done once. With this patch I'm no longer seeing slab corruption because of use after frees. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: remove all BUG_ON()'s from commit_cowonly_rootsJosef Bacik
Noticed this when forcing errors to happen during delayed ref running. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: don't delete ordered roots from list during cleanupJosef Bacik
During transaction cleanup after an abort we are just removing roots from the ordered roots list which is incorrect. We have a BUG_ON() to make sure that the root is still part of the ordered roots list when we put our ordered extent which we were tripping in this case. So do like we do everywhere else and just move it to the tail of the ordered roots list and allow the normal cleanup to take care of stuff. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: cleanup transaction on abortJosef Bacik
If we abort not during a transaction commit we won't clean up anything until we unmount. Unfortunately if we abort in the middle of writing out an ordered extent we won't clean it up and if somebody is waiting on that ordered extent they will wait forever. To fix this just make the transaction kthread call the cleanup transaction stuff if it notices theres an error, and make btrfs_end_transaction wake up the transaction kthread if there is an error. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: do not release metadata for space cache inodesJosef Bacik
I've been testing our error paths and I was tripping the BUG_ON() in drop_outstanding_extent because our outstanding_extents is 0 for space cache inodes. This is because we don't reserve metadata space for these inodes since we depend on the global block reserve for our space. To fix this we need to make sure the DO_ACCOUNTING stuff doesn't actually call release_metadata for space cache inodes. With this patch I'm no longer panicing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: reset intwrite on transaction abortJosef Bacik
If we abort a transaction in the middle of a commit we weren't undoing the intwrite locking. This patch fixes that problem. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: relocate csums properly with prealloc extentsJosef Bacik
A user reported a problem where they were getting csum errors when running a balance and running systemd's journal. This is because systemd is awesome and fallocate()'s its log space and writes into it. Unfortunately we assume that when we read in all the csums for an extent that they are sequential starting at the bytenr we care about. This obviously isn't the case for prealloc extents, where we could have written to the middle of the prealloc extent only, which means the csum would be for the bytenr in the middle of our range and not the front of our range. Fix this by offsetting the new bytenr we are logging to based on the original bytenr the csum was for. With this patch I no longer see the csum errors I was seeing. Thanks, Cc: stable@vger.kernel.org Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: don't leak block group on errorFilipe David Borba Manana
In extent-tree.c:btrfs_write_dirty_block_groups(), if the call to write_one_cache_group() failed, we would return without putting the block group first. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: fix sync fs to actually wait for all data to be persistedFilipe David Borba Manana
Currently the fs sync function (super.c:btrfs_sync_fs()) doesn't wait for delayed work to finish before returning success to the caller. This change fixes this, ensuring that there's no data loss if a power failure happens right after fs sync returns success to the caller and before the next commit happens. Steps to reproduce the data loss issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ perl -e '$d = ("\x41" x 6001); open($f,">","/mnt/btrfs/foobar"); print $f $d; close($f);' && btrfs fi sync /mnt/btrfs Right after the btrfs fi sync command (a second or 2 for example), power off the machine and reboot it. The file will be empty, as it can be verified after mounting the filesystem and through btrfs-debug-tree: $ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8 item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36 location key (257 INODE_ITEM 0) type FILE namelen 6 datalen 0 name: foobar item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160 inode generation 7 transid 7 size 0 block group 0 mode 100644 links 1 item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16 inode ref index 2 namelen 6 name: foobar checksum tree key (CSUM_TREE ROOT_ITEM 0) leaf 29429760 items 0 free space 3995 generation 7 owner 7 fs uuid 6192815c-af2a-4b75-b3db-a959ffb6166e chunk uuid b529c44b-938c-4d3d-910a-013b4700bcae uuid tree key (UUID_TREE ROOT_ITEM 0) After this patch, the data loss no longer happens after a power failure and btrfs-debug-tree shows: $ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8 item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36 location key (257 INODE_ITEM 0) type FILE namelen 6 datalen 0 name: foobar item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160 inode generation 6 transid 6 size 6001 block group 0 mode 100644 links 1 item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16 inode ref index 2 namelen 6 name: foobar item 6 key (257 EXTENT_DATA 0) itemoff 3522 itemsize 53 extent data disk byte 12845056 nr 8192 extent data offset 0 nr 8192 ram 8192 extent compression 0 checksum tree key (CSUM_TREE ROOT_ITEM 0) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: fix tracking of orphan inode countFilipe David Borba Manana
In inode.c:btrfs_orphan_add() if we failed to insert the orphan item, we would return without decrementing the orphan count that we just incremented before attempting the insertion, leaving the orphan inode count wrong. In inode.c:btrfs_orphan_del(), we were decrementing the inode orphan count if the bit BTRFS_INODE_ORPHAN_META_RESERVED was set, which is logically wrong because it should be decremented if the bit BTRFS_INODE_HAS_ORPHAN_ITEM was set - after all we increment the count when we set the bit BTRFS_INODE_HAS_ORPHAN_ITEM elsewhere. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: export btrfs space shared info to userspaceLiu Bo
Similar to ocfs2, btrfs also supports that extents can be shared by different inodes, and there are some userspace tools requesting for this kind of 'space shared infomation'.[1] ocfs2 uses flag FIEMAP_EXTENT_SHARED, so does btrfs. [1]: http://thr3ads.net/ocfs2-devel/2010/09/489052-PATCH-3-3-shared-du-using-fiemap-to-figure-up-the-shared-extents-per-file-and-the-footprint-in Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: remove path arg from btrfs_truncate_free_space_cacheFilipe David Borba Manana
Not used for anything, and removing it avoids caller's need to allocate a path structure. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: remove duplicated ino cache's inode lookupFilipe David Borba Manana
We're doing a unnecessary extra lookup of the ino cache's inode when we already have it (and holding a reference) during the process of saving the ino cache contents to disk. Therefore remove this extra lookup. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: do a full search everytime in btrfs_search_old_slotJosef Bacik
While running some snashot aware defrag tests I noticed I was panicing every once and a while in key_search. This is because of the optimization that says if we find a key at slot 0 it will be at slot 0 all the way down the rest of the tree. This isn't the case for btrfs_search_old_slot since it will likely replay changes to a buffer if something has changed since we took our sequence number. So short circuit this optimization by setting prev_cmp to -1 every time we call key_search so we will do our normal binary search. With this patch I am no longer seeing the panics I was seeing before. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: add a sanity test for btrfs_split_itemJosef Bacik
While looking at somebodys corruption I became completely convinced that btrfs_split_item was broken, so I wrote this test to verify that it was working as it was supposed to. Thankfully it appears to be working as intended, so just add this test to make sure nobody breaks it in the future. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11btrfs: drop unused parameter from btrfs_item_nrRoss Kirk
Remove unused eb parameter from btrfs_item_nr Signed-off-by: Ross Kirk <ross.kirk@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: don't store NULL byte in symlink extentsFilipe David Borba Manana
It is not necessary to store the NULL byte in a symlink inline file extent. There's currently no code that requires the NULL byte to be present in the extent. This change also doesn't break file format compatibility nor the send/receive feature. The VFS also doesn't need the NULL byte to be present in the extent, as it reads up to inode->i_size bytes (which already excluded the NULL byte) and sets the NULL byte for us (in fs/namei.c:page_getlink()). So with this change we save 1 byte per symlink file extent (which is always inlined in the btree leaf) without losing backward and forward compatibility. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: eliminate the exceptional root_tree refs=0Stefan Behrens
The fact that btrfs_root_refs() returned 0 for the tree_root caused bugs in the past, therefore it is set to 1 with this patch and (hopefully) all affected code is adapted to this change. I verified this change by temporarily adding WARN_ON() checks everywhere where btrfs_root_refs() is used, checking whether the logic of the code is changed by btrfs_root_refs() returning 1 instead of 0 for root->root_key.objectid == BTRFS_ROOT_TREE_OBJECTID. With these added checks, I ran the xfstests './check -g auto'. The two roots chunk_root and log_root_tree that are only referenced by the superblock and the log_roots below the log_root_tree still have btrfs_root_refs() == 0, only the tree_root is changed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-12Merge branch 'x86-uaccess-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 uaccess changes from Ingo Molnar: "A single change that micro-optimizes __copy_*_user_inatomic(), used by the futex code" * 'x86-uaccess-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Add 1/2/4/8 byte optimization to 64bit __copy_{from,to}_user_inatomic
2013-11-12Merge branch 'x86-reboot-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 reboot changes from Ingo Molnar: "Misc changes - the only one with functional impact should be commit 16c21ae5ca63 ("reboot: Allow specifying warm/cold reset for CF9 boot type") which extends cold/warm reboot handling to the 0xCF9 reboot method" * 'x86-reboot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/reboot: Correct pr_info() log message in the set_bios/pci/kbd_reboot() x86/reboot: Sort reboot DMI quirks by vendor x86/reboot: Remove the duplicate C6100 entry in the reboot quirks list reboot: Allow specifying warm/cold reset for CF9 boot type
2013-11-12Merge branch 'x86-platform-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 platform fixlet from Ingo Molnar: "A single __initdata fix" * 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/geode: Fix incorrect placement of __initdata tag
2013-11-12Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm fixlet from Ingo Molnar: "One cleanup that documents a particular detail in init_mem_mapping()" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Add 'step_size' comments to init_mem_mapping()
2013-11-12Merge branch 'x86-mce-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS changes from Ingo Molnar: "The biggest change adds support for Intel 'CPER' (UEFI Common Platform Error Record) error logging, which builds upon an enhanced error logging mechanism available on Xeon processors. Full description is here: http://www.intel.com/content/www/us/en/architecture-and-technology/enhanced-mca-logging-xeon-paper.html This change provides a module (and support code) to check for an extended error log and prints extra details about the error on the console" * 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: ACPI, x86: Fix extended error log driver to depend on CONFIG_X86_LOCAL_APIC dmi: Avoid unaligned memory access in save_mem_devices() Move cper.c from drivers/acpi/apei to drivers/firmware/efi EDAC, GHES: Update ghes error record info ACPI, APEI, CPER: Cleanup CPER memory error output format ACPI, APEI, CPER: Enhance memory reporting capability ACPI, APEI, CPER: Add UEFI 2.4 support for memory error DMI: Parse memory device (type 17) in SMBIOS ACPI, x86: Extended error log driver for x86 platform bitops: Introduce a more generic BITMASK macro ACPI, CPER: Update cper info ACPI, APEI, CPER: Fix status check during error printing
2013-11-12Merge branch 'x86-iommu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 iommu changes from Ingo Molnar: "Make it easier to turn off the old AMD GART code" * 'x86-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/iommu: Clean up the CONFIG_GART_IOMMU config option a bit x86/iommu: Don't make AMD_GART depend on EXPERT and default y
2013-11-12Merge branch 'x86-intel-mid-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/intel-mid changes from Ingo Molnar: "Update the 'intel mid' (mobile internet device) platform code as Intel is rolling out more SoC designs. This gets rid of most of the 'MRST' platform code in the process, mostly by renaming and shuffling code around into their respective 'intel-mid' platform drivers" * 'x86-intel-mid-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, intel-mid: Do not re-introduce usage of obsolete __cpuinit intel_mid: Move platform device setups to their own platform_<device>.* files x86: intel-mid: Add section for sfi device table intel-mid: sfi: Allow struct devs_id.get_platform_data to be NULL intel_mid: Moved SFI related code to sfi.c intel_mid: Added custom handler for ipc devices intel_mid: Added custom device_handler support intel_mid: Refactored sfi_parse_devs() function intel_mid: Renamed *mrst* to *intel_mid* pci: intel_mid: Return true/false in function returning bool intel_mid: Renamed *mrst* to *intel_mid* mrst: Fixed indentation issues mrst: Fixed printk/pr_* related issues
2013-11-12Merge branch 'x86-hyperv-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/hyperv changes from Ingo Molnar: "These changes enable Linux guests to boot as 'Modern VM' guest kernels on MS-Hyperv hosts" * 'x86-hyperv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, hyperv: Move a variable to avoid an unused variable warning x86, hyperv: Fix build error due to missing <asm/apic.h> include x86, hyperv: Correctly guard the local APIC calibration code x86, hyperv: Get the local APIC timer frequency from the hypervisor
2013-11-12Merge branch 'x86-efi-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 EFI changes from Ingo Molnar: "Main changes: - Add support for earlyprintk=efi which uses the EFI framebuffer. Very useful for debugging boot problems. - EFI stub support for large memory maps (more than 128 entries) - EFI ARM support - this was mostly done by generalizing x86 <-> ARM platform differences, such as by moving x86 EFI code into drivers/firmware/efi/ and sharing it with ARM. - Documentation updates - misc fixes" * 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits) x86/efi: Add EFI framebuffer earlyprintk support boot, efi: Remove redundant memset() x86/efi: Fix config_table_type array termination x86 efi: bugfix interrupt disabling sequence x86: EFI stub support for large memory maps efi: resolve warnings found on ARM compile efi: Fix types in EFI calls to match EFI function definitions. efi: Renames in handle_cmdline_files() to complete generalization. efi: Generalize handle_ramdisks() and rename to handle_cmdline_files(). efi: Allow efi_free() to be called with size of 0 efi: use efi_get_memory_map() to get final map for x86 efi: generalize efi_get_memory_map() efi: Rename __get_map() to efi_get_memory_map() efi: Move unicode to ASCII conversion to shared function. efi: Generalize relocate_kernel() for use by other architectures. efi: Move relocate_kernel() to shared file. efi: Enforce minimum alignment of 1 page on allocations. efi: Rename memory allocation/free functions efi: Add system table pointer argument to shared functions. efi: Move common EFI stub code from x86 arch code to common location ...
2013-11-12Merge branch 'x86-cpu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu changes from Ingo Molnar: "The biggest change that stands out is the increase of the CONFIG_NR_CPUS range from 4096 to 8192 - as real hardware out there already went beyond 4k CPUs ... We only allow more than 512 CPUs if offstack cpumasks are enabled. CONFIG_MAXSMP=y remains to be the 'you are nuts!' extreme testcase, which now means a max of 8192 CPUs" * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Increase max CPU count to 8192 x86/cpu: Allow higher NR_CPUS values x86/cpu: Always print SMP information in /proc/cpuinfo x86/cpu: Track legacy CPU model data only on 32-bit kernels
2013-11-12Merge branch 'x86-cleanups-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Two small cleanups" * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, msr: Use file_inode(), not f_mapping->host x86: mkpiggy.c: Explicitly close the output file
2013-11-12Merge branch 'x86-build-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 build changes from Ingo Molnar: "Two small changes" * 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, defconfig: Add DEVTMPFS and DEVTMPFS_MOUNT to *86*_defconfig x86, build: move build output statistics away from stderr
2013-11-12Merge branch 'x86-boot-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot changes from Ingo Molnar: "Two changes that prettify and compactify the SMP bootup output from: smpboot: Booting Node 0, Processors #1 #2 #3 OK smpboot: Booting Node 1, Processors #4 #5 #6 #7 OK smpboot: Booting Node 2, Processors #8 #9 #10 #11 OK smpboot: Booting Node 3, Processors #12 #13 #14 #15 OK Brought up 16 CPUs to something like: x86: Booting SMP configuration: .... node #0, CPUs: #1 #2 #3 .... node #1, CPUs: #4 #5 #6 #7 .... node #2, CPUs: #8 #9 #10 #11 .... node #3, CPUs: #12 #13 #14 #15 x86: Booted up 4 nodes, 16 CPUs" * 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Further compress CPUs bootup message x86: Improve the printout of the SMP bootup CPU table
2013-11-12Merge branch 'x86-asm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 user access changes from Ingo Molnar: "This tree contains two copy_[from/to]_user() build time checking changes/enhancements from Jan Beulich. The desired outcome is to get better compiler warnings with CONFIG_DEBUG_STRICT_USER_COPY_CHECKS=y, to keep people from introducing bugs such as overflows and information leaks" * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Unify copy_to_user() and add size checking to it x86: Unify copy_from_user() size checking
2013-11-12Merge branch 'x86-apic-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/apic fix from Ingo Molnar: "A single fix to the IO-APIC / local-APIC shutdown sequence" * 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/apic: Disable I/O APIC before shutdown of the local APIC
2013-11-12Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer changes from Ingo Molnar: "Main changes in this cycle were: - Updated full dynticks support. - Event stream support for architected (ARM) timers. - ARM clocksource driver updates. - Move arm64 to using the generic sched_clock framework & resulting cleanup in the generic sched_clock code. - Misc fixes and cleanups" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) x86/time: Honor ACPI FADT flag indicating absence of a CMOS RTC clocksource: sun4i: remove IRQF_DISABLED clocksource: sun4i: Report the minimum tick that we can program clocksource: sun4i: Select CLKSRC_MMIO clocksource: Provide timekeeping for efm32 SoCs clocksource: em_sti: convert to clk_prepare/unprepare time: Fix signedness bug in sysfs_get_uname() and its callers timekeeping: Fix some trivial typos in comments alarmtimer: return EINVAL instead of ENOTSUPP if rtcdev doesn't exist clocksource: arch_timer: Do not register arch_sys_counter twice timer stats: Add a 'Collection: active/inactive' line to timer usage statistics sched_clock: Remove sched_clock_func() hook arch_timer: Move to generic sched_clock framework clocksource: tcb_clksrc: Remove IRQF_DISABLED clocksource: tcb_clksrc: Improve driver robustness clocksource: tcb_clksrc: Replace clk_enable/disable with clk_prepare_enable/disable_unprepare clocksource: arm_arch_timer: Use clocksource for suspend timekeeping clocksource: dw_apb_timer_of: Mark a few more functions as __init clocksource: Put nodes passed to CLOCKSOURCE_OF_DECLARE callbacks centrally arm: zynq: Enable arm_global_timer ...
2013-11-12Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "The main changes in this cycle are: - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik van Riel, Peter Zijlstra et al. Yay! - optimize preemption counter handling: merge the NEED_RESCHED flag into the preempt_count variable, by Peter Zijlstra. - wait.h fixes and code reorganization from Peter Zijlstra - cfs_bandwidth fixes from Ben Segall - SMP load-balancer cleanups from Peter Zijstra - idle balancer improvements from Jason Low - other fixes and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits) ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED stop_machine: Fix race between stop_two_cpus() and stop_cpus() sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus sched: Fix asymmetric scheduling for POWER7 sched: Move completion code from core.c to completion.c sched: Move wait code from core.c to wait.c sched: Move wait.c into kernel/sched/ sched/wait: Fix __wait_event_interruptible_lock_irq_timeout() sched: Avoid throttle_cfs_rq() racing with period_timer stopping sched: Guarantee new group-entities always have weight sched: Fix hrtimer_cancel()/rq->lock deadlock sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining sched: Fix race on toggling cfs_bandwidth_used sched: Remove extra put_online_cpus() inside sched_setaffinity() sched/rt: Fix task_tick_rt() comment sched/wait: Fix build breakage sched/wait: Introduce prepare_to_wait_event() sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too sched: Remove get_online_cpus() usage sched: Fix race in migrate_swap_stop() ...
2013-11-12Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "As a first remark I'd like to note that the way to build perf tooling has been simplified and sped up, in the future it should be enough for you to build perf via: cd tools/perf/ make install (ie without the -j option.) The build system will figure out the number of CPUs and will do a parallel build+install. The various build system inefficiencies and breakages Linus reported against the v3.12 pull request should now be resolved - please (re-)report any remaining annoyances or bugs. Main changes on the perf kernel side: * Performance optimizations: . perf ring-buffer code optimizations, by Peter Zijlstra . perf ring-buffer code optimizations, by Oleg Nesterov . x86 NMI call-stack processing optimizations, by Peter Zijlstra . perf context-switch optimizations, by Peter Zijlstra . perf sampling speedups, by Peter Zijlstra . x86 Intel PEBS processing speedups, by Peter Zijlstra * Enhanced hardware support: . for Intel Ivy Bridge-EP uncore PMUs, by Zheng Yan . for Haswell transactions, by Andi Kleen, Peter Zijlstra * Core perf events code enhancements and fixes by Oleg Nesterov: . for uprobes, if fork() is called with pending ret-probes . for uprobes platform support code * New ABI details by Andi Kleen: . Report x86 Haswell TSX transaction abort cost as weight Main changes on the perf tooling side (some of these tooling changes utilize the above kernel side changes): * 'perf report/top' enhancements: . Convert callchain children list to rbtree, greatly reducing the time taken for callchain processing, from Namhyung Kim. . Add new COMM infrastructure, further improving histogram processing, from Frédéric Weisbecker, one fix from Namhyung Kim. . Add /proc/kcore based live-annotation improvements, including build-id cache support, multi map 'call' instruction navigation fixes, kcore address validation, objdump workarounds. From Adrian Hunter. . Show progress on histogram collapsing, that can take a long time, from Namhyung Kim. . Add --max-stack option to limit callchain stack scan in 'top' and 'report', improving callchain processing when reducing the stack depth is an option, from Waiman Long. . Add new option --ignore-vmlinux for perf top, from Willy Tarreau. * 'perf trace' enhancements: . 'perf trace' now can can use a 'perf probe' dynamic tracepoints to hook into the userspace -> kernel pathname copy so that it can map fds to pathnames without reading /proc/pid/fd/ symlinks. From Arnaldo Carvalho de Melo. . Show VFS path associated with fd in live sessions, using a 'vfs_getname' 'perf probe' created dynamic tracepoint or by looking at /proc/pid/fd, from Arnaldo Carvalho de Melo. . Add 'trace' beautifiers for lots of syscall arguments, from Arnaldo Carvalho de Melo. . Implement more compact 'trace' output by suppressing zeroed args, from Arnaldo Carvalho de Melo. . Show thread COMM by default in 'trace', from Arnaldo Carvalho de Melo. . Add option to show full timestamp in 'trace', from David Ahern. . Add 'record' command in 'trace', to record raw_syscalls:*, from David Ahern. . Add summary option to dump syscall statistics in 'trace', from David Ahern. . Improve error messages in 'trace', providing hints about system configuration steps needed for using it, from Ramkumar Ramachandra. . 'perf trace' now emits hints as to why tracing is not possible, helping the user to setup the system to allow tracing in the desired permission granularity, telling if the problem is due to debugfs not being mounted or with not enough permission for !root, /proc/sys/kernel/perf_event_paranoit value, etc. From Arnaldo Carvalho de Melo. * 'perf record' enhancements: . Check maximum frequency rate for record/top, emitting better error messages, from Jiri Olsa. . 'perf record' code cleanups, from David Ahern. . Improve write_output error message in 'perf record', from Adrian Hunter. . Allow specifying B/K/M/G unit to the --mmap-pages arguments, from Jiri Olsa. . Fix command line callchain attribute tests to handle the new -g/--call-chain semantics, from Arnaldo Carvalho de Melo. * 'perf kvm' enhancements: . Disable live kvm command if timerfd is not supported, from David Ahern. . Fix detection of non-core features, from David Ahern. * 'perf list' enhancements: . Add usage to 'perf list', from David Ahern. . Show error in 'perf list' if tracepoints not available, from Pekka Enberg. * 'perf probe' enhancements: . Support "$vars" meta argument syntax for local variables, allowing asking for all possible variables at a given probe point to be collected when it hits, from Masami Hiramatsu. * 'perf sched' enhancements: . Address the root cause of that 'perf sched' stack initialization build slowdown, by programmatically setting a big array after moving the global variable back to the stack. Fix from Adrian Hunter. * 'perf script' enhancements: . Set up output options for in-stream attributes, from Adrian Hunter. . Print addr by default for BTS in 'perf script', from Adrian Juntmer * 'perf stat' enhancements: . Improved messages when doing profiling in all or a subset of CPUs using a workload as the session delimitator, as in: 'perf stat --cpu 0,2 sleep 10s' from Arnaldo Carvalho de Melo. . Add units to nanosec-based counters in 'perf stat', from David Ahern. . Remove bogus info when using 'perf stat' -e cycles/instructions, from Ramkumar Ramachandra. * 'perf lock' enhancements: . 'perf lock' fixes and cleanups, from Davidlohr Bueso. * 'perf test' enhancements: . Fixup PERF_SAMPLE_TRANSACTION handling in sample synthesizing and 'perf test', from Adrian Hunter. . Clarify the "sample parsing" test entry, from Arnaldo Carvalho de Melo. . Consider PERF_SAMPLE_TRANSACTION in the "sample parsing" test, from Arnaldo Carvalho de Melo. . Memory leak fixes in 'perf test', from Felipe Pena. * 'perf bench' enhancements: . Change the procps visible command-name of invididual benchmark tests plus cleanups, from Ingo Molnar. * Generic perf tooling infrastructure/plumbing changes: . Separating data file properties from session, code reorganization from Jiri Olsa. . Fix version when building out of tree, as when using one of these: $ make help | grep perf perf-tar-src-pkg - Build perf-3.12.0.tar source tarball perf-targz-src-pkg - Build perf-3.12.0.tar.gz source tarball perf-tarbz2-src-pkg - Build perf-3.12.0.tar.bz2 source tarball perf-tarxz-src-pkg - Build perf-3.12.0.tar.xz source tarball $ from David Ahern. . Enhance option parse error message, showing just the help lines of the options affected, from Namhyung Kim. . libtraceevent updates from upstream trace-cmd repo, from Steven Rostedt. . Always use perf_evsel__set_sample_bit to set sample_type, from Adrian Hunter. . Memory and mmap leak fixes from Chenggang Qin. . Assorted build fixes for from David Ahern and Jiri Olsa. . Speed up and prettify the build system, from Ingo Molnar. . Implement addr2line directly using libbfd, from Roberto Vitillo. . Separate the GTK support in a separate libperf-gtk.so DSO, that is only loaded when --gtk is specified, from Namhyung Kim. . perf bash completion fixes and improvements from Ramkumar Ramachandra. . Support for Openembedded/Yocto -dbg packages, from Ricardo Ribalda Delgado. And lots and lots of other fixes and code reorganizations that did not make it into the list, see the shortlog, diffstat and the Git log for details!" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (300 commits) uprobes: Fix the memory out of bound overwrite in copy_insn() uprobes: Fix the wrong usage of current->utask in uprobe_copy_process() perf tools: Remove unneeded include perf record: Remove post_processing_offset variable perf record: Remove advance_output function perf record: Refactor feature handling into a separate function perf trace: Don't relookup fields by name in each sample perf tools: Fix version when building out of tree perf evsel: Ditch evsel->handler.data field uprobes: Export write_opcode() as uprobe_write_opcode() uprobes: Introduce arch_uprobe->ixol uprobes: Kill module_init() and module_exit() uprobes: Move function declarations out of arch perf/x86/intel: Add Ivy Bridge-EP uncore IRP box support perf/x86/intel/uncore: Add filter support for IvyBridge-EP QPI boxes perf: Factor out strncpy() in perf_event_mmap_event() tools/perf: Add required memory barriers perf: Fix arch_perf_out_copy_user default perf: Update a stale comment perf: Optimize perf_output_begin() -- address calculation ...
2013-11-12Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull leftover IRQ fixes from Ingo Molnar: "Two (minor) fixlets that missed v3.12" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Set the irq thread policy without checking CAP_SYS_NICE irq: DocBook/genericirq.tmpl: Correct various typos
2013-11-12Merge branch 'irq-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull IRQ changes from Ingo Molnar: "The biggest change this cycle are the softirq/hardirq stack interaction and nesting fixes, cleanups and reorganizations from Frederic. This is the longer followup story to the softirq nesting fix that is already upstream (commit ded797547548: "irq: Force hardirq exit's softirq processing on its own stack")" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip: bcm2835: Convert to use IRQCHIP_DECLARE macro powerpc: Tell about irq stack coverage x86: Tell about irq stack coverage irq: Optimize softirq stack selection in irq exit irq: Justify the various softirq stack choices irq: Improve a bit softirq debugging irq: Optimize call to softirq on hardirq exit irq: Consolidate do_softirq() arch overriden implementations x86/irq: Correct comment about i8259 initialization
2013-11-12Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main RCU changes in this cycle are: - Idle entry/exit changes, to throttle callback execution and other refinements to speed up kbuild, primarily to address performance issues located by Tibor Billes. - Grace-period related changes, primarily to aid in debugging, inspired by an -rt debugging session. - Code reorganization moving RCU's source files into its own kernel/rcu/ directory. - RCU documentation updates - Miscellaneous fixes. Note, the following commit: 5c889690aa08 mm: Place preemption point in do_mlockall() loop is identical to the commit already in your tree via email: 22356f447ceb mm: Place preemption point in do_mlockall() loop [ Your version of the changelog nicely demonstrates it how kernel oops messages should be trimmed properly :-/ ]" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits) rcu: Move RCU-related source code to kernel/rcu directory rcu: Fix occurrence of "the the" in checklist.txt kthread: Add pointer to vmstat-avoidance patch rcu: Update stall-warning documentation rcu: Consistent rcu_is_watching() naming rcu: Change EXPORT_SYMBOL() to EXPORT_SYMBOL_GPL() rcu: Is it safe to enter an RCU read-side critical section? rcu: Throttle invoke_rcu_core() invocations due to non-lazy callbacks rcu: Throttle rcu_try_advance_all_cbs() execution rcu: Remove redundant code from rcu_cleanup_after_idle() rcu: Fix CONFIG_RCU_NOCB_CPU_ALL panic on machines with sparse CPU mask rcu: Avoid sparse warnings in rcu_nocb_wake trace event rcu: Track rcu_nocb_kthread()'s sleeping and awakening rcu: Distinguish between NOCB and non-NOCB rcu_callback trace events rcu: Add tracing for rcuo no-CBs CPU wakeup handshake rcu: Add tracing of normal (non-NOCB) grace-period requests rcu: Add tracing to rcu_gp_kthread() rcu: Flag lockless access to ->gp_flags with ACCESS_ONCE() rcu: Prevent spurious-wakeup DoS attack on rcu_gp_kthread() rcu: Improve grace-period start logic ...
2013-11-11Merge tag 'mfd-lee-3.13-3' of git://git.linaro.org/people/ljones/mfdSamuel Ortiz
mfd-lee-3.13-3 MFD patches due for v3.13 - 2nd round.
2013-11-11mtd: gpmi: fix kernel BUG due to racing DMA operationsHuang Shijie
[1] The gpmi uses the nand_command_lp to issue the commands to NAND chips. The gpmi issues a DMA operation with gpmi_cmd_ctrl when it handles a NAND_CMD_NONE control command. So when we read a page(NAND_CMD_READ0) from the NAND, we may send two DMA operations back-to-back. If we do not serialize the two DMA operations, we will meet a bug when 1.1) we enable CONFIG_DMA_API_DEBUG, CONFIG_DMADEVICES_DEBUG, and CONFIG_DEBUG_SG. 1.2) Use the following commands in an UART console and a SSH console: cmd 1: while true;do dd if=/dev/mtd0 of=/dev/null;done cmd 1: while true;do dd if=/dev/mmcblk0 of=/dev/null;done The kernel log shows below: ----------------------------------------------------------------- kernel BUG at lib/scatterlist.c:28! Unable to handle kernel NULL pointer dereference at virtual address 00000000 ......................... [<80044a0c>] (__bug+0x18/0x24) from [<80249b74>] (sg_next+0x48/0x4c) [<80249b74>] (sg_next+0x48/0x4c) from [<80255398>] (debug_dma_unmap_sg+0x170/0x1a4) [<80255398>] (debug_dma_unmap_sg+0x170/0x1a4) from [<8004af58>] (dma_unmap_sg+0x14/0x6c) [<8004af58>] (dma_unmap_sg+0x14/0x6c) from [<8027e594>] (mxs_dma_tasklet+0x18/0x1c) [<8027e594>] (mxs_dma_tasklet+0x18/0x1c) from [<8007d444>] (tasklet_action+0x114/0x164) ----------------------------------------------------------------- 1.3) Assume the two DMA operations is X (first) and Y (second). The root cause of the bug: Assume process P issues DMA X, and sleep on the completion @this->dma_done. X's tasklet callback is dma_irq_callback. It firstly wake up the process sleeping on the completion @this->dma_done, and then trid to unmap the scatterlist S. The waked process P will issue Y in another ARM core. Y initializes S->sg_magic to zero with sg_init_one(), while dma_irq_callback is unmapping S at the same time. See the diagram: ARM core 0 | ARM core 1 ------------------------------------------------------------- (P issues DMA X, then sleep) --> | | (X's tasklet wakes P) --> | | | <-- (P begin to issue DMA Y) | (X's tasklet unmap the | scatterlist S with dma_unmap_sg) --> | <-- (Y calls sg_init_one() to init | scatterlist S) | [2] This patch serialize both the X and Y in the following way: Unmap the DMA scatterlist S firstly, and wake up the process at the end of the DMA callback, in such a way, Y will be executed after X. After this patch: ARM core 0 | ARM core 1 ------------------------------------------------------------- (P issues DMA X, then sleep) --> | | (X's tasklet unmap the | scatterlist S with dma_unmap_sg) --> | | (X's tasklet wakes P) --> | | | <-- (P begin to issue DMA Y) | | <-- (Y calls sg_init_one() to init | scatterlist S) | Cc: stable@vger.kernel.org # 3.2 Signed-off-by: Huang Shijie <b32955@freescale.com> Signed-off-by: Brian Norris <computersforpeace@gmail.com>
2013-11-11perf tests: Use lower sample_freq in sw clock event period testArnaldo Carvalho de Melo
We were using it at 10 kHz, which doesn't work in machines where somehow the max freq was auto reduced by the kernel: [root@ssdandy ~]# perf test 19 19: Test software clock events have valid period values : FAILED! [root@ssdandy ~]# perf test -v 19 19: Test software clock events have valid period values : --- start --- Couldn't open evlist: Invalid argument ---- end ---- Test software clock events have valid period values: FAILED! [root@ssdandy ~]# [root@ssdandy ~]# cat /proc/sys/kernel/perf_event_max_sample_rate 7000 Reducing it to 500 Hz should be good enough for this test and also shouldn't affect what it is testing. But warn the user if it fails, informing the knob and the freq tried. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/n/tip-548rhj1uo6xbwnxa95kw3hqe@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2013-11-11Merge branch 'prandom'David S. Miller
prandom fixes/improvements ==================== It would be great if you could still consider this series that fixes and improves prandom for 3.13. We have sent it to netdev as prandom() originally came from net/core/utils.c and networking is its main user. For a detailled description, please see individual patches. For patch 3 in this series, there will be a minor merge conflict with the random tree that is for 3.13. See below how to resolve it. ==== Hannes says: on merge with the random tree I would suggest to resolve the conflict in drivers/char/random.c like this: if (r->entropy_total > 128) { r->initialized = 1; r->entropy_total = 0; if (r == &nonblocking_pool) { prandom_reseed_late(); pr_notice("random: %s pool is initialized\n", r->name); } } So it won't generate a warning if DEBUG_RANDOM_BOOT gets activated. ==== Patch 1 should probably also go to -stable. Set tested on 32 and 64 bit machines. Thanks a lot! Ref. original discussion: http://patchwork.ozlabs.org/patch/289951/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-11random32: add test cases for taus113 implementationDaniel Borkmann
We generated a battery of 100 test cases from GSL taus113 implemention and compare the results from a particular seed and a particular iteration with our implementation in the kernel. We have verified on 32 and 64 bit machines that our taus113 kernel implementation gives same results as GSL taus113 implementation: [ 0.147370] prandom: seed boundary self test passed [ 0.148078] prandom: 100 self tests passed This is a Kconfig option that is disabled on default, just like the crc32 init selftests in order to not unnecessary slow down boot process. We also refactored out prandom_seed_very_weak() as it's now used in multiple places in order to reduce redundant code. GSL code we used for generating test cases: int i, j; srand(time(NULL)); for (i = 0; i < 100; ++i) { int iteration = 500 + (rand() % 500); gsl_rng_default_seed = rand() + 1; gsl_rng *r = gsl_rng_alloc(gsl_rng_taus113); printf("\t{ %lu, ", gsl_rng_default_seed); for (j = 0; j < iteration - 1; ++j) gsl_rng_get(r); printf("%u, %lu },\n", iteration, gsl_rng_get(r)); gsl_rng_free(r); } Joint work with Hannes Frederic Sowa. Cc: Florian Weimer <fweimer@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-11random32: upgrade taus88 generator to taus113 from errata paperDaniel Borkmann
Since we use prandom*() functions quite often in networking code i.e. in UDP port selection, netfilter code, etc, upgrade the PRNG from Pierre L'Ecuyer's original paper "Maximally Equidistributed Combined Tausworthe Generators", Mathematics of Computation, 65, 213 (1996), 203--213 to the version published in his errata paper [1]. The Tausworthe generator is a maximally-equidistributed generator, that is fast and has good statistical properties [1]. The version presented there upgrades the 3 state LFSR to a 4 state LFSR with increased periodicity from about 2^88 to 2^113. The algorithm is presented in [1] by the very same author who also designed the original algorithm in [2]. Also, by increasing the state, we make it a bit harder for attackers to "guess" the PRNGs internal state. See also discussion in [3]. Now, as we use this sort of weak initialization discussed in [3] only between core_initcall() until late_initcall() time [*] for prandom32*() users, namely in prandom_init(), it is less relevant from late_initcall() onwards as we overwrite seeds through prandom_reseed() anyways with a seed source of higher entropy, that is, get_random_bytes(). In other words, a exhaustive keysearch of 96 bit would be needed. Now, with the help of this patch, this state-search increases further to 128 bit. Initialization needs to make sure that s1 > 1, s2 > 7, s3 > 15, s4 > 127. taus88 and taus113 algorithm is also part of GSL. I added a test case in the next patch to verify internal behaviour of this patch with GSL and ran tests with the dieharder 3.31.1 RNG test suite: $ dieharder -g 052 -a -m 10 -s 1 -S 4137730333 #taus88 $ dieharder -g 054 -a -m 10 -s 1 -S 4137730333 #taus113 With this seed configuration, in order to compare both, we get the following differences: algorithm taus88 taus113 rands/second [**] 1.61e+08 1.37e+08 sts_serial(4, 1st run) WEAK PASSED sts_serial(9, 2nd run) WEAK PASSED rgb_lagged_sum(31) WEAK PASSED We took out diehard_sums test as according to the authors it is considered broken and unusable [4]. Despite that and the slight decrease in performance (which is acceptable), taus113 here passes all 113 tests (only rgb_minimum_distance_5 in WEAK, the rest PASSED). In general, taus/taus113 is considered "very good" by the authors of dieharder [5]. The papers [1][2] states a single warm-up step is sufficient by running quicktaus once on each state to ensure proper initialization of ~s_{0}: Our selection of (s) according to Table 1 of [1] row 1 holds the condition L - k <= r - s, that is, (32 32 32 32) - (31 29 28 25) <= (25 27 15 22) - (18 2 7 13) with r = k - q and q = (6 2 13 3) as also stated by the paper. So according to [2] we are safe with one round of quicktaus for initialization. However we decided to include the warm-up phase of the PRNG as done in GSL in every case as a safety net. We also use the warm up phase to make the output of the RNG easier to verify by the GSL output. In prandom_init(), we also mix random_get_entropy() into it, just like drivers/char/random.c does it, jiffies ^ random_get_entropy(). random-get_entropy() is get_cycles(). xor is entropy preserving so it is fine if it is not implemented by some architectures. Note, this PRNG is *not* used for cryptography in the kernel, but rather as a fast PRNG for various randomizations i.e. in the networking code, or elsewhere for debugging purposes, for example. [*]: In order to generate some "sort of pseduo-randomness", since get_random_bytes() is not yet available for us, we use jiffies and initialize states s1 - s3 with a simple linear congruential generator (LCG), that is x <- x * 69069; and derive s2, s3, from the 32bit initialization from s1. So the above quote from [3] accounts only for the time from core to late initcall, not afterwards. [**] Single threaded run on MacBook Air w/ Intel Core i5-3317U [1] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme2.ps [2] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme.ps [3] http://thread.gmane.org/gmane.comp.encryption.general/12103/ [4] http://code.google.com/p/dieharder/source/browse/trunk/libdieharder/diehard_sums.c?spec=svn490&r=490#20 [5] http://www.phy.duke.edu/~rgb/General/dieharder.php Joint work with Hannes Frederic Sowa. Cc: Florian Weimer <fweimer@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-11random32: move rnd_state to linux/random.hDaniel Borkmann
struct rnd_state got mistakenly pulled into uapi header. It is not used anywhere and does also not belong there! Commit 5960164fde ("lib/random32: export pseudo-random number generator for modules"), the last commit on rnd_state before it got moved to uapi, says: This patch moves the definition of struct rnd_state and the inline __seed() function to linux/random.h. It renames the static __random32() function to prandom32() and exports it for use in modules. Hence, the structure was moved from lib/random32.c to linux/random.h so that it can be used within modules (FCoE-related code in this case), but not from user space. However, it seems to have been mistakenly moved to uapi header through the uapi script. Since no-one should make use of it from the linux headers, move the structure back to the kernel for internal use, so that it can be modified on demand. Joint work with Hannes Frederic Sowa. Cc: Joe Eykholt <jeykholt@cisco.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-11random32: add prandom_reseed_late() and call when nonblocking pool becomes ↵Hannes Frederic Sowa
initialized The Tausworthe PRNG is initialized at late_initcall time. At that time the entropy pool serving get_random_bytes is not filled sufficiently. This patch adds an additional reseeding step as soon as the nonblocking pool gets marked as initialized. On some machines it might be possible that late_initcall gets called after the pool has been initialized. In this situation we won't reseed again. (A call to prandom_seed_late blocks later invocations of early reseed attempts.) Joint work with Daniel Borkmann. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-11random32: add periodic reseedingHannes Frederic Sowa
The current Tausworthe PRNG is never reseeded with truly random data after the first attempt in late_initcall. As this PRNG is used for some critical random data as e.g. UDP port randomization we should try better and reseed the PRNG once in a while with truly random data from get_random_bytes(). When we reseed with prandom_seed we now make also sure to throw the first output away. This suffices the reseeding procedure. The delay calculation is based on a proposal from Eric Dumazet. Joint work with Daniel Borkmann. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>