git.armlinux.org.uk/linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2015-04-15	IB/ipoib: change init sequence ordering	Doug Ledford
	In preparation for using per device work queues, we need to move the start of the neighbor thread task to after ipoib_ib_dev_init and move the destruction of the neighbor task to before ipoib_ib_dev_cleanup. Otherwise we will end up freeing our workqueue with work possibly still on it. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15	IB/ipoib: factor out ah flushing	Doug Ledford
	Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at appropriate times to flush out all remaining ah entries before we shut the device down. Because neighbors and mcast entries can each have a reference on any given ah, we must make sure to free all of those first before our ah will actually have a 0 refcount and be able to be reaped. This factoring is needed in preparation for having per-device work queues. The original per-device workqueue code resulted in the following error message: <ibdev>: ib_dealloc_pd failed That error was tracked down to this issue. With the changes to which workqueues were flushed when, there were no flushes of the per device workqueue after the last ah's were freed, resulting in an attempt to dealloc the pd with outstanding resources still allocated. This code puts the explicit flushes in the needed places to avoid that problem. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15	IB/core: don't disallow registering region starting at 0x0	Yann Droneaud
	In a call to ib_umem_get(), if address is 0x0 and size is already page aligned, check added in commit 8494057ab5e4 ("IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic") will refuse to register a memory region that could otherwise be valid (provided vm.mmap_min_addr sysctl and mmap_low_allowed SELinux knobs allow userspace to map something at address 0x0). This patch allows back such registration: ib_umem_get() should probably don't care of the base address provided it can be pinned with get_user_pages(). There's two possible overflows, in (addr + size) and in PAGE_ALIGN(addr + size), this patch keep ensuring none of them happen while allowing to pin memory at address 0x0. Anyway, the case of size equal 0 is no more (partially) handled as 0-length memory region are disallowed by an earlier check. Link: http://mid.gmane.org/cover.1428929103.git.ydroneaud@opteya.com Cc: <stable@vger.kernel.org> # 8494057ab5e4 ("IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic") Cc: Shachar Raindel <raindel@mellanox.com> Cc: Jack Morgenstein <jackm@mellanox.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15	IB/core: disallow registering 0-sized memory region	Yann Droneaud
	If ib_umem_get() is called with a size equal to 0 and an non-page aligned address, one page will be pinned and a 0-sized umem will be returned to the caller. This should not be allowed: it's not expected for a memory region to have a size equal to 0. This patch adds a check to explicitly refuse to register a 0-sized region. Link: http://mid.gmane.org/cover.1428929103.git.ydroneaud@opteya.com Cc: <stable@vger.kernel.org> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Jack Morgenstein <jackm@mellanox.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15	IB/mlx4: Change alias guids default to be host assigned	Yishai Hadas
	Change the default mode to be HOST assigned instead of SM assigned. This is the expected operational mode, because it doesn't depend on SM availability. As PF generates random GUIDs as the initial admin values, this gives out of the box experience. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15	net/mlx4_core: Return the admin alias GUID upon host view request	Yishai Hadas
	Return the admin alias GUID value upon a GET request via HOST. We do this so that the GUID value requested by the admin is returned even if the SM has not yet approved this GUID (e.g. the SM is down). Note that this does not create a problem, since the virtual port will remain down until the SM does ACK the requested GUID value. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15	net/mlx4_core: Raise slave shutdown event upon FLR	Yishai Hadas
	There might be cases that PF doesn't get a "reset" command upon slave down (e.g. virsh destroy). In these cases, however, an FLR event is issued. Therefore, when the PF receives an FLR event for a slave, it should also generate a shutdown event on the PF for that slave, to let the PF upper layers (mlx4_ib, eth) perform any required cleanup/actions associated with slave shutdown. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15	IB/mlx4: Request alias GUID on demand	Yishai Hadas
	Request GIDs from the SM on demand, i.e., when a VF actually needs them, and release them when the GIDs are no longer in use. In cloud environments, this is useful for GID migrations, in which a GID is assigned to a VF on the destination HCA, while the VF on the source HCA is shutdown (but the GID was not administratively released). Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15	IB/mlx4: Change init flow to request alias GUIDs for active VFs	Yishai Hadas
	Change the init flow to ask GUIDs only for active VFs. This is done for both SM & HOST modes so that there is no need any more to maintain the ownership record type. In case SM mode is used, the initial value will be 0, ask the SM to assign, for the HOST mode the initial value will be the HOST generated GUID. This will enable out of the box experience for both probed and attached VFs. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15	IB/mlx4: Manage admin alias GUID upon admin request	Yishai Hadas
	Set the admin alias GUID per the administrator's request via the sysfs mechanism into the core layer. The "get" request returns the current value. However, if the administrator requests the SM to assign a new value by requesting 0, the SM assigned GUID is returned. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15	net/mlx4_core: Set initial admin GUIDs for VFs	Yishai Hadas
	To have out of the box experience, the PF generates random GUIDs who serve as the initial admin values. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15	net/mlx4_core: Manage alias GUID per VF	Yishai Hadas
	Manages alias GUIDs per VF per port in the core layer. This is a pre-step for managing alias GUIDs in a mode that the admin GUID is returned via ib_query_gid() regardless of whether the SM has approved it or not. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15	IB/mlx4: Alias GUID adding persistency support	Yishai Hadas
	If the SM rejects an alias GUID request the PF driver keeps trying to acquire the specified GUID indefinitely, utilizing an exponential backoff scheme. Retrying is managed per GUID entry. Each entry that wasn't applied holds its next retry information. Retry requests to the SM consist of records of 8 consecutive GUIDS. Each record that contains GUIDs requiring retries holds its next time-to-run based on the retry information of all its GUID entries. The record having the lowest retry time will run first when that retry time arrives. Since the method (SET or DELETE) as sent to the SM applies to all the GUIDs in the record, we must handle SET requests and DELETE requests in separate SM messages (one for SETs and the other for DELETEs). To avoid race conditions where a GUID entry request (set or delete) was modified after the SM request was sent, we save the method and the requested indices as part of the callback's context -- thus, only the requested indexes are evaluated when the response is received. When an GUID entry is approved we turn off its retry-required bit, this prevents redundant SM retries from occurring on that record. The port down event should be sent only when previously it was up. Likewise, the port up event should be sent only if previously the port was down. Synchronization was added around the flows that change entries and record state to prevent race conditions. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15	VFS: assorted d_backing_inode() annotations	David Howells
	Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15	VFS: assorted weird filesystems: d_inode() annotations	David Howells
	Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15	VFS: normal filesystems (and lustre): d_inode() annotations	David Howells
	that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15	VFS: Fix up some ->d_inode accesses in the chelsio driver	David Howells
	Fix up some ->d_inode accesses in the chelsio driver. (1) FILE_DATA() should just be replaced with file_inode(). (2) set_debugfs_file_size() should be removed and debugfs_create_file_size() should be used to create the file. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15	block: loop: switch to VFS ITER_BVEC	Christoph Hellwig
	Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15	i2c: core: Export bus recovery functions	Mark Brown
	Current -next fails to link an ARM allmodconfig because drivers that use the core recovery functions can be built as modules but those functions are not exported: ERROR: "i2c_generic_gpio_recovery" [drivers/i2c/busses/i2c-davinci.ko] undefined! ERROR: "i2c_generic_scl_recovery" [drivers/i2c/busses/i2c-davinci.ko] undefined! ERROR: "i2c_recover_bus" [drivers/i2c/busses/i2c-davinci.ko] undefined! Add exports to fix this. Fixes: 5f9296ba21b3c (i2c: Add bus recovery infrastructure) Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2015-04-15	Merge branch 'next' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "Highlights for this window: - improved AVC hashing for SELinux by John Brooks and Stephen Smalley - addition of an unconfined label to Smack - Smack documentation update - TPM driver updates" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (28 commits) lsm: copy comm before calling audit_log to avoid race in string printing tomoyo: Do not generate empty policy files tomoyo: Use if_changed when generating builtin-policy.h tomoyo: Use bin2c to generate builtin-policy.h selinux: increase avtab max buckets selinux: Use a better hash function for avtab selinux: convert avtab hash table to flex_array selinux: reconcile security_netlbl_secattr_to_sid() and mls_import_netlbl_cat() selinux: remove unnecessary pointer reassignment Smack: Updates for Smack documentation tpm/st33zp24/spi: Add missing device table for spi phy. tpm/st33zp24: Add proper wait for ordinal duration in case of irq mode smack: Fix gcc warning from unused smack_syslog_lock mutex in smackfs.c Smack: Allow an unconfined label in bringup mode Smack: getting the Smack security context of keys Smack: Assign smack_known_web as default smk_in label for kernel thread's socket tpm/tpm_infineon: Use struct dev_pm_ops for power management MAINTAINERS: Add Jason as designated reviewer for TPM tpm: Update KConfig text to include TPM2.0 FIFO chips tpm/st33zp24/dts/st33zp24-spi: Add dts documentation for st33zp24 spi phy ...
2015-04-15	Input: atmel_mxt_ts - add support for Google Pixel 2	Dmitry Torokhov
	This change allows atmel_mxt_ts to bind to ACPI-enumerated devices in Google Pixel 2 (2015). While newer version of ACPI standard allow use of device-tree-like properties in device descriptions, the version of ACPI implemented in Google BIOS does not support them, and we have to resort to DMI data to specify exact characteristics of the devices (touchpad vs. touchscreen, GPIO to button mapping, etc). Pixel 1 continues to use i2c devices and platform data created by chromeos-laptop driver, since ACPI does not enumerate them. Reviewed-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk> Tested-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2015-04-15	Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6	Linus Torvalds
	Pull crypto update from Herbert Xu: "Here is the crypto update for 4.1: New interfaces: - user-space interface for AEAD - user-space interface for RNG (i.e., pseudo RNG) New hashes: - ARMv8 SHA1/256 - ARMv8 AES - ARMv8 GHASH - ARM assembler and NEON SHA256 - MIPS OCTEON SHA1/256/512 - MIPS img-hash SHA1/256 and MD5 - Power 8 VMX AES/CBC/CTR/GHASH - PPC assembler AES, SHA1/256 and MD5 - Broadcom IPROC RNG driver Cleanups/fixes: - prevent internal helper algos from being exposed to user-space - merge common code from assembly/C SHA implementations - misc fixes" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (169 commits) crypto: arm - workaround for building with old binutils crypto: arm/sha256 - avoid sha256 code on ARMv7-M crypto: x86/sha512_ssse3 - move SHA-384/512 SSSE3 implementation to base layer crypto: x86/sha256_ssse3 - move SHA-224/256 SSSE3 implementation to base layer crypto: x86/sha1_ssse3 - move SHA-1 SSSE3 implementation to base layer crypto: arm64/sha2-ce - move SHA-224/256 ARMv8 implementation to base layer crypto: arm64/sha1-ce - move SHA-1 ARMv8 implementation to base layer crypto: arm/sha2-ce - move SHA-224/256 ARMv8 implementation to base layer crypto: arm/sha256 - move SHA-224/256 ASM/NEON implementation to base layer crypto: arm/sha1-ce - move SHA-1 ARMv8 implementation to base layer crypto: arm/sha1_neon - move SHA-1 NEON implementation to base layer crypto: arm/sha1 - move SHA-1 ARM asm implementation to base layer crypto: sha512-generic - move to generic glue implementation crypto: sha256-generic - move to generic glue implementation crypto: sha1-generic - move to generic glue implementation crypto: sha512 - implement base layer for SHA-512 crypto: sha256 - implement base layer for SHA-256 crypto: sha1 - implement base layer for SHA-1 crypto: api - remove instance when test failed crypto: api - Move alg ref count init to crypto_check_alg ...
2015-04-15	dm crypt: fix deadlock when async crypto algorithm returns -EBUSY	Ben Collins
	I suspect this doesn't show up for most anyone because software algorithms typically don't have a sense of being too busy. However, when working with the Freescale CAAM driver it will return -EBUSY on occasion under heavy -- which resulted in dm-crypt deadlock. After checking the logic in some other drivers, the scheme for crypt_convert() and it's callback, kcryptd_async_done(), were not correctly laid out to properly handle -EBUSY or -EINPROGRESS. Fix this by using the completion for both -EBUSY and -EINPROGRESS. Now crypt_convert()'s use of completion is comparable to af_alg_wait_for_completion(). Similarly, kcryptd_async_done() follows the pattern used in af_alg_complete(). Before this fix dm-crypt would lockup within 1-2 minutes running with the CAAM driver. Fix was regression tested against software algorithms on PPC32 and x86_64, and things seem perfectly happy there as well. Signed-off-by: Ben Collins <ben.c@servergy.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-04-15	dm crypt: leverage immutable biovecs when decrypting on read	Mike Snitzer
	Commit 003b5c571 ("block: Convert drivers to immutable biovecs") stopped short of changing dm-crypt to leverage the fact that the biovec array of a bio will no longer be modified. Switch to using bio_clone_fast() when cloning bios for decryption after read. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm crypt: update URLs to new cryptsetup project page	Milan Broz
	Cryptsetup home page moved to GitLab. Also remove link to abandonded Truecrypt page. Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm: add log writes target	Josef Bacik
	Introduce a new target that is meant for file system developers to test file system integrity at particular points in the life of a file system. We capture all write requests and associated data and log them to a separate device for later replay. There is a userspace utility to do this replay. The idea behind this is to give file system developers a tool to verify that the file system is always consistent. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Zach Brown <zab@zabbo.net> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm table: use bool function return values of true/false not 1/0	Joe Perches
	Use the normal return values for bool functions. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm verity: add error handling modes for corrupted blocks	Sami Tolvanen
	Add device specific modes to dm-verity to specify how corrupted blocks should be handled. The following modes are defined: - DM_VERITY_MODE_EIO is the default behavior, where reading a corrupted block results in -EIO. - DM_VERITY_MODE_LOGGING only logs corrupted blocks, but does not block the read. - DM_VERITY_MODE_RESTART calls kernel_restart when a corrupted block is discovered. In addition, each mode sends a uevent to notify userspace of corruption and to allow further recovery actions. The driver defaults to previous behavior (DM_VERITY_MODE_EIO) and other modes can be enabled with an additional parameter to the verity table. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm delay: use msecs_to_jiffies for time conversion	Nicholas Mc Guire
	Converting milliseconds to jiffies by "val * HZ / 1000" is technically OK but msecs_to_jiffies(val) is the cleaner solution and handles all corner cases correctly. Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm log userspace base: fix compile warning	Nicholas Mc Guire
	This fixes up a compile warning [-Wunused-but-set-variable] - given the comment in userspace_set_region_sync() the non-reporting of errors is intentional so the return value can be dropped to make gcc happy. Also, fix typo in comment. Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm log userspace transfer: match wait_for_completion_timeout return type	Nicholas Mc Guire
	Return type of wait_for_completion_timeout() is unsigned long not int. An appropriately named unsigned long is added and the assignment fixed. Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm table: fall back to getting device using name_to_dev_t()	Dan Ehrenberg
	If a device is used as the root filesystem, it can't be built off of devices which are within the root filesystem (just like command line arguments to root=). For this reason, Linux has a pseudo-filesystem for root= and MD initialization (based on the function name_to_dev_t) which handles different ways of specifying devices including PARTUUID and major:minor. Switch to using name_to_dev_t() in dm_get_device(). Rather than having DM assume that all things which are not major:minor are paths in an already-mounted filesystem, change dm_get_device() to first attempt to look up the device in the filesystem, and if not found it will fall back to using name_to_dev_t(). In terms of backwards compatibility, there are some cases where behavior will be different: - If you have a file in the current working directory named 1:2 and you initialze DM there, then it will try to use that file rather than the disk with that major:minor pair as a backing device. - Similarly for other bdev types which name_to_dev_t() knows how to interpret, the previous behavior was to repeatedly check for the existence of the file (e.g., while waiting for rootfs to come up) but the new behavior is to use the name_to_dev_t() interpretation. For example, if you have a file named /dev/ubiblock0_0 which is a symlink to /dev/sda3, but it is not yet present when DM starts to initialize, then the name_to_dev_t() interpretation will take precedence. These incompatibilities would only show up in really strange setups with bad practices so we shouldn't have to worry about them. Signed-off-by: Dan Ehrenberg <dehrenberg@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr	Mike Snitzer
	Request-based DM's blk-mq support defaults to off; but a user can easily change the default using the dm_mod.use_blk_mq module/boot option. Also, you can check what mode a given request-based DM device is using with: cat /sys/block/dm-X/dm/use_blk_mq This change enabled further cleanup and reduced work (e.g. the md->io_pool and md->rq_pool isn't created if using blk-mq). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq	Mike Snitzer
	dm_mq_queue_rq() is in atomic context so care must be taken to not sleep -- as such GFP_ATOMIC is used for the md->bs bioset allocations and dm-mpath's call to blk_get_request(). In the future the bioset allocations will hopefully go away (by removing support for partial completions of bios in a cloned request). Also prepare for supporting DM blk-mq ontop of old-style request_fn device(s) if a new dm-mod 'use_blk_mq' parameter is set. The kthread will still be used to queue work if blk-mq is used ontop of old-style request_fn device(s). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm: add full blk-mq support to request-based DM	Mike Snitzer
	Commit e5863d9ad ("dm: allocate requests in target when stacking on blk-mq devices") served as the first step toward fully utilizing blk-mq in request-based DM -- it enabled stacking an old-style (request_fn) request_queue ontop of the underlying blk-mq device(s). That first step didn't improve performance of DM multipath ontop of fast blk-mq devices (e.g. NVMe) because the top-level old-style request_queue was severely limited by the queue_lock. The second step offered here enables stacking a blk-mq request_queue ontop of the underlying blk-mq device(s). This unlocks significant performance gains on fast blk-mq devices, Keith Busch tested on his NVMe testbed and offered this really positive news: "Just providing a performance update. All my fio tests are getting roughly equal performance whether accessed through the raw block device or the multipath device mapper (~470k IOPS). I could only push ~20% of the raw iops through dm before this conversion, so this latest tree is looking really solid from a performance standpoint." Signed-off-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Keith Busch <keith.busch@intel.com>
2015-04-15	dm: impose configurable deadline for dm_request_fn's merge heuristic	Mike Snitzer
	Otherwise, for sequential workloads, the dm_request_fn can allow excessive request merging at the expense of increased service time. Add a per-device sysfs attribute to allow the user to control how long a request, that is a reasonable merge candidate, can be queued on the request queue. The resolution of this request dispatch deadline is in microseconds (ranging from 1 to 100000 usecs), to set a 20us deadline: echo 20 > /sys/block/dm-7/dm/rq_based_seq_io_merge_deadline The dm_request_fn's merge heuristic and associated extra accounting is disabled by default (rq_based_seq_io_merge_deadline is 0). This sysfs attribute is not applicable to bio-based DM devices so it will only ever report 0 for them. By allowing a request to remain on the queue it will block others requests on the queue. But introducing a short dequeue delay has proven very effective at enabling certain sequential IO workloads on really fast, yet IOPS constrained, devices to build up slightly larger IOs -- yielding 90+% throughput improvements. Having precise control over the time taken to wait for larger requests to build affords control beyond that of waiting for certain IO sizes to accumulate (which would require a deadline anyway). This knob will only ever make sense with sequential IO workloads and the particular value used is storage configuration specific. Given the expected niche use-case for when this knob is useful it has been deemed acceptable to expose this relatively crude method for crafting optimal IO on specific storage -- especially given the solution is simple yet effective. In the context of DM multipath, it is advisable to tune this sysfs attribute to a value that offers the best performance for the common case (e.g. if 4 paths are expected active, tune for that; if paths fail then performance may be slightly reduced). Alternatives were explored to have request-based DM autotune this value (e.g. if/when paths fail) but they were quickly deemed too fragile and complex to warrant further design and development time. If this problem proves more common as faster storage emerges we'll have to look at elevating a generic solution into the block core. Tested-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm sysfs: introduce ability to add writable attributes	Mike Snitzer
	Add DM_ATTR_RW() macro and establish .store method in dm_sysfs_ops. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm: don't start current request if it would've merged with the previous	Mike Snitzer
	Request-based DM's dm_request_fn() is so fast to pull requests off the queue that steps need to be taken to promote merging by avoiding request processing if it makes sense. If the current request would've merged with previous request let the current request stay on the queue longer. Suggested-by: Jens Axboe <axboe@fb.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms	Mike Snitzer
	Commit 7eaceaccab ("block: remove per-queue plugging") didn't justify DM's use of a 100ms delay; such an extended delay is a liability when there is reason to re-kick the queue. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm: don't schedule delayed run of the queue if nothing to do	Mike Snitzer
	In request-based DM's dm_request_fn(), if blk_peek_request() returns NULL just return. Avoids unnecessary blk_delay_queue(). Reported-by: Jens Axboe <axboe@fb.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm: only run the queue on completion if congested or no requests pending	Mike Snitzer
	On really fast storage it can be beneficial to delay running the request_queue to allow the elevator more opportunity to merge requests. Otherwise, it has been observed that requests are being sent to q->request_fn much quicker than is ideal on IOPS-bound backends. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	dm: remove request-based logic from make_request_fn wrapper	Mike Snitzer
	The old dm_request() method used for q->make_request_fn had a branch for request-based DM support but it isn't needed given that dm_init_request_based_queue() sets it to the standard blk_queue_bio() anyway. Cleanup dm_init_md_queue() to be DM device-type agnostic and have dm_setup_md_queue() properly finish queue setup based on DM device-type (bio-based vs request-based). A followup block patch can be made to remove the export for blk_queue_bio() now that DM no longer calls it directly. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next	Linus Torvalds
	Pull networking updates from David Miller: 1) Add BQL support to via-rhine, from Tino Reichardt. 2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers can support hw switch offloading. From Floria Fainelli. 3) Allow 'ip address' commands to initiate multicast group join/leave, from Madhu Challa. 4) Many ipv4 FIB lookup optimizations from Alexander Duyck. 5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel Borkmann. 6) Remove the ugly compat support in ARP for ugly layers like ax25, rose, etc. And use this to clean up the neigh layer, then use it to implement MPLS support. All from Eric Biederman. 7) Support L3 forwarding offloading in switches, from Scott Feldman. 8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed up route lookups even further. From Alexander Duyck. 9) Many improvements and bug fixes to the rhashtable implementation, from Herbert Xu and Thomas Graf. In particular, in the case where an rhashtable user bulk adds a large number of items into an empty table, we expand the table much more sanely. 10) Don't make the tcp_metrics hash table per-namespace, from Eric Biederman. 11) Extend EBPF to access SKB fields, from Alexei Starovoitov. 12) Split out new connection request sockets so that they can be established in the main hash table. Much less false sharing since hash lookups go direct to the request sockets instead of having to go first to the listener then to the request socks hashed underneath. From Eric Dumazet. 13) Add async I/O support for crytpo AF_ALG sockets, from Tadeusz Struk. 14) Support stable privacy address generation for RFC7217 in IPV6. From Hannes Frederic Sowa. 15) Hash network namespace into IP frag IDs, also from Hannes Frederic Sowa. 16) Convert PTP get/set methods to use 64-bit time, from Richard Cochran. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits) fm10k: Bump driver version to 0.15.2 fm10k: corrected VF multicast update fm10k: mbx_update_max_size does not drop all oversized messages fm10k: reset head instead of calling update_max_size fm10k: renamed mbx_tx_dropped to mbx_tx_oversized fm10k: update xcast mode before synchronizing multicast addresses fm10k: start service timer on probe fm10k: fix function header comment fm10k: comment next_vf_mbx flow fm10k: don't handle mailbox events in iov_event path and always process mailbox fm10k: use separate workqueue for fm10k driver fm10k: Set PF queues to unlimited bandwidth during virtualization fm10k: expose tx_timeout_count as an ethtool stat fm10k: only increment tx_timeout_count in Tx hang path fm10k: remove extraneous "Reset interface" message fm10k: separate PF only stats so that VF does not display them fm10k: use hw->mac.max_queues for stats fm10k: only show actual queues, not the maximum in hardware fm10k: allow creation of VLAN on default vid fm10k: fix unused warnings ...
2015-04-15	i2c: jz4780: Fix build for m68k and sparc64	Guenter Roeck
	Fix: drivers/i2c/busses/i2c-jz4780.c: In function 'jz4780_i2c_readw': drivers/i2c/busses/i2c-jz4780.c:181:2: error: implicit declaration of function 'readw' drivers/i2c/busses/i2c-jz4780.c: In function 'jz4780_i2c_writew': drivers/i2c/busses/i2c-jz4780.c:187:2: error: implicit declaration of function 'writew' seen with sparc64:allmodconfig and m68k:allmodconfig. The driver has to include linux/io.h. Fixes: ba92222ed63a ("i2c: jz4780: Add i2c bus controller driver for Ingenic JZ4780") Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2015-04-15	s390/dasd: Fix unresumed device after suspend/resume having no paths	Stefan Haberland
	The DASD device driver prevents I/O from being started on stopped devices. This also prevented channel paths to be verified and so the device was unable to be resumed. Fix by allowing path verification requests on stopped devices. Signed-off-by: Stefan Haberland <stefan.haberland@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-04-15	s390/dasd: fix unresumed device after suspend/resume	Stefan Haberland
	The DASD device driver only has a limited amount of memory to build I/O requests. This memory was used by blocklayer requests leading to an inability to build needed internal requests to resume the device. Fix by preventing the DASD driver to fetch requests for a stopped device. Signed-off-by: Stefan Haberland <stefan.haberland@de.ibm.com> Reference-ID: RQM 2520 Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-04-15	s390/dasd: fix inability to set a DASD device offline	Stefan Haberland
	Fix ref counting for DASD devices leading to an inability to set a DASD device offline. Before a worker is scheduled the DASD device driver takes a reference to the device. If the worker was already scheduled this reference was never freed. Fix by giving the reference to the DASD device free when schedule_work() returns false. Signed-off-by: Stefan Haberland <stefan.haberland@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-04-15	s390/mm: Fix memory hotplug for unaligned standby memory	Michael Holzheu
	Commit 27356f54c8c3 ("mm/hotplug: verify hotplug memory range") introduced a check that makes add_memory() only accept section size aligned memory. Therefore on z/VM systems, where standby memory is not aligned, no standby memory is registered at all. Example: #cp def store 3504M standby 2336M 00: CP Q V STORE 00: STORAGE = 3504M MAX = 6G INC = 8M STANDBY = 2336M RESERVED = 0 For this setup the following error message is printed: Section-unaligned hotplug range: start 0xdb000000, size 0x92000000 So fix this and register aligned memory in "sclp_cmd.c". This means that for the corner cases where the standby memory is not aligned we loose some memory. In order to inform the user about the potential loss of standby memory, we add a new message for each added standby block and print how much of the standby memory is usable, for example: sclp_cmd.4336b4: Standby memory at 0x50000000 (256M of 256M usable) sclp_cmd.4336b4: Standby memory at 0xb0000000 (256M of 256M usable) sclp_cmd.4336b4: Standby memory at 0xdb000000 (2048M of 2336M usable) We also ensure that a potential memory block that contains both "assigned" and "standby" memory cannot be setup offline. Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-04-15	xen/pci: Try harder to get PXM information for Xen	Ross Lagerwall
	If the device being added to Xen is not contained in the ACPI table, walk the PCI device tree to find a parent that is contained in the ACPI table before finding the PXM information from this device. Previously, it would try to get a handle for the device, then the device's bridge, then the physfn. This changes the order so that it tries to get a handle for the device, then the physfn, the walks up the PCI device tree. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-04-15	xenbus_client: Extend interface to support multi-page ring	Wei Liu
	Originally Xen PV drivers only use single-page ring to pass along information. This might limit the throughput between frontend and backend. The patch extends Xenbus driver to support multi-page ring, which in general should improve throughput if ring is the bottleneck. Changes to various frontend / backend to adapt to the new interface are also included. Affected Xen drivers: * blkfront/back * netfront/back * pcifront/back * scsifront/back * vtpmfront The interface is documented, as before, in xenbus_client.c. Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Bob Liu <bob.liu@oracle.com> Cc: Konrad Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>