linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2017-02-20	nfp: store NSP ABI version in state structure	Jakub Kicinski
	We read the status register on each NSP open, we can store the NSP ABI version in the state structure so that we don't have to read it again. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	nfp: report manufacturing info on load	Jakub Kicinski
	Report card manufacturing information when driver loads. These identify the version of the board and its subcomponents. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	nfp: refactor NSP initialization and add error message	Jakub Kicinski
	When acquiring NSP communication resource fails user is left with "probe failed with error -2" PCI code message but no info on what caused the problem. Some development boards may not have NSP FW in the flash image. Help users with a more verbouse message. While at it move the whole NSP init to a separate function to keep .probe() callback nice and simple. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	Merge branch 'for-upstream' of ↵	David S. Miller
	git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2017-02-19 Here's a set of Bluetooth patches for the 4.11 kernel: - New USB IDs to the btusb driver - Race fix in btmrvl driver - Added out-of-band wakeup support to the btusb driver - NULL dereference fix to bt_sock_recvmsg Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	mm: use helper for calling f_op->mmap()	Miklos Szeredi
	Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-20	vfs: use helpers for calling f_op->{read,write}_iter()	Miklos Szeredi
	Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-20	qlcnic: Fix a memory leak in error handling path	Christophe Jaillet
	If 'dma_alloc_coherent()' fails, we should release resources allocated so far, just as done in all other cases in this function. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	net: mvpp2: Fix a memory leak in error handling path	Christophe Jaillet
	if 'devm_kzalloc()' fails, we should release resources allocated so far, just as done a few lines below. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	mlx4: reduce OOM risk on arches with large pages	Eric Dumazet
	Since mlx4 NIC are used on PowerPC with 64K pages, we need to adapt MLX4_EN_ALLOC_PREFER_ORDER definition. Otherwise, a fragment sitting in an out of order TCP queue can hold 0.5 Mbytes and it is a serious OOM risk. Fixes: 51151a16a60f ("mlx4: allow order-0 memory allocations in RX path") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	Merge branch '40GbE' of ↵	David S. Miller
	git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 40GbE Intel Wired LAN Driver Updates 2017-02-18 This series contains updates to i40e and i40evf only. Alan fixes a bug in which the driver is unable to exit overflow promiscuous mode after having added "too many" mac filters. Ractored the '%*ph' printk format specifier to instead use the print_hex_dump(). Josh adds enabling multicast magic packet wakeup by adding calls to the mac_address_write admin q function during power down to update the PRTPM_SAH/SAL registers with the MC_MAG_EN bit. Jake remove a duplicate call i40e_update_link_info(), since it does not need to call it twice. Fixes and issue where we calculating the wrong switch id on big endian platforms. Avoided sparse warning, by doing a typecast to ensure the value is of the type expected by csum_replace_by_diff(). Mitch fixes a memory leak by freeing resources during i40e_remove(). Cleans up some code confusion by adding a proper code comment. Carolyn fixes a bug introduced with the addition of the per queue ITR feature support in ethtool. Cleans up a duplicate device id from the PCI table. Harshitha fixes a bug which causes the 'Link Detected' field in ethtool to report the correct link status. Benjamin Poirier from SuSE applies a fix ec13ee80145c ("virtio_net: invoke softirqs after __napi_schedule") to i40e driver as well. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	net: qlogic: qla3xxx: use new api ethtool_{get\|set}_link_ksettings	Philippe Reynes
	The ethtool api {get\|set}_settings is deprecated. We move this driver to new api {get\|set}_link_ksettings. As I don't have the hardware, I'd be very pleased if someone may test this patch. Signed-off-by: Philippe Reynes <tremyfr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	fsl/fman: fix spelling mistake in variable name en_tsu_err_exeption	Colin Ian King
	trivial fix to spelling mistake, en_tsu_err_exeption should be en_tsu_err_exception Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	net: aquantia: remove function aq_ring_tx_deinit	Lino Sanfilippo
	Both functions aq_ring_rx_deinit() and aq_ring_tx_clean() are almost identical aside from an additional check in the latter. Move that check from the function into its caller and replace aq_ring_rx_deinit() with aq_ring_rx_deinit(). By doing this also adjust the functions return value from int to void since it can never fail. Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Tested-by: Pavel Belous <pavel.belous@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	net: ena: remove superfluous check in ena_remove()	Lino Sanfilippo
	The check in ena_remove() for the pci driver data not being NULL is not needed, since it is always set in the probe() function. Remove the superfluous check. Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	net: phy: Check phydev->drv	Florian Fainelli
	There are number of function calls, originating from user-space, typically through the Ethernet driver that can make us crash by dereferencing phydev->drv which will be NULL once we unbind the driver from the PHY. There are still functional issues that prevent an unbind then rebind to work, but these will be addressed separately. Suggested-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	net: phy: Fix PHY unbind crash	Florian Fainelli
	The PHY library does not deal very well with bind and unbind events. The first thing we would see is that we were not properly canceling the PHY state machine workqueue, so we would be crashing while dereferencing phydev->drv since there is no driver attached anymore. Suggested-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20	Merge branches 'for-4.10/upstream-fixes', 'for-4.11/intel-ish', ↵	Jiri Kosina
	'for-4.11/mayflash', 'for-4.11/microsoft', 'for-4.11/rmi', 'for-4.11/upstream' and 'for-4.11/wacom' into for-linus
2017-02-20	Merge branches 'acpi-ec', 'acpi-button' and 'acpi-apei'	Rafael J. Wysocki
	* acpi-ec: ACPI / EC: Use busy polling mode when GPE is not enabled ACPI / EC: Remove old CLEAR_ON_RESUME quirk * acpi-button: ACPI / button: Remove lid_init_state=method mode ACPI / button: Change default behavior to lid_init_state=open * acpi-apei: ACPI, APEI, EINJ: fix malformed newline escape
2017-02-20	Merge branches 'acpi-bus', 'acpi-sleep' and 'acpi-processor'	Rafael J. Wysocki
	* acpi-bus: spi: acpi: Initialize modalias from of_compatible i2c: acpi: Initialize info.type from of_compatible ACPI / bus: Introduce acpi_of_modalias() equiv of of_modalias_node() * acpi-sleep: ACPI: save NVS memory for Lenovo G50-45 * acpi-processor: x86/ACPI: keep x86_cpu_to_acpiid mapping valid on CPU hotplug
2017-02-20	Merge branch 'acpica'	Rafael J. Wysocki
	* acpica: (22 commits) ACPICA: Update version to 20170119 ACPICA: Tools: Update common signon, remove compilation bit width ACPICA: Source tree: Update copyright notices to 2017 ACPICA: Linuxize: Restore and fix Intel compiler build ACPICA: Update version to 20161222 ACPICA: Parser: Update parse info table for some operators ACPICA: Fix a problem with recent extra support for control method invocations ACPICA: Parser: Allow method invocations as target operands ACPICA: Fix for implicit result conversion for the ToXXX functions ACPICA: Resources: Not a valid resource if buffer length too long ACPICA: Utilities: Update debug output ACPICA: Disassembler: Add Switch/Case disassembly support ACPICA: EFI: Add efihello demo application ACPICA: MSVC: Fix MSVC6 build issues ACPICA: Linux-specific header: Add support for s390x compilation ACPICA: Hardware: Add sleep register hooks ACPICA: Macro header: Fix some typos in comments ACPICA: Hardware: Sort access bit width algorithm ACPICA: Utilities: Add power of two rounding support ACPICA: Hardware: Add access_width/bit_offset support in acpi_hw_write() ...
2017-02-20	Merge branches 'pm-core', 'pm-qos' and 'pm-domains'	Rafael J. Wysocki
	* pm-core: PM / wakeirq: report a wakeup_event on dedicated wekup irq PM / wakeirq: Fix spurious wake-up events for dedicated wakeirqs PM / wakeirq: Enable dedicated wakeirq for suspend * pm-qos: PM / QoS: Fix memory leak on resume_latency.notifiers PM / QoS: Remove unneeded linux/miscdevice.h include * pm-domains: PM / Domains: Provide dummy governors if CONFIG_PM_GENERIC_DOMAINS=n PM / Domains: Fix asynchronous execution of *noirq() callbacks PM / Domains: Correct comment in irq_safe_dev_in_no_sleep_domain() PM / Domains: Rename functions in genpd for power on/off
2017-02-20	Merge branch 'pm-devfreq'	Rafael J. Wysocki
	* pm-devfreq: PM / devfreq: Modify the device name as devfreq(X) for sysfs PM / devfreq: Simplify the sysfs name of devfreq-event device PM / devfreq: Remove unnecessary separate _remove_devfreq() PM / devfreq: Fix wrong trans_stat of passive devfreq device PM / devfreq: Fix available_governor sysfs PM / devfreq: exynos-ppmu: Show the registred device for ppmu device PM / devfreq: Fix the wrong description for userspace governor PM / devfreq: Fix the checkpatch warnings PM / devfreq: exynos-bus: Print the real clock rate of bus PM / devfreq: exynos-ppmu: Use the regmap interface to handle the registers PM / devfreq: exynos-bus: Add the detailed correlation for Exynos5433 PM / devfreq: Don't delete sysfs group twice
2017-02-20	Merge branch 'pm-cpuidle'	Rafael J. Wysocki
	* pm-cpuidle: CPU / PM: expose pm_qos_resume_latency for CPUs cpuidle/menu: add per CPU PM QoS resume latency consideration cpuidle/menu: stop seeking deeper idle if current state is deep enough ACPI / idle: small formatting fixes
2017-02-20	Merge branch 'pm-cpufreq'	Rafael J. Wysocki
	* pm-cpufreq: (28 commits) MAINTAINERS: cpufreq: add bmips-cpufreq.c cpufreq: CPPC: add ACPI_PROCESSOR dependency cpufreq: make ti-cpufreq explicitly non-modular cpufreq: Do not clear real_cpus mask on policy init cpufreq: dt: Don't use generic platdev driver for ti-cpufreq platforms cpufreq: ti: Add cpufreq driver to determine available OPPs at runtime Documentation: dt: add bindings for ti-cpufreq cpufreq: qoriq: Don't look at clock implementation details cpufreq: qoriq: add ARM64 SoCs support cpufreq: brcmstb-avs-cpufreq: remove unnecessary platform_set_drvdata() cpufreq: s3c2416: double free on driver init error path MIPS: BMIPS: enable CPUfreq cpufreq: bmips-cpufreq: CPUfreq driver for Broadcom's BMIPS SoCs BMIPS: Enable prerequisites for CPUfreq in MIPS Kconfig. MIPS: BMIPS: Update defconfig cpufreq: Fix typos in comments cpufreq: intel_pstate: Calculate guaranteed performance for HWP cpufreq: intel_pstate: Make HWP limits compatible with legacy cpufreq: intel_pstate: Lower frequency than expected under no_turbo cpufreq: intel_pstate: Operation mode control from sysfs ...
2017-02-20	Merge branch 'pm-opp'	Rafael J. Wysocki
	* pm-opp: (24 commits) PM / OPP: Expose _of_get_opp_desc_node as dev_pm_opp API PM / OPP: Make _find_opp_table_unlocked() static PM / OPP: Update Documentation to remove RCU specific bits PM / OPP: Simplify dev_pm_opp_get_max_volt_latency() PM / OPP: Simplify _opp_set_availability() PM / OPP: Move away from RCU locking PM / OPP: Take kref from _find_opp_table() PM / OPP: Update OPP users to put reference PM / OPP: Add 'struct kref' to struct dev_pm_opp PM / OPP: Use dev_pm_opp_get_opp_table() instead of _add_opp_table() PM / OPP: Take reference of the OPP table while adding/removing OPPs PM / OPP: Return opp_table from dev_pm_opp_set_*() routines PM / OPP: Add 'struct kref' to OPP table PM / OPP: Add per OPP table mutex PM / OPP: Split out part of _add_opp_table() and _remove_opp_table() PM / OPP: Don't expose srcu_head to register notifiers PM / OPP: Rename dev_pm_opp_get_suspend_opp() and return OPP rate PM / OPP: Don't allocate OPP table from _opp_allocate() PM / OPP: Rename and split _dev_pm_opp_remove_table() PM / OPP: Add light weight _opp_free() routine ...
2017-02-20	video: fbdev: fsl-diu-fb: fix spelling mistake "palette"	Colin Ian King
	trivial fix to spelling mistakes of "palette" Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Timur Tabi <timur@tabi.org> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
2017-02-20	fbdev: ssd1307fb: include linux/gpio/consumer.h	Arnd Bergmann
	Changing this driver to the gpiod API requires the use of the new-style header, depending on the configuration: drivers/video/fbdev/ssd1307fb.c: In function 'ssd1307fb_probe': drivers/video/fbdev/ssd1307fb.c:569:15: error: implicit declaration of function 'devm_gpiod_get_optional';did you mean 'devm_regulator_get_optional'? [-Werror=implicit-function-declaration] drivers/video/fbdev/ssd1307fb.c:570:11: error: 'GPIOD_OUT_LOW' undeclared (first use in this function) Fixes: 72db33355c14 ("fbdev: ssd1307fb: Start to use gpiod API for reset gpio") Cc: Tomi Valkeinen <tomi.valkeinen@ti.com> Cc: Jyri Sarha <jsarha@ti.com> Cc: LABBE Corentin <clabbe.montjoie@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
2017-02-20	rbd: constify device_type structure	Bhumika Goyal
	Declare device_type structure as const as it is only stored in the type field of a device structure. This field is of type const, so add const to the declaration of device_type structure. File size before: text data bss dec hex filename 61546 11610 208 73364 11e94 drivers/block/rbd.o File size after: text data bss dec hex filename 61610 11578 208 73396 11eb4 drivers/block/rbd.o Signed-off-by: Bhumika Goyal <bhumirks@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20	s390/zcrypt: make ap_bus explicitly non-modular	Paul Gortmaker
	The Makefile in drivers/s390 has: obj-y += cio/ block/ char/ crypto/ net/ scsi/ virtio/ and the Makefile in crypto/ has: ap-objs := ap_bus.o ap_card.o ap_queue.o meaning that it currently is not being built as a module by anyone. Lets remove the modular code that is essentially orphaned, so that when reading the driver there is no doubt it is builtin-only. Since module_init translates to device_initcall in the non-modular case, the init ordering remains unchanged with this commit. Also note that MODULE_ALIAS is a no-op for non-module builds. We also delete the MODULE_LICENSE tag etc. since all that information is already contained at the top of the file in the comments. We replace module.h with moduleparam.h since the file does declare some module parameters even though it is not modular itself. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Harald Freudenberger <freude@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-02-20	s390/zcrypt: Removed unneeded debug feature directory creation.	Harald Freudenberger
	The ap bus code and the zcrypt api had invocations to the debug feature debugfs_create_dir() call but never populated these directories in any way. Removed this unneeded code. Signed-off-by: Harald Freudenberger <freude@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-02-20	tpm: declare tpm2_get_pcr_allocation() as static	Jarkko Sakkinen
	There's no need to export tpm2_get_pcr_alloation() because it is only a helper function for tpm2_auto_startup(). For the same reason it does not make much sense to maintain documentation for it. Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
2017-02-20	rbd: kill obj_request->object_name and rbd_segment_name_cache	Ilya Dryomov
	Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20	rbd: store and use obj_request->object_no	Ilya Dryomov
	object_no can be trivially formatted into an object name. We already store object names in OSD requests with special care to avoid dynamic allocations for short names. Storing a name in obj_request, obtained as below (!), is a waste and will be removed in the next commit. name = kmem_cache_alloc(rbd_segment_name_cache, ...); snprintf(name, ...); obj_request->object_name = kstrdup(name); kmem_cache_free(rbd_segment_name_cache, name); ... ceph_oid_aprintf(..., "%s", obj_request->object_name); Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20	rbd: RBD_V{1,2}_DATA_FORMAT macros	Ilya Dryomov
	... and also fix up the comment -- format 1 data objects have always been 12 hex digits long. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20	rbd: factor out __rbd_osd_req_create()	Ilya Dryomov
	Factor OSD request allocation and initialization code out into __rbd_osd_req_create(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20	rbd: set offset and length outside of rbd_obj_request_create()	Ilya Dryomov
	The allocation doesn't depend on offset and length. Both offset and length can be changed after obj_request is allocated, too. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20	rbd: support for data-pool feature	Ilya Dryomov
	Add support for RBD_FEATURE_DATA_POOL feature. rbd_dev->layout.pool_id now stores the data pool id. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20	rbd: introduce rbd_init_layout()	Ilya Dryomov
	Rather than initializing layout fields with some made up values in __rbd_dev_create(), move the initialization into rbd_init_layout() and call it after the header is actually populated. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20	rbd: use rbd_obj_bytes() more	Ilya Dryomov
	Returning u64 doesn't make sense: max header->obj_order is 25 and ceph_file_layout::object_size is u32. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20	rbd: remove now unused rbd_obj_request_wait() and helpers	Ilya Dryomov
	Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20	rbd: switch rbd_obj_method_sync() to ceph_osdc_call()	Ilya Dryomov
	As explained in the previous commit, rbd_obj_request machinery (and rbd_osd_req_create() in particular) shouldn't be used for working with metadata objects. Switch to the recently added ceph_osdc_call(). It assumes single pages for outbound and inbound buffers, but that's OK - none of the callers need more than that. These pages need to be allocated (messenger is in dire need of proper iterator interface!), but we are swapping for pages[] and pagelist allocations in the existing code. Kill class_name argument - all rbd methods are under "rbd". Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20	rbd: do away with obj_request in rbd_obj_read_sync()	Ilya Dryomov
	rbd_obj_request machinery is completely unnecessary here; all that's being done is fetching a metadata object - no striping, cloning, etc. More importantly, rbd_osd_req_create() grabs pool id from layout and that is becoming a data pool id. Kill offset argument - all metadata objects are small and read in full. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20	rbd: initialize rbd_dev->header_oloc early	Ilya Dryomov
	No reason to delay it until image_id is known. This will be required by some rbd_obj_method_sync() callers, after rbd_obj_method_sync() is changed to take oloc. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20	rbd: kill rbd_image_header::{crypt_type,comp_type}	Ilya Dryomov
	Image format 1 is deprecated and format 2 doesn't have these. Also, __rbd_dev_create() takes care of zeroing (or otherwise initializing) format 2 specific fields. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20	rbd: use kstrndup() in rbd_header_from_disk()	Ilya Dryomov
	Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-19	md/raid1: fix a use-after-free bug	Shaohua Li
	Commit fd76863 (RAID1: a new I/O barrier implementation to remove resync window) introduces a user-after-free bug. Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-19	RAID1: avoid unnecessary spin locks in I/O barrier code	colyli@suse.de
	When I run a parallel reading performan testing on a md raid1 device with two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is only 2.7GB/s, this is around 50% of the idea performance number. The perf reports locking contention happens at allow_barrier() and wait_barrier() code, - 41.41% fio [kernel.kallsyms] [k] _raw_spin_lock_irqsave - _raw_spin_lock_irqsave + 89.92% allow_barrier + 9.34% __wake_up - 37.30% fio [kernel.kallsyms] [k] _raw_spin_lock_irq - _raw_spin_lock_irq - 100.00% wait_barrier The reason is, in these I/O barrier related functions, - raise_barrier() - lower_barrier() - wait_barrier() - allow_barrier() They always hold conf->resync_lock firstly, even there are only regular reading I/Os and no resync I/O at all. This is a huge performance penalty. The solution is a lockless-like algorithm in I/O barrier code, and only holding conf->resync_lock when it has to. The original idea is from Hannes Reinecke, and Neil Brown provides comments to improve it. I continue to work on it, and make the patch into current form. In the new simpler raid1 I/O barrier implementation, there are two wait barrier functions, - wait_barrier() Which calls _wait_barrier(), is used for regular write I/O. If there is resync I/O happening on the same I/O barrier bucket, or the whole array is frozen, task will wait until no barrier on same barrier bucket, or the whold array is unfreezed. - wait_read_barrier() Since regular read I/O won't interfere with resync I/O (read_balance() will make sure only uptodate data will be read out), it is unnecessary to wait for barrier in regular read I/Os, waiting in only necessary when the whole array is frozen. The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf-> barrier[idx] are very carefully designed in raise_barrier(), lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to avoid unnecessary spin locks in these functions. Once conf-> nr_pengding[idx] is increased, a resync I/O with same barrier bucket index has to wait in raise_barrier(). Then in _wait_barrier() if no barrier raised in same barrier bucket index and array is not frozen, the regular I/O doesn't need to hold conf->resync_lock, it can just increase conf->nr_pending[idx], and return to its caller. wait_read_barrier() is very similar to _wait_barrier(), the only difference is it only waits when array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier code almostly gets rid of all spin lock cost. This patch significantly improves raid1 reading peroformance. From my testing, a raid1 device built by two NVMe SSD, runs fio with 64KB blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput increases from 2.7GB/s to 4.6GB/s (+70%). Changelog V4: - Change conf->nr_queued[] to atomic_t. - Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t))) V3: - Add smp_mb__after_atomic() as Shaohua and Neil suggested. - Change conf->nr_queued[] from atomic_t to int. - Change conf->array_frozen from atomic_t back to int, and use READ_ONCE(conf->array_frozen) to check value of conf->array_frozen in _wait_barrier() and wait_read_barrier(). - In _wait_barrier() and wait_read_barrier(), add a call to wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]), to fix a deadlock between _wait_barrier()/wait_read_barrier and freeze_array(). V2: - Remove a spin_lock/unlock pair in raid1d(). - Add more code comments to explain why there is no racy when checking two atomic_t variables at same time. V1: - Original RFC patch for comments. Signed-off-by: Coly Li <colyli@suse.de> Cc: Shaohua Li <shli@fb.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-19	RAID1: a new I/O barrier implementation to remove resync window	colyli@suse.de
	'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")' introduces a sliding resync window for raid1 I/O barrier, this idea limits I/O barriers to happen only inside a slidingresync window, for regular I/Os out of this resync window they don't need to wait for barrier any more. On large raid1 device, it helps a lot to improve parallel writing I/O throughput when there are background resync I/Os performing at same time. The idea of sliding resync widow is awesome, but code complexity is a challenge. Sliding resync window requires several variables to work collectively, this is complexed and very hard to make it work correctly. Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix the original resync window patch. This is not the end, any further related modification may easily introduce more regreassion. Therefore I decide to implement a much simpler raid1 I/O barrier, by removing resync window code, I believe life will be much easier. The brief idea of the simpler barrier is, - Do not maintain a global unique resync window - Use multiple hash buckets to reduce I/O barrier conflicts, regular I/O only has to wait for a resync I/O when both them have same barrier bucket index, vice versa. - I/O barrier can be reduced to an acceptable number if there are enough barrier buckets Here I explain how the barrier buckets are designed, - BARRIER_UNIT_SECTOR_SIZE The whole LBA address space of a raid1 device is divided into multiple barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE. Bio requests won't go across border of barrier unit size, that means maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes. For random I/O 64MB is large enough for both read and write requests, for sequential I/O considering underlying block layer may merge them into larger requests, 64MB is still good enough. Neil also points out that for resync operation, "we want the resync to move from region to region fairly quickly so that the slowness caused by having to synchronize with the resync is averaged out over a fairly small time frame". For full speed resync, 64MB should take less then 1 second. When resync is competing with other I/O, it could take up a few minutes. Therefore 64MB size is fairly good range for resync. - BARRIER_BUCKETS_NR There are BARRIER_BUCKETS_NR buckets in total, which is defined by, #define BARRIER_BUCKETS_NR_BITS (PAGE_SHIFT - 2) #define BARRIER_BUCKETS_NR (1<<BARRIER_BUCKETS_NR_BITS) this patch makes the bellowed members of struct r1conf from integer to array of integers, - int nr_pending; - int nr_waiting; - int nr_queued; - int barrier; + int nr_pending; + int nr_waiting; + int nr_queued; + int barrier; number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O barrier buckets, and each array of integers occupies single memory page. 1024 means for a request which is smaller than the I/O barrier unit size has ~0.1% chance to wait for resync to pause, which is quite a small enough fraction. Also requesting single memory page is more friendly to kernel page allocator than larger memory size. - I/O barrier bucket is indexed by bio start sector If multiple I/O requests hit different I/O barrier units, they only need to compete I/O barrier with other I/Os which hit the same I/O barrier bucket index with each other. The index of a barrier bucket which a bio should look for is calculated by sector_to_idx() which is defined in raid1.h as an inline function, static inline int sector_to_idx(sector_t sector) { return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS, BARRIER_BUCKETS_NR_BITS); } Here sector_nr is the start sector number of a bio. - Single bio won't go across boundary of a I/O barrier unit If a request goes across boundary of barrier unit, it will be split. A bio may be split in raid1_make_request() or raid1_sync_request(), if sectors returned by align_to_barrier_unit_end() is smaller than original bio size. Comparing to single sliding resync window, - Currently resync I/O grows linearly, therefore regular and resync I/O will conflict within a single barrier units. So the I/O behavior is similar to single sliding resync window. - But a barrier unit bucket is shared by all barrier units with identical barrier uinit index, the probability of conflict might be higher than single sliding resync window, in condition that writing I/Os always hit barrier units which have identical barrier bucket indexs with the resync I/Os. This is a very rare condition in real I/O work loads, I cannot imagine how it could happen in practice. - Therefore we can achieve a good enough low conflict rate with much simpler barrier algorithm and implementation. There are two changes should be noticed, - In raid1d(), I change the code to decrease conf->nr_pending[idx] into single loop, it looks like this, spin_lock_irqsave(&conf->device_lock, flags); conf->nr_queued[idx]--; spin_unlock_irqrestore(&conf->device_lock, flags); This change generates more spin lock operations, but in next patch of this patch set, it will be replaced by a single line code, atomic_dec(&conf->nr_queueud[idx]); So we don't need to worry about spin lock cost here. - Mainline raid1 code split original raid1_make_request() into raid1_read_request() and raid1_write_request(). If the original bio goes across an I/O barrier unit size, this bio will be split before calling raid1_read_request() or raid1_write_request(), this change the code logic more simple and clear. - In this patch wait_barrier() is moved from raid1_make_request() to raid1_write_request(). In raid_read_request(), original wait_barrier() is replaced by raid1_read_request(). The differnece is wait_read_barrier() only waits if array is frozen, using different barrier function in different code path makes the code more clean and easy to read. Changelog V4: - Add alloc_r1bio() to remove redundant r1bio memory allocation code. - Fix many typos in patch comments. - Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS. V3: - Rebase the patch against latest upstream kernel code. - Many fixes by review comments from Neil, - Back to use pointers to replace arraries in struct r1conf - Remove total_barriers from struct r1conf - Add more patch comments to explain how/why the values of BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided. - Use get_unqueued_pending() to replace get_all_pendings() and get_all_queued() - Increase bucket number from 512 to 1024 - Change code comments format by review from Shaohua. V2: - Use bio_split() to split the orignal bio if it goes across barrier unit bounday, to make the code more simple, by suggestion from Shaohua and Neil. - Use hash_long() to replace original linear hash, to avoid a possible confilict between resync I/O and sequential write I/O, by suggestion from Shaohua. - Add conf->total_barriers to record barrier depth, which is used to control number of parallel sync I/O barriers, by suggestion from Shaohua. - In V1 patch the bellowed barrier buckets related members in r1conf are allocated in memory page. To make the code more simple, V2 patch moves the memory space into struct r1conf, like this, - int nr_pending; - int nr_waiting; - int nr_queued; - int barrier; + int nr_pending[BARRIER_BUCKETS_NR]; + int nr_waiting[BARRIER_BUCKETS_NR]; + int nr_queued[BARRIER_BUCKETS_NR]; + int barrier[BARRIER_BUCKETS_NR]; This change is by the suggestion from Shaohua. - Remove some inrelavent code comments, by suggestion from Guoqing. - Add a missing wait_barrier() before jumping to retry_write, in raid1_make_write_request(). V1: - Original RFC patch for comments Signed-off-by: Coly Li <colyli@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-19	of_mdio: Add "broadcom,bcm5241" to the whitelist.	David Daney
	Some Cavium dev boards have firmware which doesn't supply a proper ethernet-phy-ieee802.3-c22" compatible property. Restore these boards to working order by whitelisting this compatible value. Signed-off-by: David Daney <david.daney@cavium.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-19	net: ethernet: stmmac: dwmac-rk: Add RK3328 gmac support	david.wu
	Add constants and callback functions for the dwmac on rk3328 socs. As can be seen, the base structure is the same, only registers and the bits in them moved slightly. Signed-off-by: david.wu <david.wu@rock-chips.com> Signed-off-by: David S. Miller <davem@davemloft.net>