summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2019-04-09memory: ti-emif-sram: Add ti_emif_run_hw_leveling for DDR3 hardware levelingDave Gerlach
In certain situations, such as when returning from low power modes, the EMIF must re-run hardware leveling to properly restore DDR3 access. This is accomplished by introducing a new ti-emif-sram-pm call, ti_emif_run_hw_leveling, to check if DDR3 is in use and if so, trigger the full write and read leveling processes. Suggested-by: Brad Griffis <bgriffis@ti.com> Signed-off-by: Dave Gerlach <d-gerlach@ti.com> Acked-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Tony Lindgren <tony@atomide.com>
2019-04-09Merge branches 'consolidate.2019.04.09a', 'doc.2019.03.26b', ↵Paul E. McKenney
'fixes.2019.03.26b', 'srcu.2019.03.26b', 'stall.2019.03.26b' and 'torture.2019.03.26b' into HEAD consolidate.2019.04.09a: Lingering RCU flavor consolidation cleanups. doc.2019.03.26b: Documentation updates. fixes.2019.03.26b: Miscellaneous fixes. srcu.2019.03.26b: SRCU updates. stall.2019.03.26b: RCU CPU stall warning updates. torture.2019.03.26b: Torture-test updates.
2019-04-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2019-04-08lib/string: Add strscpy_pad() functionTobin C. Harding
We have a function to copy strings safely and we have a function to copy strings and zero the tail of the destination (if source string is shorter than destination buffer) but we do not have a function to do both at once. This means developers must write this themselves if they desire this functionality. This is a chore, and also leaves us open to off by one errors unnecessarily. Add a function that calls strscpy() then memset()s the tail to zero if the source string is shorter than the destination buffer. Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tobin C. Harding <tobin@kernel.org> Signed-off-by: Shuah Khan <shuah@kernel.org>
2019-04-08net: phy: improve link partner capability detectionHeiner Kallweit
genphy_read_status() so far checks phydev->supported, not the actual PHY capabilities. This can make a difference if the supported speeds have been limited by of_set_phy_supported() or phy_set_max_speed(). It seems that this issue only affects the link partner advertisements as displayed by ethtool. Also this patch wouldn't apply to older kernels because linkmode bitmaps have been introduced recently. Therefore net-next. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08Merge tag 'mlx5-updates-2019-04-02' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mamameed says: ==================== mlx5-updates-2019-04-02 This series provides misc updates to mlx5 driver 1) Aya Levin (1): Handle event of power detection in the PCIE slot 2) Eli Britstein (6): Some TC VLAN related updates and fixes to the previous VLAN modify action support patchset. Offload TC e-switch rules with egress/ingress VLAN devices 3) Max Gurtovoy (1): Fix double mutex initialization in esiwtch.c 4) Tariq Toukan (3): Misc small updates A write memory barrier is sufficient in EQ ci update Obsolete param field holding a constant value Unify logic of MTU boundaries 5) Tonghao Zhang (4): Misc updates to en_tc.c Make the log friendly when decapsulation offload not supported Remove 'parse_attr' argument in parse_tc_fdb_actions() Deletes unnecessary setting of esw_attr->parse_attr Return -EOPNOTSUPP when attempting to offload an unsupported action ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08netfilter: make two functions staticFlorian Westphal
They have no external callers anymore. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08netfilter: nft_osf: Add version option supportFernando Fernandez Mancera
Add version option support to the nftables "osf" expression. Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08virtio: Honour 'may_reduce_num' in vring_create_virtqueueCornelia Huck
vring_create_virtqueue() allows the caller to specify via the may_reduce_num parameter whether the vring code is allowed to allocate a smaller ring than specified. However, the split ring allocation code tries to allocate a smaller ring on allocation failure regardless of what the caller specified. This may cause trouble for e.g. virtio-pci in legacy mode, which does not support ring resizing. (The packed ring code does not resize in any case.) Let's fix this by bailing out immediately in the split ring code if the requested size cannot be allocated and may_reduce_num has not been specified. While at it, fix a typo in the usage instructions. Fixes: 2a2d1382fe9d ("virtio: Add improved queue allocation API") Cc: stable@vger.kernel.org # v4.6+ Signed-off-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Jens Freimann <jfreimann@redhat.com>
2019-04-08netfilter: replace NF_NAT_NEEDED with IS_ENABLED(CONFIG_NF_NAT)Florian Westphal
NF_NAT_NEEDED is true whenever nat support for either ipv4 or ipv6 is enabled. Now that the af-specific nat configuration switches have been removed, IS_ENABLED(CONFIG_NF_NAT) has the same effect. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08netfilter: nf_tables: merge route type into coreFlorian Westphal
very little code, so it really doesn't make sense to have extra modules or even a kconfig knob for this. Merge them and make functionality available unconditionally. The merge makes inet family route support trivial, so add it as well here. Before: text data bss dec hex filename 835 832 0 1667 683 nft_chain_route_ipv4.ko 870 832 0 1702 6a6 nft_chain_route_ipv6.ko 111568 2556 529 114653 1bfdd nf_tables.ko After: text data bss dec hex filename 113133 2556 529 116218 1c5fa nf_tables.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08netfilter: optimize nf_inet_addr_cmpLi RongQing
optimize nf_inet_addr_cmp by 64bit xor computation similar to ipv6_addr_equal() Signed-off-by: Yuan Linsi <yuanlinsi01@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08time: Introduce jiffies64_to_msecs()Li RongQing
there is a similar helper in net/netfilter/nf_tables_api.c, this maybe become a common request someday, so move it to time.c Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08ARM: OMAP2+: pm33xx: Add support for rtc+ddr in self refresh modeKeerthy
Add support for rtc+ddr in self refresh mode. Add addtional pm hooks for save/restore and rtc suspend/resume. Signed-off-by: Keerthy <j-keerthy@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com>
2019-04-08rtc: OMAP: Add support for rtc-only modeKeerthy
Prepare rtc driver for rtc-only with DDR in self-refresh mode. omap_rtc_power_off now should cater to two features: 1) RTC plus DDR in self-refresh is power a saving mode where in the entire system including the different voltage rails from PMIC are shutdown except the ones feeding on to RTC and DDR. DDR is kept in self-refresh hence the contents are preserved. RTC ALARM2 is connected to PMIC_EN line once we the ALARM2 is triggered we enter the mode with DDR in self-refresh and RTC Ticking. After a predetermined time an RTC ALARM1 triggers waking up the system[1]. The control goes to bootloader. The bootloader then checks RTC scratchpad registers to confirm it was an rtc_only wakeup and follows a different path, configure bare minimal clocks for ddr and then jumps to the resume address in another RTC scratchpad registers and transfers the control to Kernel. Kernel then restores the saved context. omap_rtc_power_off_program does the ALARM2 programming part. [1] http://www.ti.com/lit/ug/spruhl7h/spruhl7h.pdf Page 2884 2) Power-off: This is usual poweroff mode. omap_rtc_power_off calls the above omap_rtc_power_off_program function and in addition to that programs the OMAP_RTC_PMIC_REG for any external wake ups for PMIC like the pushbutton and shuts off the PMIC. Hence the split in omap_rtc_power_off. Signed-off-by: Keerthy <j-keerthy@ti.com> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> [tony@atomide.com: folded in a fix for compile warning] Signed-off-by: Tony Lindgren <tony@atomide.com>
2019-04-08datagram: remove rendundant 'peeked' argumentPaolo Abeni
After commit a297569fe00a ("net/udp: do not touch skb->peeked unless really needed") the 'peeked' argument of __skb_try_recv_datagram() and friends is always equal to !!'flags & MSG_PEEK'. Since such argument is really a boolean info, and the callers have already 'flags & MSG_PEEK' handy, we can remove it and clean-up the code a bit. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08dma-mapping: remove leftover NULL device supportChristoph Hellwig
Most dma_map_ops implementations already had some issues with a NULL device, or did simply crash if one was fed to them. Now that we have cleaned up all the obvious offenders we can stop to pretend we support this mode. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-08block: don't use for-inside-for in bio_for_each_segment_allMing Lei
Commit 6dc4f100c175 ("block: allow bio_for_each_segment_all() to iterate over multi-page bvec") changes bio_for_each_segment_all() to use for-inside-for. This way breaks all bio_for_each_segment_all() call with error out branch via 'break', since now 'break' can only break from the inner loop. Fixes this issue by implementing bio_for_each_segment_all() via single 'for' loop, and now the logic is very similar with normal bvec iterator. Cc: Qu Wenruo <quwenruo.btrfs@gmx.com> Cc: linux-btrfs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: Omar Sandoval <osandov@fb.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reported-and-Tested-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Fixes: 6dc4f100c175 ("block: allow bio_for_each_segment_all() to iterate over multi-page bvec") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-08tracing: introduce TRACE_EVENT_NOP()Yafang Shao
Sometimes we want to define a tracepoint as a do-nothing function. So I introduce TRACE_EVENT_NOP, DECLARE_EVENT_CLASS_NOP and DEFINE_EVENT_NOP for this kind of usage. Link: http://lkml.kernel.org/r/1553602391-11926-2-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-08Merge tag 'v5.1-rc3' into develLinus Walleij
Linux 5.1-rc3
2019-04-08drivers: Remove explicit invocations of mmiowb()Will Deacon
mmiowb() is now implied by spin_unlock() on architectures that require it, so there is no reason to call it from driver code. This patch was generated using coccinelle: @mmiowb@ @@ - mmiowb(); and invoked as: $ for d in drivers include/linux/qed sound; do \ spatch --include-headers --sp-file mmiowb.cocci --dir $d --in-place; done NOTE: mmiowb() has only ever guaranteed ordering in conjunction with spin_unlock(). However, pairing each mmiowb() removal in this patch with the corresponding call to spin_unlock() is not at all trivial, so there is a small chance that this change may regress any drivers incorrectly relying on mmiowb() to order MMIO writes between CPUs using lock-free synchronisation. If you've ended up bisecting to this commit, you can reintroduce the mmiowb() calls using wmb() instead, which should restore the old behaviour on all architectures other than some esoteric ia64 systems. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-04-08mmiowb: Hook up mmiowb helpers to spinlocks and generic I/O accessorsWill Deacon
Removing explicit calls to mmiowb() from driver code means that we must now call into the generic mmiowb_spin_{lock,unlock}() functions from the core spinlock code. In order to elide barriers following critical sections without any I/O writes, we also hook into the asm-generic I/O routines. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-04-08cpufreq: intel_pstate: Update max frequency on global turbo changesRafael J. Wysocki
While the cpuinfo.max_freq value doesn't really matter for intel_pstate in the active mode, in the passive mode it is used by governors as the maximum physical frequency of the CPU and the results of governor computations generally depend on it. Also it is made available to user space via sysfs and it should match the current HW configuration. For this reason, make intel_pstate update cpuinfo.max_freq for all CPUs if it detects a global change of turbo frequency settings from "disable" to "enable" or the other way associated with a _PPC change notification from the platform firmware. Note that policy_is_inactive(), cpufreq_cpu_acquire(), cpufreq_cpu_release(), and cpufreq_set_policy() need to be made available to it for this purpose. Link: https://bugzilla.kernel.org/show_bug.cgi?id=200759 Reported-by: Gabriele Mazzotta <gabriele.mzt@gmail.com> Tested-by: Gabriele Mazzotta <gabriele.mzt@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2019-04-08gpio: mmio: Drop bgpio_dir_invertedLinus Walleij
The direction inversion semantics are now handled by simply using the registers for in/out available, no need to keep track of inversion semantics exmplicitly anymore. Reviewed-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Reviewed-by: Jan Kotas <jank@cadence.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-04-08mtd: nand: omap: Fix comment in platform data using wrong Kconfig symbolMiquel Raynal
The symbol that is being used in the #if/#endif block is not the one which is mentioned at the bottom. Fixes: 93af53b8633c ("nand: omap2: Remove horrible ifdefs to fix module probe") Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
2019-04-08mtd: rawnand: Get rid of chip->ecc_{strength,step}_dsBoris Brezillon
nand_device embeds a nand_ecc_req object which contains the minimum strength and step-size required by the NAND device. Drop the chip->ecc_{strength,step}_ds fields and use chip->base.eccreq.{strength,step_size} instead. Signed-off-by: Boris Brezillon <bbrezillon@kernel.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
2019-04-08mtd: rawnand: Get rid of chip->numchipsBoris Brezillon
The same information is provided by nanddev_ntargets(). Signed-off-by: Boris Brezillon <bbrezillon@kernel.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
2019-04-08mtd: rawnand: Get rid of chip->chipsizeBoris Brezillon
The target size can now be returned by nanddev_get_targetsize(). Get rid of the chip->chipsize field and use this helper instead. Signed-off-by: Boris Brezillon <bbrezillon@kernel.org> Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
2019-04-08mtd: rawnand: Get rid of chip->bits_per_cellBoris Brezillon
Now that we inherit from nand_device, we can use nand_device->memorg.bits_per_cell instead of having our own field at the nand_chip level. Signed-off-by: Boris Brezillon <bbrezillon@kernel.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
2019-04-08mtd: rawnand: Use nanddev_mtd_max_bad_blocks()Boris Brezillon
nanddev_mtd_max_bad_blocks() is implemented by the generic NAND layer and is already doing what we need. Reuse this function instead of having our own implementation. While at it, get rid of the ->max_bb_per_die and ->blocks_per_die fields which are now unused. Signed-off-by: Boris Brezillon <bbrezillon@kernel.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
2019-04-08mtd: rawnand: Move all page cache related fields to a sub-structBoris Brezillon
Looking at the field names it's hard to tell what ->data_buf, ->pagebuf and ->pagebuf_bitflips are for. Clarify that by moving those fields in a sub-struct named pagecache. Signed-off-by: Boris Brezillon <bbrezillon@kernel.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
2019-04-08mtd: rawnand: Provide a helper to get chip->data_bufBoris Brezillon
We plan to move cache related fields to a pagecache struct in nand_chip but some drivers access ->pagebuf directly to invalidate the cache before they start using ->data_buf. Let's provide an helper that returns a pointer to ->data_buf after invalidating the cache. Signed-off-by: Boris Brezillon <bbrezillon@kernel.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
2019-04-08mtd: rawnand: Prepare things to reuse the generic NAND layerBoris Brezillon
The generic NAND layer provides abstraction of NAND devices no matter the bus that is used to communicate with the chip. Basing the raw NAND core on this generic layer should avoid duplication of common operations, like iterating over all pages/blocks for MTD IO/erase operations. In order to re-use this layer, we must first inherit from nand_device and then initialize the nand_device struct appropriately. This patch is taking care of the former. Signed-off-by: Boris Brezillon <bbrezillon@kernel.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
2019-04-08mtd: rawnand: Use nand_to_mtd() in nand_{set,get}_flash_node()Boris Brezillon
Use the nand_to_mtd() helper to access chip->mtd as done everywhere else. Signed-off-by: Boris Brezillon <bbrezillon@kernel.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
2019-04-08mtd: nand: Add a helper to retrieve the number of pages per targetBoris Brezillon
Will be used by the raw NAND framework. Signed-off-by: Boris Brezillon <bbrezillon@kernel.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
2019-04-08mtd: nand: Add a helper returning the number of eraseblocks per targetBoris Brezillon
Some drivers in the raw NAND framework seems to need this helper, so let's just add it instead of open-coding the logic. Signed-off-by: Boris Brezillon <bbrezillon@kernel.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
2019-04-08mtd: nand: Add max_bad_eraseblocks_per_lun info to memorgBoris Brezillon
NAND datasheets usually give the maximum number of bad blocks per LUN and this number can be used to help upper layers decide how much blocks they should reserve for bad block handling. Add a max_bad_eraseblocks_per_lun to the nand_memory_organization struct and update the NAND_MEMORG() macro (and its users) accordingly. We also provide a default mtd->_max_bad_blocks() implementation. Signed-off-by: Boris Brezillon <bbrezillon@kernel.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
2019-04-08spi: add a method for configuring CS timingSowjanya Komatineni
This patch creates set_cs_timing SPI master optional method for SPI masters to implement configuring CS timing if applicable. This patch also creates spi_cs_timing accessory for SPI clients to use for requesting SPI master controllers to configure device requested CS setup time, hold time and inactive delay. Signed-off-by: Sowjanya Komatineni <skomatineni@nvidia.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2019-04-08spi: bitbang: Introduce spi_bitbang_init()Andrey Smirnov
Move all of the code doing struct spi_bitbang initialization, so that it can be paired with devm_spi_register_master() in order to avoid having to call spi_bitbang_stop() explicitly. Signed-off-by: Andrey Smirnov <andrew.smirnov@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Chris Healy <cphealy@gmail.com> Cc: linux-spi@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
2019-04-08crypto: ccp - introduce SEV_GET_ID2 commandSingh, Brijesh
The current definition and implementation of the SEV_GET_ID command does not provide the length of the unique ID returned by the firmware. As per the firmware specification, the firmware may return an ID length that is not restricted to 64 bytes as assumed by the SEV_GET_ID command. Introduce the SEV_GET_ID2 command to overcome with the SEV_GET_ID limitations. Deprecate the SEV_GET_ID in the favor of SEV_GET_ID2. At the same time update SEV API web link. Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Gary Hook <gary.hook@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Nathaniel McCallum <npmccallum@redhat.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-04-07rhashtable: add lockdep tracking to bucket bit-spin-locks.NeilBrown
Native bit_spin_locks are not tracked by lockdep. The bit_spin_locks used for rhashtable buckets are local to the rhashtable implementation, so there is little opportunity for the sort of misuse that lockdep might detect. However locks are held while a hash function or compare function is called, and if one of these took a lock, a misbehaviour is possible. As it is quite easy to add lockdep support this unlikely possibility seems to be enough justification. So create a lockdep class for bucket bit_spin_lock and attach through a lockdep_map in each bucket_table. Without the 'nested' annotation in rhashtable_rehash_one(), lockdep correctly reports a possible problem as this lock is taken while another bucket lock (in another table) is held. This confirms that the added support works. With the correct nested annotation in place, lockdep reports no problems. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-07rhashtable: use bit_spin_locks to protect hash bucket.NeilBrown
This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the bucket pointer to lock the hash chain for that bucket. The benefits of a bit spin_lock are: - no need to allocate a separate array of locks. - no need to have a configuration option to guide the choice of the size of this array - locking cost is often a single test-and-set in a cache line that will have to be loaded anyway. When inserting at, or removing from, the head of the chain, the unlock is free - writing the new address in the bucket head implicitly clears the lock bit. For __rhashtable_insert_fast() we ensure this always happens when adding a new key. - even when lockings costs 2 updates (lock and unlock), they are in a cacheline that needs to be read anyway. The cost of using a bit spin_lock is a little bit of code complexity, which I think is quite manageable. Bit spin_locks are sometimes inappropriate because they are not fair - if multiple CPUs repeatedly contend of the same lock, one CPU can easily be starved. This is not a credible situation with rhashtable. Multiple CPUs may want to repeatedly add or remove objects, but they will typically do so at different buckets, so they will attempt to acquire different locks. As we have more bit-locks than we previously had spinlocks (by at least a factor of two) we can expect slightly less contention to go with the slightly better cache behavior and reduced memory consumption. To enhance type checking, a new struct is introduced to represent the pointer plus lock-bit that is stored in the bucket-table. This is "struct rhash_lock_head" and is empty. A pointer to this needs to be cast to either an unsigned lock, or a "struct rhash_head *" to be useful. Variables of this type are most often called "bkt". Previously "pprev" would sometimes point to a bucket, and sometimes a ->next pointer in an rhash_head. As these are now different types, pprev is NULL when it would have pointed to the bucket. In that case, 'blk' is used, together with correct locking protocol. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-07rhashtable: allow rht_bucket_var to return NULL.NeilBrown
Rather than returning a pointer to a static nulls, rht_bucket_var() now returns NULL if the bucket doesn't exist. This will make the next patch, which stores a bitlock in the bucket pointer, somewhat cleaner. This change involves introducing __rht_bucket_nested() which is like rht_bucket_nested(), but doesn't provide the static nulls, and changing rht_bucket_nested() to call this and possible provide a static nulls - as is still needed for the non-var case. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-07PM / arch: x86: Rework the MSR_IA32_ENERGY_PERF_BIAS handlingRafael J. Wysocki
The current handling of MSR_IA32_ENERGY_PERF_BIAS in the kernel is problematic, because it may cause changes made by user space to that MSR (with the help of the x86_energy_perf_policy tool, for example) to be lost every time a CPU goes offline and then back online as well as during system-wide power management transitions into sleep states and back into the working state. The first problem is that if the current EPB value for a CPU going online is 0 ('performance'), the kernel will change it to 6 ('normal') regardless of whether or not this is the first bring-up of that CPU. That also happens during system-wide resume from sleep states (including, but not limited to, hibernation). However, the EPB may have been adjusted by user space this way and the kernel should not blindly override that setting. The second problem is that if the platform firmware resets the EPB values for any CPUs during system-wide resume from a sleep state, the kernel will not restore their previous EPB values that may have been set by user space before the preceding system-wide suspend transition. Again, that behavior may at least be confusing from the user space perspective. In order to address these issues, rework the handling of MSR_IA32_ENERGY_PERF_BIAS so that the EPB value is saved on CPU offline and restored on CPU online as well as (for the boot CPU) during the syscore stages of system-wide suspend and resume transitions, respectively. However, retain the policy by which the EPB is set to 6 ('normal') on the first bring-up of each CPU if its initial value is 0, based on the observation that 0 may mean 'not initialized' just as well as 'performance' in that case. While at it, move the MSR_IA32_ENERGY_PERF_BIAS handling code into a separate file and document it in Documentation/admin-guide. Fixes: abe48b108247 (x86, intel, power: Initialize MSR_IA32_ENERGY_PERF_BIAS) Fixes: b51ef52df71c (x86/cpu: Restore MSR_IA32_ENERGY_PERF_BIAS after resume) Reported-by: Thomas Renninger <trenn@suse.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Acked-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de>
2019-04-07mfd: ti-lmu: Remove LM3532 backlight driver referencesDan Murphy
Remove the LM3532 backlight driver references from the ti-lmu code as dedicated driver support is available. Signed-off-by: Dan Murphy <dmurphy@ti.com> Acked-for-MFD-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
2019-04-06fs: stream_open - opener for stream-like files so that read and write can ↵Kirill Smelkov
run simultaneously without deadlock Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added locking for file.f_pos access and in particular made concurrent read and write not possible - now both those functions take f_pos lock for the whole run, and so if e.g. a read is blocked waiting for data, write will deadlock waiting for that read to complete. This caused regression for stream-like files where previously read and write could run simultaneously, but after that patch could not do so anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes to /proc/xen/xenbus") which fixes such regression for particular case of /proc/xen/xenbus. The patch that added f_pos lock in 2014 did so to guarantee POSIX thread safety for read/write/lseek and added the locking to file descriptors of all regular files. In 2014 that thread-safety problem was not new as it was already discussed earlier in 2006. However even though 2006'th version of Linus's patch was adding f_pos locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding the stream-like objects like pipes and sockets)", the 2014 version - the one that actually made it into the tree as 9c225f2655e3 - is doing so irregardless of whether a file is seekable or not. See https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/ https://lwn.net/Articles/180387 https://lwn.net/Articles/180396 for historic context. The reason that it did so is, probably, that there are many files that are marked non-seekable, but e.g. their read implementation actually depends on knowing current position to correctly handle the read. Some examples: kernel/power/user.c snapshot_read fs/debugfs/file.c u32_array_read fs/fuse/control.c fuse_conn_waiting_read + ... drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read arch/s390/hypfs/inode.c hypfs_read_iter ... Despite that, many nonseekable_open users implement read and write with pure stream semantics - they don't depend on passed ppos at all. And for those cases where read could wait for something inside, it creates a situation similar to xenbus - the write could be never made to go until read is done, and read is waiting for some, potentially external, event, for potentially unbounded time -> deadlock. Besides xenbus, there are 14 such places in the kernel that I've found with semantic patch (see below): drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write() drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write() drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write() drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write() net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write() drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write() drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write() drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write() net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write() drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write() drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write() drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write() drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write() drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write() In addition to the cases above another regression caused by f_pos locking is that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can no longer implement bidirectional stream-like files - for the same reason as above e.g. read can deadlock write locking on file.f_pos in the kernel. FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse: implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and write routines not depending on current position at all, and with both read and write being potentially blocking operations: See https://github.com/libfuse/osspd https://lwn.net/Articles/308445 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510 Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as "somewhat pipe-like files ..." with read handler not using offset. However that test implements only read without write and cannot exercise the deadlock scenario: https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216 I've actually hit the read vs write deadlock for real while implementing my FUSE filesystem where there is /head/watch file, for which open creates separate bidirectional socket-like stream in between filesystem and its user with both read and write being later performed simultaneously. And there it is semantically not easy to split the stream into two separate read-only and write-only channels: https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169 Let's fix this regression. The plan is: 1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would break many in-kernel nonseekable_open users which actually use ppos in read/write handlers. 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors. Read and write on such file descriptors would never use nor change ppos. And with that property on stream-like files read and write will be running without taking f_pos lock - i.e. read and write could be running simultaneously. 3. With semantic patch search and convert to stream_open all in-kernel nonseekable_open users for which read and write actually do not depend on ppos and where there is no other methods in file_operations which assume @offset access. 4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via steam_open if that bit is present in filesystem open reply. It was tempting to change fs/fuse/ open handler to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. 5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from v3.14+ (the kernel where 9c225f2655 first appeared). This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in their open handler and this way avoid the deadlock on all kernel versions. This should work because fs/fuse/ ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. This patch adds stream_open, converts /proc/xen/xenbus to it and adds semantic patch to automatically locate in-kernel places that are either required to be converted due to read vs write deadlock, or that are just safe to be converted because read and write do not use ppos and there are no other funky methods in file_operations. Regarding semantic patch I've verified each generated change manually - that it is correct to convert - and each other nonseekable_open instance left - that it is either not correct to convert there, or that it is not converted due to current stream_open.cocci limitations. The script also does not convert files that should be valid to convert, but that currently have .llseek = noop_llseek or generic_file_llseek for unknown reason despite file being opened with nonseekable_open (e.g. drivers/input/mousedev.c) Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yongzhi Pan <panyongzhi@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Juergen Gross <jgross@suse.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Nikolaus Rath <Nikolaus@rath.org> Cc: Han-Wen Nienhuys <hanwen@google.com> Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-06block: remove CONFIG_LBDAFChristoph Hellwig
Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit architectures. These types are required to support block device and/or file sizes larger than 2 TiB, and have generally defaulted to on for a long time. Enabling the option only increases the i386 tinyconfig size by 145 bytes, and many data structures already always use 64-bit values for their in-core and on-disk data structures anyway, so there should not be a large change in dynamic memory usage either. Dropping this option removes a somewhat weird non-default config that has cause various bugs or compiler warnings when actually used. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06PCI: Work around Pericom PCIe-to-PCI bridge Retrain Link erratumStefan Mätje
Due to an erratum in some Pericom PCIe-to-PCI bridges in reverse mode (conventional PCI on primary side, PCIe on downstream side), the Retrain Link bit needs to be cleared manually to allow the link training to complete successfully. If it is not cleared manually, the link training is continuously restarted and no devices below the PCI-to-PCIe bridge can be accessed. That means drivers for devices below the bridge will be loaded but won't work and may even crash because the driver is only reading 0xffff. See the Pericom Errata Sheet PI7C9X111SLB_errata_rev1.2_102711.pdf for details. Devices known as affected so far are: PI7C9X110, PI7C9X111SL, PI7C9X130. Add a new flag, clear_retrain_link, in struct pci_dev. Quirks for affected devices set this bit. Note that pcie_retrain_link() lives in aspm.c because that's currently the only place we use it, but this erratum is not specific to ASPM, and we may retrain links for other reasons in the future. Signed-off-by: Stefan Mätje <stefan.maetje@esd.eu> [bhelgaas: apply regardless of CONFIG_PCIEASPM] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> CC: stable@vger.kernel.org
2019-04-05Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "14 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: kernel/sysctl.c: fix out-of-bounds access when setting file-max mm/util.c: fix strndup_user() comment sh: fix multiple function definition build errors MAINTAINERS: add maintainer and replacing reviewer ARM/NUVOTON NPCM MAINTAINERS: fix bad pattern in ARM/NUVOTON NPCM mm: writeback: use exact memcg dirty counts psi: clarify the units used in pressure files mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd() hugetlbfs: fix memory leak for resv_map mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX() lib/lzo: fix bugs for very short or empty input include/linux/bitrev.h: fix constant bitrev kmemleak: powerpc: skip scanning holes in the .bss section lib/string.c: implement a basic bcmp
2019-04-05mm: writeback: use exact memcg dirty countsGreg Thelen
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") memcg dirty and writeback counters are managed as: 1) per-memcg per-cpu values in range of [-32..32] 2) per-memcg atomic counter When a per-cpu counter cannot fit in [-32..32] it's flushed to the atomic. Stat readers only check the atomic. Thus readers such as balance_dirty_pages() may see a nontrivial error margin: 32 pages per cpu. Assuming 100 cpus: 4k x86 page_size: 13 MiB error per memcg 64k ppc page_size: 200 MiB error per memcg Considering that dirty+writeback are used together for some decisions the errors double. This inaccuracy can lead to undeserved oom kills. One nasty case is when all per-cpu counters hold positive values offsetting an atomic negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32). balance_dirty_pages() only consults the atomic and does not consider throttling the next n_cpu*32 dirty pages. If the file_lru is in the 13..200 MiB range then there's absolutely no dirty throttling, which burdens vmscan with only dirty+writeback pages thus resorting to oom kill. It could be argued that tiny containers are not supported, but it's more subtle. It's the amount the space available for file lru that matters. If a container has memory.max-200MiB of non reclaimable memory, then it will also suffer such oom kills on a 100 cpu machine. The following test reliably ooms without this patch. This patch avoids oom kills. $ cat test mount -t cgroup2 none /dev/cgroup cd /dev/cgroup echo +io +memory > cgroup.subtree_control mkdir test cd test echo 10M > memory.max (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo) (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100) $ cat memcg-writeback-stress.c /* * Dirty pages from all but one cpu. * Clean pages from the non dirtying cpu. * This is to stress per cpu counter imbalance. * On a 100 cpu machine: * - per memcg per cpu dirty count is 32 pages for each of 99 cpus * - per memcg atomic is -99*32 pages * - thus the complete dirty limit: sum of all counters 0 * - balance_dirty_pages() only sees atomic count -99*32 pages, which * it max()s to 0. * - So a workload can dirty -99*32 pages before balance_dirty_pages() * cares. */ #define _GNU_SOURCE #include <err.h> #include <fcntl.h> #include <sched.h> #include <stdlib.h> #include <stdio.h> #include <sys/stat.h> #include <sys/sysinfo.h> #include <sys/types.h> #include <unistd.h> static char *buf; static int bufSize; static void set_affinity(int cpu) { cpu_set_t affinity; CPU_ZERO(&affinity); CPU_SET(cpu, &affinity); if (sched_setaffinity(0, sizeof(affinity), &affinity)) err(1, "sched_setaffinity"); } static void dirty_on(int output_fd, int cpu) { int i, wrote; set_affinity(cpu); for (i = 0; i < 32; i++) { for (wrote = 0; wrote < bufSize; ) { int ret = write(output_fd, buf+wrote, bufSize-wrote); if (ret == -1) err(1, "write"); wrote += ret; } } } int main(int argc, char **argv) { int cpu, flush_cpu = 1, output_fd; const char *output; if (argc != 2) errx(1, "usage: output_file"); output = argv[1]; bufSize = getpagesize(); buf = malloc(getpagesize()); if (buf == NULL) errx(1, "malloc failed"); output_fd = open(output, O_CREAT|O_RDWR); if (output_fd == -1) err(1, "open(%s)", output); for (cpu = 0; cpu < get_nprocs(); cpu++) { if (cpu != flush_cpu) dirty_on(output_fd, cpu); } set_affinity(flush_cpu); if (fsync(output_fd)) err(1, "fsync(%s)", output); if (close(output_fd)) err(1, "close(%s)", output); free(buf); } Make balance_dirty_pages() and wb_over_bg_thresh() work harder to collect exact per memcg counters. This avoids the aforementioned oom kills. This does not affect the overhead of memory.stat, which still reads the single atomic counter. Why not use percpu_counter? memcg already handles cpus going offline, so no need for that overhead from percpu_counter. And the percpu_counter spinlocks are more heavyweight than is required. It probably also makes sense to use exact dirty and writeback counters in memcg oom reports. But that is saved for later. Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com Signed-off-by: Greg Thelen <gthelen@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> [4.16+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>