summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2016-02-09selinux: nlmsgtab: add SOCK_DESTROY to the netlink mapping tablesLorenzo Colitti
Without this, using SOCK_DESTROY in enforcing mode results in: SELinux: unrecognized netlink message type=21 for sclass=32 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09sctp: translate network order to host order when users get a hmacidXin Long
Commit ed5a377d87dc ("sctp: translate host order to network order when setting a hmacid") corrected the hmacid byte-order when setting a hmacid. but the same issue also exists on getting a hmacid. We fix it by changing hmacids to host order when users get them with getsockopt. Fixes: Commit ed5a377d87dc ("sctp: translate host order to network order when setting a hmacid") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09enic: increment devcmd2 result ring in case of timeoutSandeep Pillai
Firmware posts the devcmd result in result ring. In case of timeout, driver does not increment the current result pointer and firmware could post the result after timeout has occurred. During next devcmd, driver would be reading the result of previous devcmd. Fix this by incrementing result even in case of timeout. Fixes: 373fb0873d43 ("enic: add devcmd2") Signed-off-by: Sandeep Pillai <sanpilla@cisco.com> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segsSiva Reddy Kallam
tg3_tso_bug() can hit a condition where the entire tx ring is not big enough to segment the GSO packet. For example, if MSS is very small, gso_segs can exceed the tx ring size. When we hit the condition, it will cause tx timeout. tg3_tso_bug() is called to handle TSO and DMA hardware bugs. For TSO bugs, if tg3_tso_bug() cannot succeed, we have to drop the packet. For DMA bugs, we can still fall back to linearize the SKB and let the hardware transmit the TSO packet. This patch adds a function tg3_tso_bug_gso_check() to check if there are enough tx descriptors for GSO before calling tg3_tso_bug(). The caller will then handle the error appropriately - drop or lineraize the SKB. v2: Corrected patch description to avoid confusion. Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Acked-by: Prashant Sreedharan <prashant@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09net:Add sysctl_max_skb_fragsHans Westgaard Ry
Devices may have limits on the number of fragments in an skb they support. Current codebase uses a constant as maximum for number of fragments one skb can hold and use. When enabling scatter/gather and running traffic with many small messages the codebase uses the maximum number of fragments and may thereby violate the max for certain devices. The patch introduces a global variable as max number of fragments. Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09tcp: do not drop syn_recv on all icmp reportsEric Dumazet
Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets were leading to RST. This is of course incorrect. A specific list of ICMP messages should be able to drop a SYN_RECV. For instance, a REDIRECT on SYN_RECV shall be ignored, as we do not hold a dst per SYN_RECV pseudo request. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751 Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Reported-by: Petr Novopashenniy <pety@rusnet.ru> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-08Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Paolo Bonzini: "KVM-ARM fixes, mostly coming from the PMU work" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: arm64: KVM: Fix guest dead loop when register accessor returns false arm64: KVM: Fix comments of the CP handler arm64: KVM: Fix wrong use of the CPSR MODE mask for 32bit guests arm64: KVM: Obey RES0/1 reserved bits when setting CPTR_EL2 arm64: KVM: Fix AArch64 guest userspace exception injection
2016-02-08Merge tag 'regmap-fix-v4.5-big-endian' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap Pull regmap fix from Mark Brown: "A single revert back to v4.4 endianness handling. Commit 29bb45f25ff3 ("regmap-mmio: Use native endianness for read/write") attempted to fix some long standing bugs in the MMIO implementation for big endian systems caused by duplicate byte swapping in both regmap and readl()/writel(). Sadly the fix makes things worse rather than better, so revert it for now" * tag 'regmap-fix-v4.5-big-endian' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: regmap: mmio: Revert to v4.4 endianness handling
2016-02-08scatterlist: fix a typo in comment block of sg_miter_stop()Masahiro Yamada
Fix the doubled "started" and tidy up the following sentences. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-08ipv6: fix a lockdep splatEric Dumazet
Silence lockdep false positive about rcu_dereference() being used in the wrong context. First one should use rcu_dereference_protected() as we own the spinlock. Second one should be a normal assignation, as no barrier is needed. Fixes: 18367681a10bd ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.") Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-08unix: correctly track in-flight fds in sending process user_structHannes Frederic Sowa
The commit referenced in the Fixes tag incorrectly accounted the number of in-flight fds over a unix domain socket to the original opener of the file-descriptor. This allows another process to arbitrary deplete the original file-openers resource limit for the maximum of open files. Instead the sending processes and its struct cred should be credited. To do so, we add a reference counted struct user_struct pointer to the scm_fp_list and use it to account for the number of inflight unix fds. Fixes: 712f4aad406bb1 ("unix: properly account for FDs passed over unix sockets") Reported-by: David Herrmann <dh.herrmann@gmail.com> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-08Merge tag 'kvm-arm-for-4.5-rc2' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master KVM/ARM fixes for v4.5-rc2 A few random fixes, mostly coming from the PMU work by Shannon: - fix for injecting faults coming from the guest's userspace - cleanup for our CPTR_EL2 accessors (reserved bits) - fix for a bug impacting perf (user/kernel discrimination) - fix for a 32bit sysreg handling bug
2016-02-08netfilter: nft_counter: fix erroneous return valuesAnton Protopopov
The nft_counter_init() and nft_counter_clone() functions should return negative error value -ENOMEM instead of positive ENOMEM. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-02-08netfilter: tee: select NF_DUP_IPV6 unconditionallyArnd Bergmann
The NETFILTER_XT_TARGET_TEE option selects NF_DUP_IPV6 whenever IP6_NF_IPTABLES is enabled, and it ensures that it cannot be builtin itself if NF_CONNTRACK is a loadable module, as that is a dependency for NF_DUP_IPV6. However, NF_DUP_IPV6 can be enabled even if IP6_NF_IPTABLES is turned off, and it only really depends on IPV6. With the current check in tee_tg6, we call nf_dup_ipv6() whenever NF_DUP_IPV6 is enabled. This can however be a loadable module which is unreachable from a built-in xt_TEE: net/built-in.o: In function `tee_tg6': :(.text+0x67728): undefined reference to `nf_dup_ipv6' The bug was originally introduced in the split of the xt_TEE module into separate modules for ipv4 and ipv6, and two patches tried to fix it unsuccessfully afterwards. This is a revert of the the first incorrect attempt to fix it, going back to depending on IPV6 as the dependency, and we adapt the 'select' condition accordingly. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: bbde9fc1824a ("netfilter: factor out packet duplication for IPv4/IPv6") Fixes: 116984a316c3 ("netfilter: xt_TEE: use IS_ENABLED(CONFIG_NF_DUP_IPV6)") Fixes: 74ec4d55c4d2 ("netfilter: fix xt_TEE and xt_TPROXY dependencies") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-02-08netfilter: nfnetlink: correctly validate length of batch messagesPhil Turnbull
If nlh->nlmsg_len is zero then an infinite loop is triggered because 'skb_pull(skb, msglen);' pulls zero bytes. The calculation in nlmsg_len() underflows if 'nlh->nlmsg_len < NLMSG_HDRLEN' which bypasses the length validation and will later trigger an out-of-bound read. If the length validation does fail then the malformed batch message is copied back to userspace. However, we cannot do this because the nlh->nlmsg_len can be invalid. This leads to an out-of-bounds read in netlink_ack: [ 41.455421] ================================================================== [ 41.456431] BUG: KASAN: slab-out-of-bounds in memcpy+0x1d/0x40 at addr ffff880119e79340 [ 41.456431] Read of size 4294967280 by task a.out/987 [ 41.456431] ============================================================================= [ 41.456431] BUG kmalloc-512 (Not tainted): kasan: bad access detected [ 41.456431] ----------------------------------------------------------------------------- ... [ 41.456431] Bytes b4 ffff880119e79310: 00 00 00 00 d5 03 00 00 b0 fb fe ff 00 00 00 00 ................ [ 41.456431] Object ffff880119e79320: 20 00 00 00 10 00 05 00 00 00 00 00 00 00 00 00 ............... [ 41.456431] Object ffff880119e79330: 14 00 0a 00 01 03 fc 40 45 56 11 22 33 10 00 05 .......@EV."3... [ 41.456431] Object ffff880119e79340: f0 ff ff ff 88 99 aa bb 00 14 00 0a 00 06 fe fb ................ ^^ start of batch nlmsg with nlmsg_len=4294967280 ... [ 41.456431] Memory state around the buggy address: [ 41.456431] ffff880119e79400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 41.456431] ffff880119e79480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 41.456431] >ffff880119e79500: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc [ 41.456431] ^ [ 41.456431] ffff880119e79580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 41.456431] ffff880119e79600: fc fc fc fc fc fc fc fc fc fc fb fb fb fb fb fb [ 41.456431] ================================================================== Fix this with better validation of nlh->nlmsg_len and by setting NFNL_BATCH_FAILURE if any batch message fails length validation. CAP_NET_ADMIN is required to trigger the bugs. Fixes: 9ea2aa8b7dba ("netfilter: nfnetlink: validate nfnetlink header from batch") Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-02-07Linux 4.5-rc3v4.5-rc3Linus Torvalds
2016-02-07Merge tag 'armsoc-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC fixes from Olof Johansson: "The first real batch of fixes for this release cycle, so there are a few more than usual. Most of these are fixes and tweaks to board support (DT bugfixes, etc). I've also picked up a couple of small cleanups that seemed innocent enough that there was little reason to wait (const/ __initconst and Kconfig deps). Quite a bit of the changes on OMAP were due to fixes to no longer write to rodata from assembly when ARM_KERNMEM_PERMS was enabled, but there were also other fixes. Kirkwood had a bunch of gpio fixes for some boards. OMAP had RTC fixes on OMAP5, and Nomadik had changes to MMC parameters in DT. All in all, mostly the usual mix of various fixes" * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (46 commits) ARM: multi_v7_defconfig: enable DW_WATCHDOG ARM: nomadik: fix up SD/MMC DT settings ARM64: tegra: Add chosen node for tegra132 norrin ARM: realview: use "depends on" instead of "if" after prompt ARM: tango: use "depends on" instead of "if" after prompt ARM: tango: use const and __initconst for smp_operations ARM: realview: use const and __initconst for smp_operations bus: uniphier-system-bus: revive tristate prompt arm64: dts: Add missing DMA Abort interrupt to Juno bus: vexpress-config: Add missing of_node_put ARM: dts: am57xx: sbc-am57x: correct Eth PHY settings ARM: dts: am57xx: cl-som-am57x: fix CPSW EMAC pinmux ARM: dts: am57xx: sbc-am57x: fix UART3 pinmux ARM: dts: am57xx: cl-som-am57x: update SPI Flash frequency ARM: dts: am57xx: cl-som-am57x: set HOST mode for USB2 ARM: dts: am57xx: sbc-am57x: fix SB-SOM EEPROM I2C address ARM: dts: LogicPD Torpedo: Revert Duplicative Entries ARM: dts: am437x: pixcir_tangoc: use correct flags for irq types ARM: dts: am4372: fix irq type for arm twd and global timer ARM: dts: at91: sama5d4 xplained: fix phy0 IRQ type ...
2016-02-07Merge branch 'mailbox-devel' of ↵Linus Torvalds
git://git.linaro.org/landing-teams/working/fujitsu/integration Pull mailbox fixes from Jassi Brar: - fix getting element from the pcc-channels array by simply indexing into it - prevent building mailbox-test driver for archs that don't have IOMEM * 'mailbox-devel' of git://git.linaro.org/landing-teams/working/fujitsu/integration: mailbox: Fix dependencies for !HAS_IOMEM archs mailbox: pcc: fix channel calculation in get_pcc_channel()
2016-02-07update be2net maintainers' email addressesSathya Perla
be2net maintainers' email addresses changed from avagotech.com to broadcom.com starting today. While updating the list, I'm also adding Somnath's name to the list. Signed-off-by: Sathya Perla <sathya.perla@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06Merge tag 'usb-4.5-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are some USB fixes for 4.5-rc3. The usual, xhci fixes for reported issues, combined with some small gadget driver fixes, and a MAINTAINERS file update. All have been in linux-next with no reported issues" * tag 'usb-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: xhci: harden xhci_find_next_ext_cap against device removal xhci: Fix list corruption in urb dequeue at host removal usb: host: xhci-plat: fix NULL pointer in probe for device tree case usb: xhci-mtk: fix AHB bus hang up caused by roothubs polling usb: xhci-mtk: fix bpkts value of LS/HS periodic eps not behind TT usb: xhci: apply XHCI_PME_STUCK_QUIRK to Intel Broxton-M platforms usb: xhci: set SSIC port unused only if xhci_suspend succeeds usb: xhci: add a quirk bit for ssic port unused usb: xhci: handle both SSIC ports in PME stuck quirk usb: dwc3: gadget: set the OTG flag in dwc3 gadget driver. Revert "xhci: don't finish a TD if we get a short-transfer event mid TD" MAINTAINERS: fix my email address usb: dwc2: Fix probe problem on bcm2835 Revert "usb: dwc2: Move reset into dwc2_get_hwparams()" usb: musb: ux500: Fix NULL pointer dereference at system PM usb: phy: mxs: declare variable with initialized value usb: phy: msm: fix error handling in probe.
2016-02-06Merge tag 'staging-4.5-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging and IIO driver fixes from Greg KH: "Here are some IIO and staging driver fixes for 4.5-rc3. All of them, except one, are for IIO drivers, and one is for a speakup driver fix caused by some earlier patches, to resolve a reported build failure" * tag 'staging-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: Staging: speakup: Fix allyesconfig build on mn10300 iio: dht11: Use boottime iio: ade7753: avoid uninitialized data iio: pressure: mpl115: fix temperature offset sign iio: imu: Fix dependencies for !HAS_IOMEM archs staging: iio: Fix dependencies for !HAS_IOMEM archs iio: adc: Fix dependencies for !HAS_IOMEM archs iio: inkern: fix a NULL dereference on error iio:adc:ti_am335x_adc Fix buffered mode by identifying as software buffer. iio: light: acpi-als: Report data as processed iio: dac: mcp4725: set iio name property in sysfs iio: add HAS_IOMEM dependency to VF610_ADC iio: add IIO_TRIGGER dependency to STK8BA50 iio: proximity: lidar: correct return value iio-light: Use a signed return type for ltr501_match_samp_freq()
2016-02-06rtlwifi: fix broken VHT supportLarry Finger
When using a 5G-capable device with VHT (802.11ac) rates enabled was not working (packets were not delivered) and the following mac80211 warning was printed: WARNING: CPU: 3 PID: 2253 at net/mac80211/rate.c:625 ieee80211_get_tx_rates+0x22e/0x620 [mac80211]() Modules linked in: rtl8821ae btcoexist rtl_pci rtlwifi fuse drbg ansi_cprng ctr ccm bnep bluetooth af_packet nfs fscache vboxpci(O) vboxnetadp(O) vboxne tflt(O) vboxdrv(O) arc4 snd_hda_codec_generic x86_pkg_temp_thermal rtsx_pci_sdmmc mmc_core rtsx_pci_ms kvm_intel memstick iwlmvm kvm mac80211 snd_hda_intel snd_hda_cod ec snd_hwdep snd_hda_core irqbypass snd_pcm iwlwifi crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 snd_timer lrw gf128mul glue_h elper ablk_helper cryptd snd cfg80211 pcspkr serio_raw e1000e rtsx_pci lpc_ich ptp xhci_pci mfd_core pps_core xhci_hcd soundcore toshiba_acpi thermal sparse_keymap wmi toshiba_bluetooth rfkill acpi_cpufreq battery ac processor dm_mod i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm sr_mod cdrom video button sg autofs4 [last unloaded: rtlwifi] CPU: 3 PID: 2253 Comm: Timer Tainted: G W O 4.5.0-rc1-wl+ #79 Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20 04/17/2014 ffffffffa05c4be6 ffff8802262036d8 ffffffff813d7912 0000000000000000 ffff880226203710 ffffffff8106bcb6 ffff8800c6831300 ffff8800c6831330 0000000000000000 ffff8800c683133c ffff880065923638 ffff880226203720 Call Trace: <IRQ> [<ffffffff813d7912>] dump_stack+0x4b/0x79 [<ffffffff8106bcb6>] warn_slowpath_common+0x86/0xc0 [<ffffffff8106bdaa>] warn_slowpath_null+0x1a/0x20 [<ffffffffa05511ee>] ieee80211_get_tx_rates+0x22e/0x620 [mac80211] [<ffffffffa0782232>] ? rtl_is_special_data+0x32/0x240 [rtlwifi] [<ffffffffa055209e>] ? rate_control_get_rate+0xce/0x150 [mac80211] [<ffffffff810bfc7d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff81071cc5>] ? __local_bh_enable_ip+0x65/0xd0 Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2016-02-06dwc_eth_qos: Reset hardware before PHY startRabin Vincent
The hardware reset is currently done after phy_start() is called, leading to a race where we can lose the link status if the phy state machine calls dwceqos_adjust_link() before we reset the MAC registers. Acked-by: Lars Persson <larper@axis.com> Signed-off-by: Rabin Vincent <rabinv@axis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06ipv6: addrconf: Fix recursive spin lock callsubashab@codeaurora.org
A rcu stall with the following backtrace was seen on a system with forwarding, optimistic_dad and use_optimistic set. To reproduce, set these flags and allow ipv6 autoconf. This occurs because the device write_lock is acquired while already holding the read_lock. Back trace below - INFO: rcu_preempt self-detected stall on CPU { 1} (t=2100 jiffies g=3992 c=3991 q=4471) <6> Task dump for CPU 1: <2> kworker/1:0 R running task 12168 15 2 0x00000002 <2> Workqueue: ipv6_addrconf addrconf_dad_work <6> Call trace: <2> [<ffffffc000084da8>] el1_irq+0x68/0xdc <2> [<ffffffc000cc4e0c>] _raw_write_lock_bh+0x20/0x30 <2> [<ffffffc000bc5dd8>] __ipv6_dev_ac_inc+0x64/0x1b4 <2> [<ffffffc000bcbd2c>] addrconf_join_anycast+0x9c/0xc4 <2> [<ffffffc000bcf9f0>] __ipv6_ifa_notify+0x160/0x29c <2> [<ffffffc000bcfb7c>] ipv6_ifa_notify+0x50/0x70 <2> [<ffffffc000bd035c>] addrconf_dad_work+0x314/0x334 <2> [<ffffffc0000b64c8>] process_one_work+0x244/0x3fc <2> [<ffffffc0000b7324>] worker_thread+0x2f8/0x418 <2> [<ffffffc0000bb40c>] kthread+0xe0/0xec v2: do addrconf_dad_kick inside read lock and then acquire write lock for ipv6_ifa_notify as suggested by Eric Fixes: 7fd2561e4ebdd ("net: ipv6: Add a sysctl to make optimistic addresses useful candidates") Cc: Eric Dumazet <edumazet@google.com> Cc: Erik Kline <ek@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06crypto: marvell/cesa - fix test in mv_cesa_dev_dma_init()Boris BREZILLON
We are checking twice if dma->cache_pool is not NULL but are never testing dma->padding_pool value. Cc: stable@vger.kernel.org Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-02-06crypto: atmel-sha - remove calls of clk_prepare() from atomic contextsCyrille Pitchen
clk_prepare()/clk_unprepare() must not be called within atomic context. This patch calls clk_prepare() once for all from atmel_sha_probe() and clk_unprepare() from atmel_sha_remove(). Then calls of clk_prepare_enable()/clk_disable_unprepare() were replaced by calls of clk_enable()/clk_disable(). Cc: stable@vger.kernel.org Signed-off-by: Cyrille Pitchen <cyrille.pitchen@atmel.com> Reported-by: Matthias Mayr <matthias.mayr@student.kit.edu> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-02-06crypto: atmel-sha - fix atmel_sha_remove()Cyrille Pitchen
Since atmel_sha_probe() uses devm_xxx functions to allocate resources, atmel_sha_remove() should no longer explicitly release them. Cc: stable@vger.kernel.org Signed-off-by: Cyrille Pitchen <cyrille.pitchen@atmel.com> Fixes: b0e8b3417a62 ("crypto: atmel - use devm_xxx() managed function") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-02-06crypto: algif_skcipher - Do not set MAY_BACKLOG on the async pathHerbert Xu
The async path cannot use MAY_BACKLOG because it is not meant to block, which is what MAY_BACKLOG does. On the other hand, both the sync and async paths can make use of MAY_SLEEP. Cc: stable@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-02-06crypto: algif_skcipher - Do not dereference ctx without socket lockHerbert Xu
Any access to non-constant bits of the private context must be done under the socket lock, in particular, this includes ctx->req. This patch moves such accesses under the lock, and fetches the tfm from the parent socket which is guaranteed to be constant, rather than from ctx->req. Cc: stable@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-02-06crypto: algif_skcipher - Do not assume that req is unchangedHerbert Xu
The async path in algif_skcipher assumes that the crypto completion function will be called with the original request. This is not necessarily the case. In fact there is no need for this anyway since we already embed information into the request with struct skcipher_async_req. This patch adds a pointer to that struct and then passes it as the data to the callback function. Cc: stable@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Tadeusz Struk <tadeusz.struk@intel.com>
2016-02-06crypto: user - lock crypto_alg_list on alg dumpMathias Krause
We miss to take the crypto_alg_sem semaphore when traversing the crypto_alg_list for CRYPTO_MSG_GETALG dumps. This allows a race with crypto_unregister_alg() removing algorithms from the list while we're still traversing it, thereby leading to a use-after-free as show below: [ 3482.071639] general protection fault: 0000 [#1] SMP [ 3482.075639] Modules linked in: aes_x86_64 glue_helper lrw ablk_helper cryptd gf128mul ipv6 pcspkr serio_raw virtio_net microcode virtio_pci virtio_ring virtio sr_mod cdrom [last unloaded: aesni_intel] [ 3482.075639] CPU: 1 PID: 11065 Comm: crconf Not tainted 4.3.4-grsec+ #126 [ 3482.075639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 3482.075639] task: ffff88001cd41a40 ti: ffff88001cd422c8 task.ti: ffff88001cd422c8 [ 3482.075639] RIP: 0010:[<ffffffff93722bd3>] [<ffffffff93722bd3>] strncpy+0x13/0x30 [ 3482.075639] RSP: 0018:ffff88001f713b60 EFLAGS: 00010202 [ 3482.075639] RAX: ffff88001f6c4430 RBX: ffff88001f6c43a0 RCX: ffff88001f6c4430 [ 3482.075639] RDX: 0000000000000040 RSI: fefefefefefeff16 RDI: ffff88001f6c4430 [ 3482.075639] RBP: ffff88001f713b60 R08: ffff88001f6c4470 R09: ffff88001f6c4480 [ 3482.075639] R10: 0000000000000002 R11: 0000000000000246 R12: ffff88001ce2aa28 [ 3482.075639] R13: ffff880000093700 R14: ffff88001f5e4bf8 R15: 0000000000003b20 [ 3482.075639] FS: 0000033826fa2700(0000) GS:ffff88001e900000(0000) knlGS:0000000000000000 [ 3482.075639] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3482.075639] CR2: ffffffffff600400 CR3: 00000000139ec000 CR4: 00000000001606f0 [ 3482.075639] Stack: [ 3482.075639] ffff88001f713bd8 ffffffff936ccd00 ffff88001e5c4200 ffff880000093700 [ 3482.075639] ffff88001f713bd0 ffffffff938ef4bf 0000000000000000 0000000000003b20 [ 3482.075639] ffff88001f5e4bf8 ffff88001f5e4848 0000000000000000 0000000000003b20 [ 3482.075639] Call Trace: [ 3482.075639] [<ffffffff936ccd00>] crypto_report_alg+0xc0/0x3e0 [ 3482.075639] [<ffffffff938ef4bf>] ? __alloc_skb+0x16f/0x300 [ 3482.075639] [<ffffffff936cd08a>] crypto_dump_report+0x6a/0x90 [ 3482.075639] [<ffffffff93935707>] netlink_dump+0x147/0x2e0 [ 3482.075639] [<ffffffff93935f99>] __netlink_dump_start+0x159/0x190 [ 3482.075639] [<ffffffff936ccb13>] crypto_user_rcv_msg+0xc3/0x130 [ 3482.075639] [<ffffffff936cd020>] ? crypto_report_alg+0x3e0/0x3e0 [ 3482.075639] [<ffffffff936cc4b0>] ? alg_test_crc32c+0x120/0x120 [ 3482.075639] [<ffffffff93933145>] ? __netlink_lookup+0xd5/0x120 [ 3482.075639] [<ffffffff936cca50>] ? crypto_add_alg+0x1d0/0x1d0 [ 3482.075639] [<ffffffff93938141>] netlink_rcv_skb+0xe1/0x130 [ 3482.075639] [<ffffffff936cc4f8>] crypto_netlink_rcv+0x28/0x40 [ 3482.075639] [<ffffffff939375a8>] netlink_unicast+0x108/0x180 [ 3482.075639] [<ffffffff93937c21>] netlink_sendmsg+0x541/0x770 [ 3482.075639] [<ffffffff938e31e1>] sock_sendmsg+0x21/0x40 [ 3482.075639] [<ffffffff938e4763>] SyS_sendto+0xf3/0x130 [ 3482.075639] [<ffffffff93444203>] ? bad_area_nosemaphore+0x13/0x20 [ 3482.075639] [<ffffffff93444470>] ? __do_page_fault+0x80/0x3a0 [ 3482.075639] [<ffffffff939d80cb>] entry_SYSCALL_64_fastpath+0x12/0x6e [ 3482.075639] Code: 88 4a ff 75 ed 5d 48 0f ba 2c 24 3f c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 d2 48 89 f8 48 89 f9 4c 8d 04 17 48 89 e5 74 15 <0f> b6 16 80 fa 01 88 11 48 83 de ff 48 83 c1 01 4c 39 c1 75 eb [ 3482.075639] RIP [<ffffffff93722bd3>] strncpy+0x13/0x30 To trigger the race run the following loops simultaneously for a while: $ while : ; do modprobe aesni-intel; rmmod aesni-intel; done $ while : ; do crconf show all > /dev/null; done Fix the race by taking the crypto_alg_sem read lock, thereby preventing crypto_unregister_alg() from modifying the algorithm list during the dump. This bug has been detected by the PaX memory sanitize feature. Cc: stable@vger.kernel.org Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: PaX Team <pageexec@freemail.hu> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-02-05Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge fixes from Andrew Morton: "22 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits) epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT radix-tree: fix oops after radix_tree_iter_retry MAINTAINERS: trim the file triggers for ABI/API dax: dirty inode only if required thp: make deferred_split_scan() work again mm: replace vma_lock_anon_vma with anon_vma_lock_read/write ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup um: asm/page.h: remove the pte_high member from struct pte_t mm, hugetlb: don't require CMA for runtime gigantic pages mm/hugetlb: fix gigantic page initialization/allocation mm: downgrade VM_BUG in isolate_lru_page() to warning mempolicy: do not try to queue pages from !vma_migratable() mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress vmstat: make vmstat_update deferrable mm, vmstat: make quiet_vmstat lighter mm/Kconfig: correct description of DEFERRED_STRUCT_PAGE_INIT memblock: don't mark memblock_phys_mem_size() as __init dump_stack: avoid potential deadlocks mm: validate_mm browse_rb SMP race condition m32r: fix build failure due to SMP and MMU ...
2016-02-05Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph fixes from Sage Weil: "We have a few wire protocol compatibility fixes, ports of a few recent CRUSH mapping changes, and a couple error path fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: libceph: MOSDOpReply v7 encoding libceph: advertise support for TUNABLES5 crush: decode and initialize chooseleaf_stable crush: add chooseleaf_stable tunable crush: ensure take bucket value is valid crush: ensure bucket id is valid before indexing buckets array ceph: fix snap context leak in error path ceph: checking for IS_ERR instead of NULL
2016-02-05Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds
Pull drm fixes from Dave Airlie: "Fixes all over the place: - amdkfd: two static checker fixes - mst: a bunch of static checker and spec/hw interaction fixes - amdgpu: fix Iceland hw properly, and some fiji bugs, along with some write-combining fixes. - exynos: some regression fixes - adv7511: fix some EDID reading issues" * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (38 commits) drm/dp/mst: deallocate payload on port destruction drm/dp/mst: Reverse order of MST enable and clearing VC payload table. drm/dp/mst: move GUID storage from mgr, port to only mst branch drm/dp/mst: change MST detection scheme drm/dp/mst: Calculate MST PBN with 31.32 fixed point drm: Add drm_fixp_from_fraction and drm_fixp2int_ceil drm/mst: Add range check for max_payloads during init drm/mst: Don't ignore the MST PBN self-test result drm: fix missing reference counting decrease drm/amdgpu: disable uvd and vce clockgating on Fiji drm/amdgpu: remove exp hardware support from iceland drm/amdgpu: load MEC ucode manually on iceland drm/amdgpu: don't load MEC2 on topaz drm/amdgpu: drop topaz support from gmc8 module drm/amdgpu: pull topaz gmc bits into gmc_v7 drm/amdgpu: The VI specific EXE bit should only apply to GMC v8.0 above drm/amdgpu: iceland use CI based MC IP drm/amdgpu: move gmc7 support out of CIK dependency drm/amdgpu/gfx7: enable cp inst/reg error interrupts drm/amdgpu/gfx8: enable cp inst/reg error interrupts ...
2016-02-05Merge tag 'pm+acpi-4.5-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management and ACPI fixes from Rafael Wysocki: "These are: a fix for a recently introduced false-positive warnings about PM domain pointers being changed inappropriately (harmless but annoying), an MCH size workaround quirk for one more platform, a compiler warning fix (generic power domains framework), an ACPI LPSS (Intel SoCs) driver fixup and a cleanup of the ACPI CPPC core code. Specifics: - PM core fix to avoid false-positive warnings generated when the pm_domain field is cleared for a device that appears to be bound to a driver (Rafael Wysocki). - New MCH size workaround quirk for Intel Haswell-ULT (Josh Boyer). - Fix for an "unused function" compiler warning in the generic power domains framework (Ulf Hansson). - Fixup for the ACPI driver for Intel SoCs (acpi-lpss) to set the PM domain pointer of a device properly in one place that was overlooked by a recent PM core update (Andy Shevchenko). - Removal of a redundant function declaration in the ACPI CPPC core code (Timur Tabi)" * tag 'pm+acpi-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: Avoid false-positive warnings in dev_pm_domain_set() PM / Domains: Silence compiler warning for an unused function ACPI / CPPC: remove redundant mbox_send_message() declaration ACPI / LPSS: set PM domain via helper setter PNP: Add Haswell-ULT to Intel MCH size workaround
2016-02-05epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUTJason Baron
In the current implementation of the EPOLLEXCLUSIVE flag (added for 4.5-rc1), if epoll waiters create different POLL* sets and register them as exclusive against the same target fd, the current implementation will stop waking any further waiters once it finds the first idle waiter. This means that waiters could miss wakeups in certain cases. For example, when we wake up a pipe for reading we do: wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if one epoll set or epfd is added to pipe p with POLLIN and a second set epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the wakeup since the current implementation will stop after it finds any intersection of events with a waiter that is blocked in epoll_wait(). We could potentially address this by requiring all epoll waiters that are added to p be required to pass the same set of POLL* events. IE the first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL* flags to be used by any other epfds that are added as EPOLLEXCLUSIVE. However, I think it might be somewhat confusing interface as we would have to reference count the number of users for that set, and so userspace would have to keep track of that count, or we would need a more involved interface. It also adds some shared state that we'd have store somewhere. I don't think anybody will want to bloat __wait_queue_head for this. I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE such that it can only be specified with EPOLLIN and/or EPOLLOUT. So that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop once we hit the first idle waiter that specifies the EPOLLIN bit, since any remaining waiters that only have 'POLLOUT' set wouldn't need to be woken. Likewise, we can do the same thing if 'POLLOUT' is in the wakeup bit set and not 'POLLIN'. If both 'POLLOUT' and 'POLLIN' are set in the wake bit set (there is at least one example of this I saw in fs/pipe.c), then we just wake the entire exclusive list. Having both 'POLLOUT' and 'POLLIN' both set should not be on any performance critical path, so I think that's ok (in fs/pipe.c its in pipe_release()). We also continue to include EPOLLERR and EPOLLHUP by default in any exclusive set. Thus, the user can specify EPOLLERR and/or EPOLLHUP but is not required to do so. Since epoll waiters may be interested in other events as well besides EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by doing a 'dup' call on the target fd and adding that as one normally would with EPOLL_CTL_ADD. Since I think that the POLLIN and POLLOUT events are what we are interest in balancing, I think that the 'dup' thing could perhaps be added to only one of the waiter threads. However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be sufficient for the majority of use-cases. Since EPOLLEXCLUSIVE is intended to be used with a target fd shared among multiple epfds, where between 1 and n of the epfds may receive an event, it does not satisfy the semantics of EPOLLONESHOT where only 1 epfd would get an event. Thus, it is not allowed to be specified in conjunction with EPOLLEXCLUSIVE. EPOLL_CTL_MOD is also not allowed if the fd was previously added as EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as interesting, but this could be relaxed at some further point. Signed-off-by: Jason Baron <jbaron@akamai.com> Tested-by: Madars Vitolins <m@silodev.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Eric Wong <normalperson@yhbt.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05radix-tree: fix oops after radix_tree_iter_retryKonstantin Khlebnikov
Helper radix_tree_iter_retry() resets next_index to the current index. In following radix_tree_next_slot current chunk size becomes zero. This isn't checked and it tries to dereference null pointer in slot. Tagged iterator is fine because retry happens only at slot 0 where tag bitmask in iter->tags is filled with single bit. Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup") Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ohad Ben-Cohen <ohad@wizery.com> Cc: Jeremiah Mahler <jmmahler@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05MAINTAINERS: trim the file triggers for ABI/APIMichael Kerrisk (man-pages)
Commit ea8f8fc8631 ("MAINTAINERS: add linux-api for review of API/ABI changes") added file triggers for various paths that likely indicated API/ABI changes. However, catching all changes in Documentation/ABI/ and include/uapi/ produces a large volume of mail to linux-api, rather than only API/ABI changes. Drop those two entries, but leave include/linux/syscalls.h and kernel/sys_ni.c to catch syscall-related changes. [josh@joshtriplett.org: redid changelog] Signed-off-by: Michael Kerrisk <mtk.man-pages@gmail.com> Acked-by: Shuah khan <shuahkh@osg.samsung.com> Cc: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05dax: dirty inode only if requiredDmitry Monakhov
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05thp: make deferred_split_scan() work againKirill A. Shutemov
We need to iterate over split_queue, not local empty list to get anything split from the shrinker. Fixes: e3ae19535c66 ("thp: limit number of object to scan on deferred_split_scan()") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm: replace vma_lock_anon_vma with anon_vma_lock_read/writeKonstantin Khlebnikov
Sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if anon_vma appeared between lock and unlock. We have to check anon_vma first or call anon_vma_prepare() to be sure that it's here. There are only few users of these legacy helpers. Let's get rid of them. This patch fixes anon_vma lock imbalance in validate_mm(). Write lock isn't required here, read lock is enough. And reorders expand_downwards/expand_upwards: security_mmap_addr() and wrapping-around check don't have to be under anon vma lock. Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanupxuejiufei
When recovery master down, dlm_do_local_recovery_cleanup() only remove the $RECOVERY lock owned by dead node, but do not clear the refmap bit. Which will make umount thread falling in dead loop migrating $RECOVERY to the dead node. Signed-off-by: xuejiufei <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05um: asm/page.h: remove the pte_high member from struct pte_tNicolai Stange
Commit 16da306849d0 ("um: kill pfn_t") introduced a compile warning for defconfig (SUBARCH=i386): arch/um/kernel/skas/mmu.c:38:206: warning: right shift count >= width of type [-Wshift-count-overflow] Aforementioned patch changes the definition of the phys_to_pfn() macro from ((pfn_t) ((p) >> PAGE_SHIFT)) to ((p) >> PAGE_SHIFT) This effectively changes the phys_to_pfn() expansion's type from unsigned long long to unsigned long. Through the callchain init_stub_pte() => mk_pte(), the expansion of phys_to_pfn() is (indirectly) fed into the 'phys' argument of the pte_set_val(pte, phys, prot) macro, eventually leading to (pte).pte_high = (phys) >> 32; This results in the warning from above. Since UML only deals with 32 bit addresses, the upper 32 bits from 'phys' used to be always zero anyway. Also, all page protection flags defined by UML don't use any bits beyond bit 9. Since the contents of a PTE are defined within architecture scope only, the ->pte_high member can be safely removed. Remove the ->pte_high member from struct pte_t. Rename ->pte_low to ->pte. Adapt the pte helper macros in arch/um/include/asm/page.h. Noteworthy is the pte_copy() macro where a smp_wmb() gets dropped. This write barrier doesn't seem to be paired with any read barrier though and thus, was useless anyway. Fixes: 16da306849d0 ("um: kill pfn_t") Signed-off-by: Nicolai Stange <nicstange@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Richard Weinberger <richard@nod.at> Cc: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm, hugetlb: don't require CMA for runtime gigantic pagesVlastimil Babka
Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime") has added the runtime gigantic page allocation via alloc_contig_range(), making this support available only when CONFIG_CMA is enabled. Because it doesn't depend on MIGRATE_CMA pageblocks and the associated infrastructure, it is possible with few simple adjustments to require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA. After this patch, alloc_contig_range() and related functions are available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows supporting runtime gigantic pages without the CMA-specific checks in page allocator fastpaths. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm/hugetlb: fix gigantic page initialization/allocationMike Kravetz
Attempting to preallocate 1G gigantic huge pages at boot time with "hugepagesz=1G hugepages=1" on the kernel command line will prevent booting with the following: kernel BUG at mm/hugetlb.c:1218! When mapcount accounting was reworked, the setting of compound_mapcount_ptr in prep_compound_gigantic_page was overlooked. As a result, the validation of mapcount in free_huge_page fails. The "BUG_ON" checks in free_huge_page were also changed to "VM_BUG_ON_PAGE" to assist with debugging. Fixes: 53f9263baba69 ("mm: rework mapcount accounting to enable 4k mapping of THPs") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm: downgrade VM_BUG in isolate_lru_page() to warningKirill A. Shutemov
Calling isolate_lru_page() is wrong and shouldn't happen, but it not nessesary fatal: the page just will not be isolated if it's not on LRU. Let's downgrade the VM_BUG_ON_PAGE() to WARN_RATELIMIT(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mempolicy: do not try to queue pages from !vma_migratable()Kirill A. Shutemov
Maybe I miss some point, but I don't see a reason why we try to queue pages from non migratable VMAs. This testcase steps on VM_BUG_ON_PAGE() in isolate_lru_page(): #include <fcntl.h> #include <unistd.h> #include <stdio.h> #include <sys/mman.h> #include <numaif.h> #define SIZE 0x2000 int foo; int main() { int fd; char *p; unsigned long mask = 2; fd = open("/dev/sg0", O_RDWR); p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0); /* Faultin pages */ foo = p[0] + p[0x1000]; mbind(p, SIZE, MPOL_BIND, &mask, 4, MPOL_MF_MOVE | MPOL_MF_STRICT); return 0; } The only case when we can queue pages from such VMA is MPOL_MF_STRICT plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL for VMA which has pages on LRU, but gfp mask is not sutable for migaration (see mapping_gfp_mask() check in vma_migratable()). That's looks like a bug to me. Let's filter out non-migratable vma at start of queue_pages_test_walk() and go to queue_pages_pte_range() only if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL flag is set. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progressTetsuo Handa
Jan Stancek has reported that system occasionally hanging after "oom01" testcase from LTP triggers OOM. Guessing from a result that there is a kworker thread doing memory allocation and the values between "Node 0 Normal free:" and "Node 0 Normal:" differs when hanging, vmstat is not up-to-date for some reason. According to commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress"), it meant to force the kworker thread to take a short sleep, but it by error used schedule_timeout(1). We missed that schedule_timeout() in state TASK_RUNNING doesn't do anything. Fix it by using schedule_timeout_uninterruptible(1) which forces the kworker thread to take a short sleep in order to make sure that vmstat is up-to-date. Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Jan Stancek <jstancek@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Cristopher Lameter <clameter@sgi.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Arkadiusz Miskiewicz <arekm@maven.pl> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05vmstat: make vmstat_update deferrableMichal Hocko
Commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again and shut down on idle") made vmstat_shepherd deferrable. vmstat_update itself is still useing standard timer which might interrupt idle task. This is possible because "mm, vmstat: make quiet_vmstat lighter" removed cancel_delayed_work from the quiet_vmstat. Change vmstat_work to use DEFERRABLE_WORK to prevent from pointless wakeups from the idle context. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm, vmstat: make quiet_vmstat lighterMichal Hocko
Mike has reported a considerable overhead of refresh_cpu_vm_stats from the idle entry during pipe test: 12.89% [kernel] [k] refresh_cpu_vm_stats.isra.12 4.75% [kernel] [k] __schedule 4.70% [kernel] [k] mutex_unlock 3.14% [kernel] [k] __switch_to This is caused by commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again and shut down on idle") which has placed quiet_vmstat into cpu_idle_loop. The main reason here seems to be that the idle entry has to get over all zones and perform atomic operations for each vmstat entry even though there might be no per cpu diffs. This is a pointless overhead for _each_ idle entry. Make sure that quiet_vmstat is as light as possible. First of all it doesn't make any sense to do any local sync if the current cpu is already set in oncpu_stat_off because vmstat_update puts itself there only if there is nothing to do. Then we can check need_update which should be a cheap way to check for potential per-cpu diffs and only then do refresh_cpu_vm_stats. The original patch also did cancel_delayed_work which we are not doing here. There are two reasons for that. Firstly cancel_delayed_work from idle context will blow up on RT kernels (reported by Mike): CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.5.0-rt3 #7 Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013 Call Trace: dump_stack+0x49/0x67 ___might_sleep+0xf5/0x180 rt_spin_lock+0x20/0x50 try_to_grab_pending+0x69/0x240 cancel_delayed_work+0x26/0xe0 quiet_vmstat+0x75/0xa0 cpu_idle_loop+0x38/0x3e0 cpu_startup_entry+0x13/0x20 start_secondary+0x114/0x140 And secondly, even on !RT kernels it might add some non trivial overhead which is not necessary. Even if the vmstat worker wakes up and preempts idle then it will be most likely a single shot noop because the stats were already synced and so it would end up on the oncpu_stat_off anyway. We just need to teach both vmstat_shepherd and vmstat_update to stop scheduling the worker if there is nothing to do. [mgalbraith@suse.de: cancel pending work of the cpu_stat_off CPU] Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>