Age  Commit message  Author
2018-03-31  net: thunderx: add multicast filter management support  Vadim Lomovtsev
The ThunderX NIC can be partitioned into up to 128 VFs, each presented to the system as a network interface. Each VF is mapped to a BGX:LMAC pair and is configured by the kernel individually. Several VFs can end up mapped onto the same BGX:LMAC pair, which can result in multiple multicast filter configuration requests to that LMAC with the same MAC addresses.

This commit adds the ThunderX NIC BGX filter manipulation routines.

Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31  net: thunderx: add MAC address filter tracking for LMAC  Vadim Lomovtsev
The ThunderX NIC has two Ethernet interfaces (BGX), each of which can have up to four Logical MACs (LMACs) configured. Each BGX has 32 filters available for filtering ingress packets. The number of filters available to a particular LMAC ranges from 8 (if four LMACs are configured per BGX) up to 32 (if only one LMAC is configured per BGX).

At the same time, the NIC can present up to 128 VFs to the OS as network interfaces, and the kernel configures each of them with its own set of MAC addresses for filtering. To prevent duplicates in the BGX filter registers coming from different network interfaces, all filter configuration requests must be cached and tracked before they are applied to the BGX filter registers.

This commit updates the LMAC structures with control fields for allocating and releasing the filter tracking list, and implements per-LMAC allocation and release of the dmac array.

Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
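The caching idea in this commit message can be illustrated with a small user-space sketch. It is not the driver's code: the structure, the function names and the even split of the 32 filters between LMACs are assumptions made for illustration, and the real driver may divide filters differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BGX_FILTERS_TOTAL 32 /* filters per BGX, per the commit message */
#define MAX_LMACS_PER_BGX 4  /* LMACs per BGX, per the commit message */

/* Hypothetical per-LMAC tracking list; the driver's fields may differ. */
struct lmac_filter_cache {
	uint64_t dmacs[BGX_FILTERS_TOTAL]; /* cached destination MACs */
	size_t used;                       /* entries currently cached */
	size_t capacity;                   /* this LMAC's share of filters */
};

/* Assumption: the BGX filters are split evenly between configured LMACs. */
static size_t filters_per_lmac(unsigned int lmac_count)
{
	if (lmac_count == 0 || lmac_count > MAX_LMACS_PER_BGX)
		return 0;
	return BGX_FILTERS_TOTAL / lmac_count;
}

static void cache_init(struct lmac_filter_cache *c, unsigned int lmac_count)
{
	assert(c != NULL);
	c->used = 0;
	c->capacity = filters_per_lmac(lmac_count);
}

/*
 * Returns true only when the address is new and there is room for it,
 * i.e. when a hardware filter register would actually need to be written.
 * Duplicate requests from other VFs on the same LMAC return false.
 */
static bool cache_add(struct lmac_filter_cache *c, uint64_t dmac)
{
	assert(c != NULL);
	for (size_t i = 0; i < c->used; i++)
		if (c->dmacs[i] == dmac)
			return false; /* already tracked */
	if (c->used >= c->capacity)
		return false; /* no free filter left for this LMAC */
	c->dmacs[c->used++] = dmac;
	return true;
}
```

With four LMACs per BGX each cache holds 8 addresses; a second request for an address that is already cached is dropped before it reaches the filter registers.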
2018-03-31  net: thunderx: move filter register related macro into proper place  Vadim Lomovtsev
The ThunderX NIC has a set of registers that configure the filter policy for ingress packets. There are three possible filtering modes for multicasts, broadcasts and unicasts: accept all, reject all, and accept only what the filter allows. The current implementation has an enum with all of them and two generic macros, one for enabling filtering at all (CAM_ACCEPT) and one for enabling/disabling broadcast packets, which should also be corrected in order to represent the register bits properly. All these values are private to the driver and there is no need to 'publish' them via a header file.

This commit moves the filter register manipulation values from the header file into the source file, explicitly assigning the exact register values to be used when configuring the registers.

Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31  Merge branch 'meson8b'  David S. Miller
Martin Blumenstingl says:

====================
Meson8m2 support for dwmac-meson8b

The Meson8m2 SoC is an updated version of the Meson8 SoC. Some of the peripherals are shared with Meson8b (for example the watchdog registers and the internal temperature sensor calibration procedure). Meson8m2 also seems to include the same Gigabit MAC register layout as Meson8b.

The registers in the Amlogic dwmac "glue" seem identical between Meson8b and Meson8m2. Manual testing seems to confirm this.

To be extra-safe a new compatible string is added because there's no (public) documentation on the Meson8m2 SoC. This will allow us to implement any SoC-specific variations later on (if needed).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31  net: stmmac: dwmac-meson8b: Add support for the Meson8m2 SoC  Martin Blumenstingl
The Meson8m2 SoC uses a similar (potentially even identical) register layout as the Meson8b and GXBB SoCs for the dwmac glue. Add a new compatible string and update the module description to indicate support for these SoCs.

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31  dt-bindings: net: meson-dwmac: add support for the Meson8m2 SoC  Martin Blumenstingl
The Meson8m2 SoC uses a similar (potentially even identical) register layout for the dwmac glue as Meson8b and GXBB. Unfortunately there is no documentation available. Testing shows that both RMII and RGMII PHYs work if they are configured as on Meson8b. Add a new compatible string to the documentation so that differences (if there are any) between Meson8m2 and the other SoCs can be taken care of within the driver.

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31  Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds
Pull perf fixes from Ingo Molnar:
 "Two fixlets"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/hwbp: Simplify the perf-hwbp code, fix documentation
  perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs
2018-03-31  Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds
Pull x86 fixes from Ingo Molnar:
 "Two UV platform fixes, and a kbuild fix"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/platform/UV: Fix critical UV MMR address error
  x86/platform/uv/BAU: Add APIC idt entry
  x86/purgatory: Avoid creating stray .<pid>.d files, remove -MD from KBUILD_CFLAGS
2018-03-31  Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds
Pull x86 PTI fixes from Ingo Molnar:
 "Two fixes: a relatively simple objtool fix that makes Clang built kernels work with ORC debug info, plus an alternatives macro fix"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/alternatives: Fixup alternative_call_2
  objtool: Add Clang support
2018-03-31  perf/x86/intel: Enable C-state residency events for Cannon Lake  Harry Pan
Cannon Lake supports C1/C3/C6/C7 and PC2/PC3/PC6/PC7/PC8/PC9/PC10 state residency counters; this patch enables those counters.

( The MSR information is based on Intel Software Developers' Manual, Vol. 4, Order No. 335592. )

Tested-by: Puthikorn Voravootivat <puthik@chromium.org>
Signed-off-by: Harry Pan <harry.pan@intel.com>
Reviewed-by: Benson Leung <bleung@chromium.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan.liang@intel.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: gs0622@gmail.com
Link: http://lkml.kernel.org/r/20180309121549.630-3-harry.pan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-31  perf/x86/intel: Add Cannon Lake support for RAPL profiling  Harry Pan
This patch enables RAPL counters (energy consumption counters) support for Cannon Lake processors.

( ESU and power domains refer to Intel Software Developers' Manual, Vol. 4, Order No. 335592. )

Usage example:

  $ perf list
  $ perf stat -a -e power/energy-cores/,power/energy-pkg/ sleep 10

Tested-by: Puthikorn Voravootivat <puthik@chromium.org>
Signed-off-by: Harry Pan <harry.pan@intel.com>
Reviewed-by: Benson Leung <bleung@chromium.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: colin.king@canonical.com
Cc: gs0622@gmail.com
Cc: kan.liang@linux.intel.com
Link: http://lkml.kernel.org/r/20180309121549.630-2-harry.pan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-31  ACPI / PM: Fix keyboard wakeup from suspend-to-idle on ASUS UX331UA  Chris Chiu
This issue happens on the new ASUS laptop UX331UA, which has a modern standby mode (suspend-to-idle). Pressing keys on the PS2 keyboard can't wake up the system from suspend-to-idle, which is not expected. However, pressing the power button wakes it up without problems.

Per the engineers of ASUS, the keypress event is routed to the Embedded Controller (EC) in standby mode. The EC then signals the SCI event to the BIOS so that the BIOS would Notify() the power button to wake up the system. That is the BIOS perspective. What we observe here is that the kernel receives the SCI event from the SCI interrupt handler, which informs it that the GPE status bit belonging to the EC needs to be handled, and then queries the EC to find out what event is pending. It then executes the following ACPI _QDF method, which is defined in the ACPI DSDT for the EC to notify the power button:

    Method (_QDF, 0, NotSerialized)  // _Qxx: EC Query
    {
        Notify (PWRB, 0x80) // Status Change
    }

With more debug messages added to analyze this problem, we find that the keypress does wake up the system from suspend-to-idle, but it goes back to suspend again almost immediately. As we see in the following messages, acpi_button_notify() is invoked but acpi_pm_wakeup_event() cannot really wake up the system here because acpi_s2idle_wakeup() is false. acpi_s2idle_wakeup() returned false because acpi_s2idle_sync() had already exited.

    [ 52.987048] s2idle_loop going s2idle
    [ 59.713392] acpi_s2idle_wake enter
    [ 59.713394] acpi_s2idle_wake exit
    [ 59.760888] acpi_ev_gpe_detect enter
    [ 59.760893] acpi_s2idle_sync enter
    [ 59.760893] acpi_ec_query_flushed ec pending queries 0
    [ 59.760953] Read registers for GPE 50-57: Status=01, Enable=01, RunEnable=01, WakeEnable=00
    [ 59.760955] ACPI: EC: ===== IRQ (1) =====
    [ 59.760972] ACPI: EC: EC_SC(R) = 0x28 SCI_EVT=1 BURST=0 CMD=1 IBF=0 OBF=0
    [ 59.760979] ACPI: EC: +++++ Polling enabled +++++
    [ 59.760979] ACPI: EC: ##### Command(QR_EC) submitted/blocked #####
    [ 59.761003] acpi_s2idle_sync exit
    [ 59.769587] ACPI: EC: ##### Query(0xdf) started #####
    [ 59.769611] ACPI: EC: ##### Query(0xdf) stopped #####
    [ 59.774154] acpi_button_notify button type 1
    [ 59.813175] s2idle_loop going s2idle

acpi_s2idle_sync() already makes an effort to flush the EC event queue, but in this case, the EC event has yet to be generated when the call to acpi_ec_flush_work() is made. The event is generated shortly after, through the ongoing handling of the SCI interrupt which is happening on another CPU, and we must synchronize that to make sure that it has run and completed.

Adding another call to acpi_os_wait_events_complete() solves this issue, since that function synchronizes with SCI interrupt completion.

Signed-off-by: Chris Chiu <chiu@endlessm.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-31  x86/cpu/tme: Fix spelling: "configuation" -> "configuration"  Colin Ian King
Trivial fix to spelling mistake in the pr_err_once() error message text.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Link: http://lkml.kernel.org/r/20180313154709.1015-1-colin.king@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-31  x86/build: Don't pass in -D__KERNEL__ multiple times  Cao jin
Some .<target>.cmd files under arch/x86 are showing two instances of -D__KERNEL__, like arch/x86/boot/ and arch/x86/realmode/rm/.

__KERNEL__ is already defined in KBUILD_CPPFLAGS in the top Makefile, so it can be dropped safely.

Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kbuild@vger.kernel.org
Link: http://lkml.kernel.org/r/20180316084944.3997-1-caoj.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-31  locking/Kconfig: Restructure the lock debugging menu  Waiman Long
The two config options in the lock debugging menu that are probably the most frequently used, as far as I am concerned, are PROVE_LOCKING and LOCK_STAT. From a UI perspective, they should be front and center. So these two options are now moved to the top of the lock debugging menu.

The DEBUG_WW_MUTEX_SLOWPATH option is also added to the PROVE_LOCKING umbrella.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1522445280-7767-4-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-31  locking/Kconfig: Add LOCK_DEBUGGING_SUPPORT to make it more readable  Waiman Long
There are a couple of lock debugging Kconfig options that depend on the following support options:

 - TRACE_IRQFLAGS_SUPPORT
 - STACKTRACE_SUPPORT
 - LOCKDEP_SUPPORT

That makes those lock debugging options harder to read and understand. So a new LOCK_DEBUGGING_SUPPORT option is added that is equivalent to the above three options together. That makes the Kconfig.debug file more readable.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1522445280-7767-3-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-31  locking/rwsem: Add DEBUG_RWSEMS to look for lock/unlock mismatches  Waiman Long
For a rwsem, locking can be either exclusive or shared. The corresponding exclusive or shared unlock must be used. Otherwise, the protected data structures may get corrupted or the lock may be left in an inconsistent state.

In order to detect such anomalies, a new configuration option DEBUG_RWSEMS is added which can be enabled to look for such mismatches and print warnings when that happens.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1522445280-7767-2-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
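The kind of mismatch this option looks for can be modelled in user space. The sketch below is a single-threaded toy, not the kernel's rwsem and not how DEBUG_RWSEMS is implemented: all `toy_` names are hypothetical, there is no blocking, and a "warning" is just a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How the toy semaphore is currently held. */
enum toy_mode { TOY_UNLOCKED, TOY_SHARED, TOY_EXCLUSIVE };

struct toy_rwsem {
	enum toy_mode mode;
	unsigned int readers;    /* number of shared holders */
	unsigned int mismatches; /* stands in for printed warnings */
};

static void toy_init(struct toy_rwsem *s)
{
	assert(s != NULL);
	s->mode = TOY_UNLOCKED;
	s->readers = 0;
	s->mismatches = 0;
}

/* Shared lock; a real rwsem would block while a writer holds it. */
static void toy_down_read(struct toy_rwsem *s)
{
	assert(s->mode != TOY_EXCLUSIVE);
	s->mode = TOY_SHARED;
	s->readers++;
}

/* Exclusive lock; a real rwsem would block until it is free. */
static void toy_down_write(struct toy_rwsem *s)
{
	assert(s->mode == TOY_UNLOCKED);
	s->mode = TOY_EXCLUSIVE;
}

/* Shared unlock: flags a mismatch unless the lock is held shared. */
static bool toy_up_read(struct toy_rwsem *s)
{
	if (s->mode != TOY_SHARED) {
		s->mismatches++;
		return false;
	}
	if (--s->readers == 0)
		s->mode = TOY_UNLOCKED;
	return true;
}

/* Exclusive unlock: flags a mismatch unless the lock is held exclusive. */
static bool toy_up_write(struct toy_rwsem *s)
{
	if (s->mode != TOY_EXCLUSIVE) {
		s->mismatches++;
		return false;
	}
	s->mode = TOY_UNLOCKED;
	return true;
}
```

Calling `toy_up_read()` after `toy_down_write()` leaves the lock untouched and bumps `mismatches`, which is the situation the commit message describes as a lock/unlock mismatch.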
2018-03-31  Merge branch 'linus' into locking/core, to pick up fixes  Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-30  Merge tag 'kbuild-fixes-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild  Linus Torvalds
Pull Kbuild fixes from Masahiro Yamada:

 - fix missed rebuild of TRIM_UNUSED_KSYMS
 - fix rpm-pkg for GNU tar >= 1.29
 - include scripts/dtc/include-prefixes/* to kernel header deb-pkg
 - add -no-integrated-as option earlier to fix building with Clang
 - fix netfilter Makefile for parallel building

* tag 'kbuild-fixes-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  netfilter: nf_nat_snmp_basic: add correct dependency to Makefile
  kbuild: rpm-pkg: Support GNU tar >= 1.29
  builddeb: Fix header package regarding dtc source links
  kbuild: set no-integrated-as before incl. arch Makefile
  kbuild: make scripts/adjust_autoksyms.sh robust against timestamp races
2018-03-30  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  Linus Torvalds
Pull networking fixes from David Miller:

 1) Fix RCU locking in xfrm_local_error(), from Taehee Yoo.
 2) Fix return value assignments and thus error checking in iwl_mvm_start_ap_ibss(), from Johannes Berg.
 3) Don't count header length twice in vti4, from Stefano Brivio.
 4) Fix deadlock in rt6_age_examine_exception, from Eric Dumazet.
 5) Fix out-of-bounds access in nf_sk_lookup_slow{v4,v6}() from Subash Abhinov.
 6) Check nladdr size in netlink_connect(), from Alexander Potapenko.
 7) VF representor SQ numbers are 32 not 16 bits, in mlx5 driver, from Or Gerlitz.
 8) Out of bounds read in skb_network_protocol(), from Eric Dumazet.
 9) r8169 driver sets driver data pointer after register_netdev() which is too late. Fix from Heiner Kallweit.
 10) Fix memory leak in mlx4 driver, from Moshe Shemesh.
 11) The multi-VLAN decap fix added a regression when dealing with device that lack a MAC header, such as tun. Fix from Toshiaki Makita.
 12) Fix integer overflow in dynamic interrupt coalescing code. From Tal Gilboa.
 13) Use after free in vrf code, from David Ahern.
 14) IPV6 route leak between VRFs fix, also from David Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits)
  net: mvneta: fix enable of all initialized RXQs
  net/ipv6: Fix route leaking between VRFs
  vrf: Fix use after free and double free in vrf_finish_output
  ipv6: sr: fix seg6 encap performances with TSO enabled
  net/dim: Fix int overflow
  vlan: Fix vlan insertion for packets without ethernet header
  net: Fix untag for vlan packets without ethernet header
  atm: iphase: fix spelling mistake: "Receiverd" -> "Received"
  vhost: validate log when IOTLB is enabled
  qede: Do not drop rx-checksum invalidated packets.
  hv_netvsc: enable multicast if necessary
  ip_tunnel: Resolve ipsec merge conflict properly.
  lan78xx: Crash in lan78xx_writ_reg (Workqueue: events lan78xx_deferred_multicast_write)
  qede: Fix barrier usage after tx doorbell write.
  vhost: correctly remove wait queue during poll failure
  net/mlx4_core: Fix memory leak while delete slave's resources
  net/mlx4_en: Fix mixed PFC and Global pause user control requests
  net/smc: use announced length in sock_recvmsg()
  llc: properly handle dev_queue_xmit() return value
  strparser: Fix sign of err codes
  ...
2018-03-31  kbuild: get <linux/compiler_types.h> out of <linux/kconfig.h>  Masahiro Yamada
Since commit 28128c61e08e ("kconfig.h: Include compiler types to avoid missed struct attributes"), <linux/kconfig.h> pulls in kernel-space headers to unrelated places.

Commit 0f9da844d877 ("MIPS: boot: Define __ASSEMBLY__ for its.S build") suppressed the build error by defining __ASSEMBLY__, but ITS (i.e. DTS) is not assembly, and should not include <linux/compiler_types.h> in the first place.

Looking at arch/s390/tools/Makefile, host programs gen_facilities and gen_opcode_table now pull in <linux/compiler_types.h> as well.

The motivation for that commit was to define necessary attributes before any struct is defined. Obviously, this happens only in C.

It is enough to include <linux/compiler_types.h> only when compiling C files, and only when compiling kernel space. Move the include to c_flags.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-03-31  Merge branch 'bpf-cgroup-bind-connect'  Daniel Borkmann
Andrey Ignatov says:

====================
v2->v3:
  - rebase due to conflicts
  - fix ipv6=m build

v1->v2:
  - support expected_attach_type at prog load time so that prog (incl. context accesses and calls to helpers) can be validated with regard to specific attach point it is supposed to be attached to. Later, at attach time, attach type is checked so that it must be same as at load time if it was provided
  - reworked hooks to rely on expected_attach_type, and reduced number of new prog types from 6 to just 1: BPF_PROG_TYPE_CGROUP_SOCK_ADDR
  - reused BPF_PROG_TYPE_CGROUP_SOCK for sys_bind post-hooks
  - add selftests for post-sys_bind hook

For our container management we've been using a complicated and fragile setup consisting of an LD_PRELOAD wrapper intercepting bind and connect calls from all containerized applications. Unfortunately it doesn't work for apps that don't use glibc, and changing all applications that run in the datacenter is not possible due to 3rd party code and libraries (despite being open source code) and the sheer amount of legacy code that has to be rewritten (we're rewriting what we can in parallel).

These applications are written without containers in mind and have builtin assumptions about network services. For example, an application X expects to connect to localhost:special_port and find service Y there. To move application X and service Y into two different containers, the LD_PRELOAD approach is used to help one service connect to another without rewriting them.

Moving these two applications into different L2 (netns) or L3 (vrf) network isolation scopes doesn't help to solve the problem, since the applications need to see each other as if they were running on the host without containers. So if app X and app Y ran in different netns, something would need to punch a connectivity hole in those namespaces. That would be a real layering violation (with corresponding network debugging pains), since a clean l2, l3 abstraction would suddenly support something that breaks through the layers.

Instead we used LD_PRELOAD (and now bpf programs) at bind/connect time to help applications discover and connect to each other. All applications are running in the init netns and there are no vrfs. After bind/connect the normal fib/neighbor core networking logic works as it always should, and the whole system is clean from a network point of view and can be debugged with standard tools.

We also considered resurrecting Hannes's afnetns work, but all hierarchical namespace abstractions don't work due to these builtin networking assumptions inside the apps. To run an application that was not written with containers in mind inside a cgroup container, we have to create the illusion of running in a non-containerized environment.

In some cases we remember the port and container id in the post-bind hook in a bpf map, and when some other task in a different container is trying to connect to a service we need to know where this service is running. It can be remote and can be local. Both client and service may or may not be written with containers in mind, and this sockaddr rewrite provides the connectivity and load balancing feature.

BPF+cgroup looks to be the best solution for this problem. Hence we introduce 3 hooks:
  - at entry into sys_bind and sys_connect to let bpf prog look and modify 'struct sockaddr' provided by user space and fail bind/connect when appropriate
  - post sys_bind after port is allocated

The approach works great and has zero overhead for anyone who doesn't use it and very low overhead when deployed.

A different use case for this feature is a low overhead firewall that doesn't need to inspect all packets and works at bind/connect time.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31  selftests/bpf: Selftest for sys_bind post-hooks.  Andrey Ignatov
Add selftest for attach types `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND`.

The main things tested are:
* prog load behaves as expected (valid/invalid accesses in prog);
* prog attach behaves as expected (load- vs attach-time attach types);
* `BPF_CGROUP_INET_SOCK_CREATE` can be attached in a backward compatible way;
* post-hooks return expected result and errno.

Example:
  # ./test_sock
  Test case: bind4 load with invalid access: src_ip6 .. [PASS]
  Test case: bind4 load with invalid access: mark .. [PASS]
  Test case: bind6 load with invalid access: src_ip4 .. [PASS]
  Test case: sock_create load with invalid access: src_port .. [PASS]
  Test case: sock_create load w/o expected_attach_type (compat mode) .. [PASS]
  Test case: sock_create load w/ expected_attach_type .. [PASS]
  Test case: attach type mismatch bind4 vs bind6 .. [PASS]
  Test case: attach type mismatch bind6 vs bind4 .. [PASS]
  Test case: attach type mismatch default vs bind4 .. [PASS]
  Test case: attach type mismatch bind6 vs sock_create .. [PASS]
  Test case: bind4 reject all .. [PASS]
  Test case: bind6 reject all .. [PASS]
  Test case: bind6 deny specific IP & port .. [PASS]
  Test case: bind4 allow specific IP & port .. [PASS]
  Test case: bind4 allow all .. [PASS]
  Test case: bind6 allow all .. [PASS]
  Summary: 16 PASSED, 0 FAILED

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
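The allow/deny cases in this selftest ("reject all", "allow specific IP & port") can be pictured with a plain user-space model. This is not a BPF program and not the selftest's code: `struct bound_sock`, `post_bind_policy()` and `finish_bind()` are hypothetical names, and fields are kept in host byte order for simplicity. The one fact taken from the "Post-hooks for sys_bind" entry in this log is that a rejected bind is reported to the caller as EPERM; the 1 = allow, 0 = deny convention is the usual one for cgroup BPF programs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of what a post-bind hook can inspect. */
struct bound_sock {
	uint32_t src_ip4;  /* bound IPv4 address, host byte order here */
	uint16_t src_port; /* allocated port, host byte order here */
};

/*
 * Example policy: allow the bind only if the allocated port lies in
 * [min_port, max_port]. Returns 1 to allow and 0 to deny.
 */
static int post_bind_policy(const struct bound_sock *sk,
			    uint16_t min_port, uint16_t max_port)
{
	assert(sk != NULL);
	return sk->src_port >= min_port && sk->src_port <= max_port;
}

/*
 * Model of the tail of sys_bind: the address and port are already
 * allocated, and the hook's verdict decides the return value.
 */
static int finish_bind(const struct bound_sock *sk,
		       uint16_t min_port, uint16_t max_port)
{
	return post_bind_policy(sk, min_port, max_port) ? 0 : -EPERM;
}
```

A socket bound to port 4444 passes a 4000..5000 policy, while one bound to port 80 makes `finish_bind()` return `-EPERM`.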
2018-03-31  bpf: Post-hooks for sys_bind  Andrey Ignatov
"Post-hooks" are hooks that are called right before returning from sys_bind. At this time the IP and port are already allocated and no further changes to `struct sock` can happen before returning from sys_bind, but a BPF program has a chance to inspect the socket and change the sys_bind result.

Specifically it can e.g. inspect what port was allocated and, if it doesn't satisfy some policy, the BPF program can force sys_bind to fail and return EPERM to the user.

Another example of usage is recording the IP:port pair in some map to use it in later calls to sys_connect. E.g. if some TCP server inside a cgroup was bound to some IP:port_n, it can be recorded in a map. Later, when some TCP client inside the same cgroup tries to connect to 127.0.0.1:port_n, the BPF hook for sys_connect can override the destination and connect the application to IP:port_n instead of 127.0.0.1:port_n. That helps force all applications inside a cgroup to use the desired IP without breaking those applications if they e.g. use localhost to communicate with each other.

== Implementation details ==

Post-hooks are implemented as two new attach types `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for the existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.

Separate attach types for IPv4 and IPv6 are introduced to avoid access to IPv6 fields in `struct sock` from `inet_bind()` and to IPv4 fields from `inet6_bind()`, since those fields might not make sense in such cases.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31  selftests/bpf: Selftest for sys_connect hooks  Andrey Ignatov
Add selftest for BPF_CGROUP_INET4_CONNECT and BPF_CGROUP_INET6_CONNECT attach types.

Try to connect(2) to specified IP:port and test that:
* remote IP:port pair is overridden;
* local end of connection is bound to specified IP.

All combinations of IPv4/IPv6 and TCP/UDP are tested.

Example:
  # tcpdump -pn -i lo -w connect.pcap 2>/dev/null &
  [1] 478
  # strace -qqf -e connect -o connect.trace ./test_sock_addr.sh
  Wait for testing IPv4/IPv6 to become available ... OK
  Load bind4 with invalid type (can pollute stderr) ... REJECTED
  Load bind4 with valid type ... OK
  Attach bind4 with invalid type ... REJECTED
  Attach bind4 with valid type ... OK
  Load connect4 with invalid type (can pollute stderr) libbpf: load bpf \
    program failed: Permission denied
  libbpf: -- BEGIN DUMP LOG ---
  libbpf:
  0: (b7) r2 = 23569
  1: (63) *(u32 *)(r1 +24) = r2
  2: (b7) r2 = 16777343
  3: (63) *(u32 *)(r1 +4) = r2
  invalid bpf_context access off=4 size=4
  [ 1518.404609] random: crng init done

  libbpf: -- END LOG --
  libbpf: failed to load program 'cgroup/connect4'
  libbpf: failed to load object './connect4_prog.o'
  ... REJECTED
  Load connect4 with valid type ... OK
  Attach connect4 with invalid type ... REJECTED
  Attach connect4 with valid type ... OK
  Test case #1 (IPv4/TCP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
          Requested: connect(192.168.1.254, 4040) from (*, *) ..
             Actual: connect(127.0.0.1, 4444) from (127.0.0.4, 56068)
  Test case #2 (IPv4/UDP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
          Requested: connect(192.168.1.254, 4040) from (*, *) ..
             Actual: connect(127.0.0.1, 4444) from (127.0.0.4, 56447)
  Load bind6 with invalid type (can pollute stderr) ... REJECTED
  Load bind6 with valid type ... OK
  Attach bind6 with invalid type ... REJECTED
  Attach bind6 with valid type ... OK
  Load connect6 with invalid type (can pollute stderr) libbpf: load bpf \
    program failed: Permission denied
  libbpf: -- BEGIN DUMP LOG ---
  libbpf:
  0: (b7) r6 = 0
  1: (63) *(u32 *)(r1 +12) = r6
  invalid bpf_context access off=12 size=4

  libbpf: -- END LOG --
  libbpf: failed to load program 'cgroup/connect6'
  libbpf: failed to load object './connect6_prog.o'
  ... REJECTED
  Load connect6 with valid type ... OK
  Attach connect6 with invalid type ... REJECTED
  Attach connect6 with valid type ... OK
  Test case #3 (IPv6/TCP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
          Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *)
             Actual: connect(::1, 6666) from (::6, 37458)
  Test case #4 (IPv6/UDP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
          Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *)
             Actual: connect(::1, 6666) from (::6, 39315)
  ### SUCCESS

  # egrep 'connect\(.*AF_INET' connect.trace | \
  > egrep -vw 'htons\(1025\)' | fold -b -s -w 72
  502   connect(7, {sa_family=AF_INET, sin_port=htons(4040),
  sin_addr=inet_addr("192.168.1.254")}, 128) = 0
  502   connect(8, {sa_family=AF_INET, sin_port=htons(4040),
  sin_addr=inet_addr("192.168.1.254")}, 128) = 0
  502   connect(9, {sa_family=AF_INET6, sin6_port=htons(6060),
  inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
  sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
  502   connect(10, {sa_family=AF_INET6, sin6_port=htons(6060),
  inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
  sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0

  # fg
  tcpdump -pn -i lo -w connect.pcap 2> /dev/null
  # tcpdump -r connect.pcap -n tcp | cut -c 1-72
  reading from file connect.pcap, link-type EN10MB (Ethernet)
  17:57:40.383533 IP 127.0.0.4.56068 > 127.0.0.1.4444: Flags [S], seq 1333
  17:57:40.383566 IP 127.0.0.1.4444 > 127.0.0.4.56068: Flags [S.], seq 112
  17:57:40.383589 IP 127.0.0.4.56068 > 127.0.0.1.4444: Flags [.], ack 1, w
  17:57:40.384578 IP 127.0.0.1.4444 > 127.0.0.4.56068: Flags [R.], seq 1,
  17:57:40.403327 IP6 ::6.37458 > ::1.6666: Flags [S], seq 406513443, win
  17:57:40.403357 IP6 ::1.6666 > ::6.37458: Flags [S.], seq 2448389240, ac
  17:57:40.403376 IP6 ::6.37458 > ::1.6666: Flags [.], ack 1, win 342, opt
  17:57:40.404263 IP6 ::1.6666 > ::6.37458: Flags [R.], seq 1, ack 1, win

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31  bpf: Hooks for sys_connect  Andrey Ignatov
== The problem ==

See description of the problem in the initial patch of this patch set.

== The solution ==

The patch provides a much more reliable in-kernel solution for the 2nd part of the problem: making outgoing connections from the desired IP.

It adds new attach types `BPF_CGROUP_INET4_CONNECT` and `BPF_CGROUP_INET6_CONNECT` for program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both source and destination of a connection at connect(2) time.

The local end of a connection can be bound to the desired IP using the newly introduced BPF helper `bpf_bind()`. It allows binding to an IP only, though, and doesn't support binding to a port, i.e. it leverages the `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
* looking for a free port is expensive and can affect performance significantly;
* there is no use-case for port.

As for the remote end (`struct sockaddr *` passed by user), both parts of it can be overridden, remote IP and remote port. It's useful if an application inside a cgroup wants to connect to another application inside the same cgroup or to itself, but knows nothing about the IP assigned to the cgroup.

Support is added for IPv4 and IPv6, for TCP and UDP.

IPv4 and IPv6 have separate attach types for the same reason as the sys_bind hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields when the user passes sockaddr_in, since it'd be out-of-bound.

== Implementation notes ==

The patch introduces a new field in `struct proto`: `pre_connect`, a pointer to a function with the same signature as `connect` that is called before it. The reason is that in some cases BPF hooks should be called way before control is passed to `sk->sk_prot->connect`. Specifically, `inet_dgram_connect` autobinds the socket before calling `sk->sk_prot->connect`, and there is no way to call `bpf_bind()` from hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since it'd cause double-bind.

On the other hand, `proto.pre_connect` provides a flexible way to add BPF hooks for connect only for the necessary `proto` and call them at the desired time before `connect`. Since `bpf_bind()` is allowed to bind only to an IP and autobind in `inet_dgram_connect` binds only a port, there is no chance of double-bind.

bpf_bind() sets `force_bind_address_no_port` to bind to an IP only, regardless of the value of the `bind_address_no_port` socket field.

bpf_bind() sets `with_lock` to `false` when calling __inet_bind() and __inet6_bind() since all call-sites where bpf_bind() is called already hold the socket lock.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
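The ordering argument above (hook first, port-only autobind second, `connect` last) can be illustrated with a small userspace model. This is only a sketch: `proto_model`, `sock_model`, `dgram_connect_model` and the address constants are hypothetical stand-ins, not the kernel's structures or functions.

```c
#include <stdint.h>

/* Userspace model only: every name here is hypothetical, not kernel API. */
#define HOOK_SRC_IP    0x7f000001u  /* IP the hook binds the local end to */
#define HOOK_DST_IP    0x7f000002u  /* rewritten remote IP */
#define HOOK_DST_PORT  6666u        /* rewritten remote port */
#define AUTOBIND_PORT  40000u       /* port chosen by the autobind step */

struct addr_model { uint32_t ip; uint16_t port; };
struct sock_model;

struct proto_model {
    /* Optional; same signature as connect, called before it. */
    int (*pre_connect)(struct sock_model *sk, struct addr_model *dst);
    int (*connect)(struct sock_model *sk, struct addr_model *dst);
};

struct sock_model {
    const struct proto_model *prot;
    uint32_t bound_ip;    /* 0: no local IP bound yet */
    uint16_t bound_port;  /* 0: no local port bound yet */
    struct addr_model peer;
};

/* Stands in for a BPF hook: binds an IP only and rewrites the destination. */
static int hook_pre_connect(struct sock_model *sk, struct addr_model *dst)
{
    sk->bound_ip = HOOK_SRC_IP;
    dst->ip = HOOK_DST_IP;
    dst->port = HOOK_DST_PORT;
    return 0;
}

static int plain_connect(struct sock_model *sk, struct addr_model *dst)
{
    sk->peer = *dst;
    return 0;
}

static const struct proto_model dgram_proto = {
    .pre_connect = hook_pre_connect,
    .connect = plain_connect,
};

/* Datagram-style connect: hook first, then port-only autobind, then connect. */
static int dgram_connect_model(struct sock_model *sk, struct addr_model *dst)
{
    if (sk->prot->pre_connect) {
        int err = sk->prot->pre_connect(sk, dst);
        if (err)
            return err;
    }
    if (sk->bound_port == 0)
        sk->bound_port = AUTOBIND_PORT;  /* never touches bound_ip */
    return sk->prot->connect(sk, dst);
}
```

Because the hook in this model writes only the IP and the autobind step writes only the port, the two steps never fight over the same field, which is the "no chance of double-bind" property the message describes.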
2018-03-31  net: Introduce __inet_bind() and __inet6_bind  Andrey Ignatov
Refactor `bind()` code to make it ready to be called from BPF helper function `bpf_bind()` (will be added soon). Implementation of `inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and `__inet6_bind()` correspondingly. These functions can be used from both `sk_prot->bind` and `bpf_bind()` contexts.

New functions have two additional arguments.

`force_bind_address_no_port` forces binding to an IP only, without checking the `inet_sock.bind_address_no_port` field. It allows binding the local end of a connection to the desired IP in `bpf_bind()` without changing the `bind_address_no_port` field of a socket. It's useful since `bpf_bind()` can return an error, and we'd need to restore the original value of `bind_address_no_port` in that case if we had changed it before calling the helper.

`with_lock` specifies whether to lock the socket when working with `struct sk` or not. The argument is set to `true` for `sk_prot->bind`, i.e. old behavior is preserved. But it will be set to `false` for the `bpf_bind()` use-case. The reason is that all call-sites where `bpf_bind()` will be called already hold that socket lock.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
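A rough userspace model of the two extra arguments may help. It is a sketch under assumptions: `bind_sock`, `bind_inner`, the two wrappers and `EPHEMERAL_PORT` are hypothetical names, the lock is reduced to a counter, and the helper path is assumed to always pass port 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only: names and the port value are hypothetical. */
#define EPHEMERAL_PORT 50000u

struct bind_sock {
    int lock_acquisitions;        /* how many times the model took the lock */
    bool bind_address_no_port;    /* models the socket option field */
    uint32_t bound_ip;
    uint16_t bound_port;          /* 0: port selection deferred */
};

/* Models the shared implementation with the two extra arguments. */
static int bind_inner(struct bind_sock *sk, uint32_t ip, uint16_t port,
                      bool force_bind_address_no_port, bool with_lock)
{
    if (with_lock)
        sk->lock_acquisitions++;  /* taking the socket lock would go here */

    sk->bound_ip = ip;
    /* Pick a port unless the caller or the socket option defers it. */
    if (port != 0)
        sk->bound_port = port;
    else if (!(force_bind_address_no_port || sk->bind_address_no_port))
        sk->bound_port = EPHEMERAL_PORT;

    /* releasing the socket lock would go here when with_lock is true */
    return 0;
}

/* bind(2) path: old behaviour, takes the lock itself. */
static int bind_from_syscall(struct bind_sock *sk, uint32_t ip, uint16_t port)
{
    return bind_inner(sk, ip, port, false, true);
}

/* Helper path: the caller already holds the lock; bind an IP only. */
static int bind_from_helper(struct bind_sock *sk, uint32_t ip)
{
    return bind_inner(sk, ip, 0, true, false);
}
```

In this model the helper path leaves `bind_address_no_port` untouched, so there is nothing to restore if the bind fails, which matches the motivation given for `force_bind_address_no_port`.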
2018-03-31  selftests/bpf: Selftest for sys_bind hooks  Andrey Ignatov
Add selftest to work with bpf_sock_addr context from `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` programs.

Try to bind(2) on IP:port and apply:
* loads to make sure context can be read correctly, including narrow loads (byte, half) for IP and full-size loads (word) for all fields;
* stores to those fields allowed by verifier.

All combinations of IPv4/IPv6 and TCP/UDP are tested.

Both scenarios are tested:
* valid programs can be loaded and attached;
* invalid programs can be neither loaded nor attached.

The test passes when the expected data can be read from the context in the BPF program, and after the call to bind(2) the socket is bound to the IP:port pair that was written by the BPF program to the context.

Example:
  # ./test_sock_addr
  Attached bind4 program.
  Test case #1 (IPv4/TCP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
  Test case #2 (IPv4/UDP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
  Attached bind6 program.
  Test case #3 (IPv6/TCP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
  Test case #4 (IPv6/UDP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
  ### SUCCESS

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31  bpf: Hooks for sys_bind  Andrey Ignatov
== The problem ==

There is a use-case when all processes inside a cgroup should use one single IP address on a host that has multiple IPs configured. Those processes should use the IP for both ingress and egress, for TCP and UDP traffic. So TCP/UDP servers should be bound to that IP to accept incoming connections on it, and TCP/UDP clients should make outgoing connections from that IP. It should not require changing application code since it's often not possible.

Currently it's solved by intercepting glibc wrappers around syscalls such as `bind(2)` and `connect(2)`. It's done by a shared library that is preloaded for every process in a cgroup so that whenever a TCP/UDP server calls `bind(2)`, the library replaces the IP in sockaddr before passing arguments to the syscall. When an application calls `connect(2)`, the library transparently binds the local end of the connection to that IP (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid a performance penalty).

The shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.

== The solution ==

The patch provides a much more reliable in-kernel solution for the 1st part of the problem: binding TCP/UDP servers on the desired IP. It does not depend on application environment and implementation details (whether glibc is used or not).

It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND` (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).

The new program type is intended to be used with sockets (`struct sock`) in a cgroup and with the user-provided `struct sockaddr`. Pointers to both of them are parts of the context passed to programs of the newly added type.

The new attach types provide hooks in the `bind(2)` system call for both IPv4 and IPv6 so that one can write a program to override the IP addresses and ports a user program tries to bind to, and apply such a program to a whole cgroup.

== Implementation notes ==

[1]
Separate attach types for `AF_INET` and `AF_INET6` are added intentionally to prevent reading/writing to offsets that don't make sense for the corresponding socket family. E.g. if the user passes `sockaddr_in` it doesn't make sense to read from / write to `user_ip6[]` context fields.

[2]
The write access to `struct bpf_sock_addr_kern` is implemented using a special field as an additional "register".

There are just two registers in `sock_addr_convert_ctx_access`: `src` with the value to write and `dst` with the pointer to the context, which can't be changed without breaking later instructions. But the fields that may be written to are not available directly: to access them, the address of the corresponding pointer has to be loaded first. To get an additional register, the first register not used by `src` and `dst` is taken and its content is saved to `bpf_sock_addr_kern.tmp_reg`; then the register is used to load the address of the pointer field, and finally the register's content is restored from the temporary field after writing the `src` value.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31  libbpf: Support expected_attach_type at prog load  Andrey Ignatov
Support setting `expected_attach_type` at prog load time in both `bpf/bpf.h` and `bpf/libbpf.h`.

Since both headers already have API to load programs, new functions are added not to break backward compatibility for existing ones:
* `bpf_load_program_xattr()` is added to `bpf/bpf.h`;
* `bpf_prog_load_xattr()` is added to `bpf/libbpf.h`.

Both new functions accept structures, `struct bpf_load_program_attr` and `struct bpf_prog_load_attr` correspondingly, where new fields can be added in the future w/o changing the API.

Standard `_xattr` suffix is used to name the new API functions.

Since `bpf_load_program_name()` is not used as heavily as `bpf_load_program()`, it was removed in favor of more generic `bpf_load_program_xattr()`.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31  bpf: Check attach type at prog load time  Andrey Ignatov
== The problem ==

There are use-cases when a program of some type can be attached to multiple attach points and those attach points must have different permissions to access context or to call helpers.

E.g. a context structure may have fields for both IPv4 and IPv6, but it doesn't make sense to read from / write to an IPv6 field when the attach point is somewhere in the IPv4 stack.

The same applies to BPF helpers: it may make sense to call some helper from one attach point, but not from another for the same prog type.

== The solution ==

Introduce the `expected_attach_type` field in `struct bpf_attr` for the `BPF_PROG_LOAD` command. If the scenario described in "The problem" section is the case for some prog type, the field will be checked twice:

1) At load time the prog type is checked to see if the attach type for it must be known to validate program permissions correctly. The prog will be rejected with EINVAL if that is the case and `expected_attach_type` is not specified or has an invalid value.

2) At attach time `attach_type` is compared with `expected_attach_type`, if the prog type requires one, and, if they differ, the attach will be rejected with EINVAL.

The `expected_attach_type` is now available as part of `struct bpf_prog` in both `bpf_verifier_ops->is_valid_access()` and `bpf_verifier_ops->get_func_proto()` and can be used to check context accesses and calls to helpers correspondingly.

Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and Daniel Borkmann <daniel@iogearbox.net> here:
https://marc.info/?l=linux-netdev&m=152107378717201&w=2

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
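The two checks and the resulting per-field permission can be modelled in a few lines of userspace C. This is a sketch only: the enums, `prog_model` and the three functions are hypothetical and merely mirror the IPv4/IPv6 example from the message, not the kernel's verifier code.

```c
#include <errno.h>
#include <stdbool.h>

/* Userspace model only: these enums are hypothetical stand-ins. */
enum prog_type_model { PROG_PLAIN, PROG_NEEDS_ATTACH_TYPE };
enum attach_type_model { ATTACH_UNSPEC, ATTACH_V4, ATTACH_V6 };
enum ctx_field_model { FIELD_IP4, FIELD_IP6, FIELD_PORT };

struct prog_model {
    enum prog_type_model type;
    enum attach_type_model expected_attach_type;
};

/* Check 1: at load time the expected attach type must be known and valid. */
static int check_at_load(const struct prog_model *p)
{
    if (p->type != PROG_NEEDS_ATTACH_TYPE)
        return 0;
    if (p->expected_attach_type == ATTACH_V4 ||
        p->expected_attach_type == ATTACH_V6)
        return 0;
    return -EINVAL;
}

/* Check 2: at attach time the real attach type must match the expected one. */
static int check_at_attach(const struct prog_model *p,
                           enum attach_type_model attach_type)
{
    if (p->type != PROG_NEEDS_ATTACH_TYPE)
        return 0;
    return p->expected_attach_type == attach_type ? 0 : -EINVAL;
}

/* Context access decision that depends on the expected attach type. */
static bool field_access_ok(const struct prog_model *p,
                            enum ctx_field_model field)
{
    switch (field) {
    case FIELD_IP4:
        return p->expected_attach_type == ATTACH_V4;
    case FIELD_IP6:
        return p->expected_attach_type == ATTACH_V6;
    default:
        return true;  /* family-independent field */
    }
}
```

Since the access decision is made against `expected_attach_type` at load time, the attach-time comparison is what keeps a program validated for one attach point from being attached to another.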
2018-03-30  ext4: add extra checks to ext4_xattr_block_get()  Theodore Ts'o
Add explicit checks in ext4_xattr_block_get() just in case the e_value_offs and e_value_size fields in the xattr block are corrupted in memory after the buffer_verified bit is set on the xattr block.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-03-31  btrfs: lift errors from add_extent_changeset to the callers  David Sterba
The missing error handling in add_extent_changeset was hidden, so make it at least visible in the callers.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31  Btrfs: print error messages when failing to read trees  Liu Bo
When mount fails to read trees like the fs tree, checksum tree, extent tree, etc., there is not enough information about what went wrong.

With this, messages like
"BTRFS warning (device sdf): failed to read root (objectid=7): -5"
would help us a bit.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31  btrfs: user proper type for btrfs_mask_flags flags  David Sterba
All users pass a local unsigned int and not the __uXX types that are supposed to be used for userspace interfaces.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31  btrfs: split dev-replace locking helpers for read and write  David Sterba
The current calls are unclear in what way btrfs_dev_replace_lock takes the locks, so drop the argument, split the helpers and use similar naming as for read and write locks.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31  btrfs: remove stale comments about fs_mutex  David Sterba
The fs_mutex was killed in 2008 by a213501153fd66e2 ("Btrfs: Replace the big fs_mutex with a collection of other locks"), but is still remembered in some comments.

We don't have any extra needs for locking in the ACL handlers.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31  btrfs: use RCU in btrfs_show_devname for device list traversal  David Sterba
The show_devname callback is used to print the device name in /proc/self/mounts. We need to traverse the device list consistently and read the name, which is copied to a seq buffer, so we don't need further locking.

If the first device is being deleted at the same time, RCU will allow us to read the device name, though it will become stale right after the RCU protection ends. This is unavoidable and the user can expect that the device will disappear from the filesystem's list at some point.

The device_list_mutex was pretty heavy as it is used e.g. for writing the superblock and in a few other IO-related contexts. This can stall any application that reads the proc file for no reason.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31  btrfs: update barrier in should_cow_block  David Sterba
Once there was a simple int force_cow that was used with the plain barriers, and then converted to a bit, so we should use the appropriate barrier helper.

Other variables in the complex if condition do not depend on a barrier, so we should be fine in case the atomic barrier becomes a no-op.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31  btrfs: use lockdep_assert_held for mutexes  David Sterba
Using lockdep_assert_held is preferred, replace mutex_is_locked.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31  btrfs: use lockdep_assert_held for spinlocks  David Sterba
Using lockdep_assert_held is preferred, replace assert_spin_locked.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31  btrfs: Validate child tree block's level and first key  Qu Wenruo
We have several reports of node pointers pointing to incorrect child tree blocks, which may even have the wrong owner and level but still have a valid generation and checksum.

Although btrfs check can handle it and print an error message like:
leaf parent key incorrect 60670574592

the kernel doesn't have enough checks for this type of corruption.

At least add such a check to read_tree_block() and btrfs_read_buffer(), where we need two new parameters @level and @first_key to verify the child tree block.

The new @level check is mandatory and all call sites are already modified to extract the expected level from their call chain.

While @first_key is optional, the following call sites skip that check:
1) Root node/leaf
   As ROOT_ITEM doesn't contain the first key, skip the @first_key check.
2) Direct backref
   Only the parent bytenr and level are known and we need to resolve the key all by ourselves, so skip the @first_key check.

Another note on this verification: it needs extra info from the nodeptr or ROOT_ITEM, so it can't fit into the current tree-checker framework, which is limited to node/leaf boundaries.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
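The mandatory level check and the optional first-key check can be sketched as a small userspace function. All names (`key_model`, `block_model`, `verify_child_model`) are hypothetical, and returning `-EIO` on mismatch is an assumption of this model rather than something stated in the message.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: hypothetical types, not the on-disk btrfs format. */
struct key_model {
    uint64_t objectid;
    uint8_t type;
    uint64_t offset;
};

struct block_model {
    int level;                   /* level stored in the child block header */
    struct key_model first_key;  /* first key found in the child block */
};

static int key_equal(const struct key_model *a, const struct key_model *b)
{
    return a->objectid == b->objectid && a->type == b->type &&
           a->offset == b->offset;
}

/*
 * Verify a child block against what the parent expects.
 * expected_first_key may be NULL for callers that cannot know it
 * (root node/leaf, direct backref); the level check always runs.
 */
static int verify_child_model(const struct block_model *child,
                              int expected_level,
                              const struct key_model *expected_first_key)
{
    if (child->level != expected_level)
        return -EIO;  /* assumed error code for this sketch */
    if (expected_first_key == NULL)
        return 0;
    return key_equal(&child->first_key, expected_first_key) ? 0 : -EIO;
}
```

Passing `NULL` for the key models the two call sites listed above that skip the @first_key check while still verifying the level.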
2018-03-31  btrfs: tests/qgroup: Fix wrong tree backref level  Qu Wenruo
The extent tree of the test fs is like the following:

 BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
  item 0 key (4096 168 4096) itemoff 3944 itemsize 51
        extent refs 1 gen 1 flags 2
        tree block key (68719476736 0 0) level 1
                                         ^^^^^^^
        ref#0: tree block backref root 5

And it's using an empty tree for the fs tree, so there is no way that its level can be 1.

For a REAL (created by mkfs) fs tree backref with no skinny metadata, the result should look like:

 item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
        refs 1 gen 4 flags TREE_BLOCK
        tree block key (256 INODE_ITEM 0) level 0
                                          ^^^^^^^
        tree block backref root 5

Fix the level to 0, so it won't break the later tree level checker.

Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31  Btrfs: fix copy_items() return value when logging an inode  Filipe Manana
When logging an inode, at tree-log.c:copy_items(), if we call btrfs_next_leaf() in the loop which checks for the need to log holes, we need to make sure copy_items() returns the value 1 to its caller and not 0 (on success). This is because the path the caller passed was released and is now different from what it was before, and the caller expects a return value of 0 to mean both success and that the path has not changed, while a return value of 1 means both success and that the caller cannot reuse the path and has to perform another tree search.

Even though this case should not be triggered under normal circumstances, or only very rarely, its consequences can be very unpredictable (especially when replaying a log tree).

Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
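The return-value contract described above (negative for error, 0 for "success, path untouched", 1 for "success, path released") can be illustrated with a toy caller. The names `path_model`, `copy_items_model` and `log_items_model` are hypothetical; the real functions do far more than this sketch.

```c
#include <stdbool.h>

/* Userspace model only: hypothetical stand-ins for the path and the callee. */
struct path_model {
    bool valid;  /* false once the callee has released the path */
};

/*
 * Models the contract: <0 error, 0 success with the path untouched,
 * 1 success but the path was released while walking to the next leaf.
 */
static int copy_items_model(struct path_model *path, bool walks_next_leaf)
{
    if (walks_next_leaf) {
        path->valid = false;
        return 1;  /* returning 0 here would be the bug being fixed */
    }
    return 0;
}

/* Caller side: only re-search the tree when told the path is stale. */
static int log_items_model(struct path_model *path, bool walks_next_leaf,
                           int *tree_searches)
{
    int ret = copy_items_model(path, walks_next_leaf);

    if (ret < 0)
        return ret;
    if (ret > 0) {
        (*tree_searches)++;  /* another tree search rebuilds the path */
        path->valid = true;
    }
    return 0;
}
```

If the callee returned 0 after releasing the path, this caller would keep using a stale path, which is the unpredictable behaviour the message warns about.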
2018-03-31  Btrfs: fix fsync after hole punching when using no-holes feature  Filipe Manana
When we have the no-holes mode enabled and fsync a file after punching a hole in it, we can end up not logging the whole hole range in the log tree. This happens if the file has extent items that span more than one leaf and we punch a hole that covers a range that starts in a leaf but does not go beyond the offset of the first extent in the next leaf.

Example:

  $ mkfs.btrfs -f -O no-holes -n 65536 /dev/sdb
  $ mount /dev/sdb /mnt
  $ for ((i = 0; i <= 831; i++)); do
        offset=$((i * 2 * 256 * 1024))
        xfs_io -f -c "pwrite -S 0xab -b 256K $offset 256K" \
                /mnt/foobar >/dev/null
    done
  $ sync

  # We now have 2 leafs in our filesystem fs tree, the first leaf has an
  # item corresponding to the extent at file offset 216530944 and the second
  # leaf has a first item corresponding to the extent at offset 217055232.
  # Now we punch a hole that partially covers the range of the extent at
  # offset 216530944 but does not go beyond the offset 217055232.

  $ xfs_io -c "fpunch $((216530944 + 128 * 1024 - 4000)) 256K" /mnt/foobar
  $ xfs_io -c "fsync" /mnt/foobar

  <power fail>

  # mount to replay the log
  $ mount /dev/sdb /mnt

  # Before this patch, only the subrange [216658016, 216662016[ (length of
  # 4000 bytes) was logged, leaving an incorrect file layout after log
  # replay.

Fix this by checking if there is a hole between the last extent item that we processed and the first extent item in the next leaf, and if there is one, log an explicit hole extent item.

Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
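The offsets quoted in the example can be re-derived with plain arithmetic. The sketch below does that in C; the function names are made up for illustration, and the 4096-byte alignment used in `round_up_model` is an assumption that happens to reproduce the 4000-byte subrange quoted above.

```c
#include <stdint.h>

#define KIB 1024ULL

/* Offset of the i-th extent written by the loop: 256K written, 256K skipped. */
static uint64_t extent_offset(uint64_t i)
{
    return i * 2 * 256 * KIB;
}

/* Start of the punched hole, as passed to fpunch in the example. */
static uint64_t hole_start(void)
{
    return 216530944ULL + 128 * KIB - 4000;
}

/* End (exclusive) of the punched hole: 256K past its start. */
static uint64_t hole_end(void)
{
    return hole_start() + 256 * KIB;
}

/* Round v up to a multiple of align (assumed 4096 in the usage below). */
static uint64_t round_up_model(uint64_t v, uint64_t align)
{
    return (v + align - 1) / align * align;
}
```

With these helpers, `extent_offset(413)` and `extent_offset(414)` give the two extent offsets named in the example, the hole spans 216658016 to 216920160, and its end stays below the first extent of the second leaf, so the hole never reaches the next leaf's first item.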
2018-03-31  btrfs: use helper to set ulist aux from a qgroup  David Sterba
We have a nice helper to do proper casting of a qgroup to a ulist aux value. And several places that could make use of it.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31  Revert "btrfs: qgroups: Retry after commit on getting EDQUOT"  Qu Wenruo
This reverts commit 48a89bc4f2ceab87bc858a8eb189636b09c846a7.

The idea to commit the transaction and free some space after hitting the qgroup limit is good, but the problem is that it can easily cause deadlocks.

One deadlock example is caused by trying to flush data while still holding it:

Call Trace:
 __schedule+0x49d/0x10f0
 schedule+0xc6/0x290
 schedule_timeout+0x187/0x1c0
 wait_for_completion+0x204/0x3a0
 btrfs_wait_ordered_extents+0xa40/0xaf0 [btrfs]
 qgroup_reserve+0x913/0xa10 [btrfs]
 btrfs_qgroup_reserve_data+0x3ef/0x580 [btrfs]
 btrfs_check_data_free_space+0x96/0xd0 [btrfs]
 __btrfs_buffered_write+0x3ac/0xd40 [btrfs]
 btrfs_file_write_iter+0x62a/0xba0 [btrfs]
 __vfs_write+0x320/0x430
 vfs_write+0x107/0x270
 SyS_write+0xbf/0x150
 do_syscall_64+0x1b0/0x3d0
 entry_SYSCALL64_slow_path+0x25/0x25

Another can be caused by trying to commit one transaction while nesting with a trans handle held by ourselves:

btrfs_start_transaction()
|- btrfs_qgroup_reserve_meta_pertrans()
   |- qgroup_reserve()
      |- btrfs_join_transaction()
      |- btrfs_commit_transaction()

The retry is causing more problems than expected when the limit is enabled. At least a graceful EDQUOT is way better than a deadlock.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31  btrfs: qgroup: Update trace events for metadata reservation  Qu Wenruo
Now trace_qgroup_meta_reserve() has an extra type parameter.

Also introduce two new trace events:

1) trace_qgroup_meta_free_all_pertrans()
   For btrfs_qgroup_free_meta_all_pertrans()

2) trace_qgroup_meta_convert()
   For btrfs_qgroup_convert_reserved_meta()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31  btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space  Qu Wenruo
For the quota disabled->enabled case, it's possible that at reservation time quota was not enabled, so no bytes were really reserved, while at release time quota was enabled, so we will try to release some bytes we didn't really own.

Such a situation can cause metadata reservation underflow for both types, although it is less likely for the per-trans type since enabling quota commits the transaction.

To address this, record qgroup meta reserved bytes into root::qgroup_meta_rsv_pertrans and ::prealloc, so at release time we won't free any bytes we didn't reserve.

For DATA, it's already handled by io_tree, so nothing needs to be done there.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
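The underflow scenario and the "record what was really reserved" fix can be reproduced in a small userspace model. The names `root_model`, `qgroup_model`, `reserve_meta_model` and `release_meta_model` are hypothetical, and a single counter stands in for the separate per-trans and prealloc counters mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only: hypothetical stand-ins for the root and the qgroup. */
struct root_model {
    uint64_t meta_rsv_recorded;  /* bytes this root really reserved */
};

struct qgroup_model {
    bool enabled;
    uint64_t reserved;           /* bytes accounted in the qgroup */
};

/* Reserve: when quota is disabled nothing is reserved and nothing recorded. */
static void reserve_meta_model(struct qgroup_model *qg,
                               struct root_model *root, uint64_t bytes)
{
    if (!qg->enabled)
        return;
    qg->reserved += bytes;
    root->meta_rsv_recorded += bytes;
}

/* Release: never free more than was recorded; returns the bytes freed. */
static uint64_t release_meta_model(struct qgroup_model *qg,
                                   struct root_model *root, uint64_t bytes)
{
    uint64_t n;

    if (!qg->enabled)
        return 0;
    n = bytes < root->meta_rsv_recorded ? bytes : root->meta_rsv_recorded;
    root->meta_rsv_recorded -= n;
    qg->reserved -= n;
    return n;
}
```

Without the recorded counter, releasing 4096 bytes that were "reserved" while quota was off would subtract from an unsigned zero and wrap around, which is the underflow the message refers to.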
2018-03-31  btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item  Qu Wenruo
Quite similar to delalloc: some modifications to the delayed-inode and delayed-item reservation. The release case also needs an extra parameter to distinguish a normal release from an error release.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>