2020-09-30  bpf, libbpf: Add bpf_tail_call_static helper for bpf programs (Daniel Borkmann)
Port of the tail_call_static() helper function from Cilium's BPF code base [0] to libbpf, so others can easily consume it as well. We've been using this in production code for some time now. The main idea is that we guarantee that the kernel's BPF infrastructure and JIT (here: x86_64) can patch the JITed BPF insns with direct jumps instead of having to fall back to using expensive retpolines. By using inline asm, we guarantee that the compiler won't merge the call from different paths with potentially different content of r2/r3. We're also using Cilium's __throw_build_bug() macro (here as: __bpf_unreachable()) in different places as a neat trick to trigger compilation errors when the compiler does not remove code at compilation time. This works for the BPF back end as it does not implement __builtin_trap(). [0] https://github.com/cilium/cilium/commit/f5537c26020d5297b70936c6b7d03a1e412a1035 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/1656a082e077552eb46642d513b4a6bde9a7dd01.1601477936.git.daniel@iogearbox.net
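For reference, the helper added to bpf_helpers.h looks roughly like the sketch below (paraphrased from memory; the authoritative version lives in tools/lib/bpf/bpf_helpers.h). The constant-slot check plus the explicit register loads in inline asm are what keep each call site patchable with a direct jump:

  static __always_inline void
  bpf_tail_call_static(void *ctx, const void *map, const __u32 slot)
  {
      if (!__builtin_constant_p(slot))
          __bpf_unreachable();    /* compile-time error if slot is not constant */

      /* Load r1-r3 explicitly and emit the tail call (helper id 12) so the
       * compiler cannot merge call sites with different map/slot values. */
      asm volatile("r1 = %[ctx]\n\t"
                   "r2 = %[map]\n\t"
                   "r3 = %[slot]\n\t"
                   "call 12"
                   :: [ctx]"r"(ctx), [map]"r"(map), [slot]"i"(slot)
                   : "r0", "r1", "r2", "r3", "r4", "r5");
  }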
2020-09-30  bpf: Add redirect_neigh helper as redirect drop-in (Daniel Borkmann)
Add a redirect_neigh() helper as a redirect() drop-in replacement for the xmit side. The main idea for the helper is to be very similar in semantics to the latter, just that the skb gets injected into the neighboring subsystem in order to let the stack do the work it knows best anyway to populate the L2 addresses of the packet, and to then hand over to dev_queue_xmit() as redirect() does. This solves two bigger items: i) skbs don't need to go up to the stack on the host-facing veth ingress side for traffic egressing the container just to get the L2 addresses populated, which also has the huge advantage that ii) the skb->sk won't get orphaned in ip_rcv_core() when entering the IP routing layer on the host stack. Given that skb->sk also does not get orphaned when crossing the netns, as per 9c4c325252c5 ("skbuff: preserve sock reference when scrubbing the skb."), the helper can then push the skbs directly to the phys device, where the FQ scheduler can do its work and the TCP stack gets proper backpressure, given we hold on to skb->sk as long as the skb is still residing in queues. With the helper used in the BPF data path to push the skb to the phys device, I observed a stable/consistent TCP_STREAM improvement on veth devices for traffic going container -> host -> host -> container from ~10Gbps to ~15Gbps for a single stream in my test environment. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: David Ahern <dsahern@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/bpf/f207de81629e1724899b73b8112e0013be782d35.1601477936.git.daniel@iogearbox.net
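For orientation, a minimal tc egress program using the new helper could look like the sketch below. The section name, license boilerplate and HOST_PHYS_IFINDEX constant are illustrative only, and the two-argument (ifindex, flags) signature is the one introduced by this patch:

  // SPDX-License-Identifier: GPL-2.0
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #define HOST_PHYS_IFINDEX 2   /* made-up ifindex of the physical device */

  SEC("classifier")
  int redirect_egress(struct __sk_buff *skb)
  {
      /* Let the neighbor subsystem fill in the L2 addresses, then hand
       * the skb to dev_queue_xmit() on the target device. */
      return bpf_redirect_neigh(HOST_PHYS_IFINDEX, 0);
  }

  char _license[] SEC("license") = "GPL";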
2020-09-30  bpf, net: Rework cookie generator as per-cpu one (Daniel Borkmann)
With its use in BPF, the cookie generator can be called very frequently in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg) and attached to the root cgroup, for example, when used in v1/v2 mixed environments. In particular, when there's a high churn on sockets in the system there can be many parallel requests to the bpf_get_socket_cookie() and bpf_get_netns_cookie() helpers which then cause contention on the atomic counter. As similarly done in f991bd2e1421 ("fs: introduce a per-cpu last_ino allocator"), add a small helper library that both can use for the 64 bit counters. Given this can be called from different contexts, we also need to deal with potential nested calls even though in practice they are considered extremely rare. One idea as suggested by Eric Dumazet was to use a reverse counter for this situation since we don't expect 64 bit overflows anyways; that way, we can avoid bigger gaps in the 64 bit counter space compared to just batch-wise increase. Even on machines with small number of cores (e.g. 4) the cookie generation shrinks from min/max/med/avg (ns) of 22/50/40/38.9 down to 10/35/14/17.3 when run in parallel from multiple CPUs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Link: https://lore.kernel.org/bpf/8a80b8d27d3c49f9a14e1d5213c19d8be87d1dc8.1601477936.git.daniel@iogearbox.net
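To illustrate the batching idea only (a simplified conceptual sketch, not the kernel's helper library; names and the batch size are made up): each CPU hands out cookies from a locally reserved batch taken off a shared forward counter, and a nested caller on the same CPU falls back to a reverse counter that counts down from the top of the 64-bit space, so the two ranges never collide:

  #define COOKIE_BATCH    4096ULL         /* illustrative batch size */

  struct cookie_gen {
      atomic64_t    forward;      /* shared batch allocator, counts up */
      atomic64_t    reverse;      /* nested fallback, counts down */
      u64 __percpu  *local_last;  /* last cookie handed out on this CPU */
      int __percpu  *nesting;     /* detects re-entry on this CPU */
  };

  static u64 cookie_gen_next(struct cookie_gen *gc)
  {
      u64 val;

      preempt_disable();
      if (this_cpu_inc_return(*gc->nesting) == 1) {
          val = this_cpu_read(*gc->local_last);
          if ((val & (COOKIE_BATCH - 1)) == 0)
              /* local batch exhausted: reserve a fresh one */
              val = atomic64_add_return(COOKIE_BATCH, &gc->forward) -
                    COOKIE_BATCH;
          this_cpu_write(*gc->local_last, ++val);
      } else {
          /* nested call on the same CPU: take from the reverse space */
          val = atomic64_dec_return(&gc->reverse);
      }
      this_cpu_dec(*gc->nesting);
      preempt_enable();
      return val;
  }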
2020-09-30  bpf: Add classid helper only based on skb->sk (Daniel Borkmann)
Similarly to 5a52ae4e32a6 ("bpf: Allow to retrieve cgroup v1 classid from v2 hooks"), add a helper to retrieve the cgroup v1 classid solely based on skb->sk, so it can be used as a key for BPF map lookups out of tc from the host ns, in particular given that skb->sk is retained these days when crossing net ns thanks to 9c4c325252c5 ("skbuff: preserve sock reference when scrubbing the skb."). This is similar to bpf_skb_cgroup_id(), which implements the same for v2. The Kubernetes ecosystem is still operating on v1, however, hence net_cls needs to be used there until it can eventually be dropped in favour of the v2 helper bpf_skb_cgroup_id(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/ed633cf27a1c620e901c5aa99ebdefb028dce600.1601477936.git.daniel@iogearbox.net
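As a usage illustration, a tc BPF program on the host could key a policy map on the returned classid along the lines below; the map name, value semantics and section name are made up for the example:

  // SPDX-License-Identifier: GPL-2.0
  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>

  struct {
      __uint(type, BPF_MAP_TYPE_HASH);
      __uint(max_entries, 1024);
      __type(key, __u64);     /* net_cls classid */
      __type(value, __u32);   /* 0 = drop, anything else = allow (example) */
  } classid_policy SEC(".maps");

  SEC("classifier")
  int host_tc_policy(struct __sk_buff *skb)
  {
      __u64 classid = bpf_skb_cgroup_classid(skb);
      __u32 *verdict = bpf_map_lookup_elem(&classid_policy, &classid);

      if (verdict && *verdict == 0)
          return TC_ACT_SHOT;
      return TC_ACT_OK;
  }

  char _license[] SEC("license") = "GPL";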
2020-09-30  bpf: fix raw_tp test run in preempt kernel (Song Liu)
On a preemptible kernel, BPF_PROG_TEST_RUN on raw_tp triggers:
[ 35.874974] BUG: using smp_processor_id() in preemptible [00000000] code: new_name/87
[ 35.893983] caller is bpf_prog_test_run_raw_tp+0xd4/0x1b0
[ 35.900124] CPU: 1 PID: 87 Comm: new_name Not tainted 5.9.0-rc6-g615bd02bf #1
[ 35.907358] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 35.916941] Call Trace:
[ 35.919660] dump_stack+0x77/0x9b
[ 35.923273] check_preemption_disabled+0xb4/0xc0
[ 35.928376] bpf_prog_test_run_raw_tp+0xd4/0x1b0
[ 35.933872] ? selinux_bpf+0xd/0x70
[ 35.937532] __do_sys_bpf+0x6bb/0x21e0
[ 35.941570] ? find_held_lock+0x2d/0x90
[ 35.945687] ? vfs_write+0x150/0x220
[ 35.949586] do_syscall_64+0x2d/0x40
[ 35.953443] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix this by calling migrate_disable() before smp_processor_id(). Fixes: 1b4d60ec162f ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
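The shape of the fix in bpf_prog_test_run_raw_tp() is roughly the fragment below (simplified, error handling elided, and variable/flag names reproduced from memory rather than from the exact patch):

      /* Pin the task to a CPU so smp_processor_id() is valid even with
       * CONFIG_PREEMPT; undo it once the test run has been dispatched. */
      migrate_disable();
      current_cpu = smp_processor_id();
      if ((kattr->test.flags & BPF_F_TEST_RUN_ON_CPU) == 0 ||
          info.cpu == current_cpu)
          __bpf_prog_test_run_raw_tp(&info);
      else
          err = smp_call_function_single(info.cpu,
                                         __bpf_prog_test_run_raw_tp,
                                         &info, 1);
      migrate_enable();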
2020-09-30  can: mcp25xxfd: narrow down wildcards in device tree bindings to "microchip,mcp251xfd" (Thomas Kopp)
The wildcard should be narrowed down to prevent existing and future devices that are not compatible from matching. It is very unlikely that incompatible devices will be released that do not match the wildcard. Discussion Reference: https://lore.kernel.org/r/CAMuHMdVkwGjr6dJuMyhQNqFoJqbh6Ec5V2b5LenCshwpM2SDsQ@mail.gmail.com Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Thomas Kopp <thomas.kopp@microchip.com> Link: https://lore.kernel.org/r/20200930091423.755-1-thomas.kopp@microchip.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-09-30  dt-binding: can: mcp251xfd: narrow down wildcards in device tree bindings to "microchip,mcp251xfd" (Thomas Kopp)
The wildcard should be narrowed down to prevent existing and future devices that are not compatible from matching. It is very unlikely that incompatible devices will be released that do not match the wildcard. This is the documentation part of the commit. Discussion Reference: https://lore.kernel.org/r/CAMuHMdVkwGjr6dJuMyhQNqFoJqbh6Ec5V2b5LenCshwpM2SDsQ@mail.gmail.com Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Thomas Kopp <thomas.kopp@microchip.com> Link: https://lore.kernel.org/r/20200930091423.755-2-thomas.kopp@microchip.com [mkl: rename file, too] Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-09-30  dt-binding: can: mcp25xxfd: documentation fixes (Oleksij Rempel)
Apply the following fixes:
- Use 'interrupts'. (interrupts-extended will automagically be supported by the tools)
- *-supply is always a single item, so drop maxItems=1.
- Add the "additionalProperties: false" flag to detect unneeded properties.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/r/20200923125301.27200-1-o.rempel@pengutronix.de Reported-by: Rob Herring <robh@kernel.org> Reviewed-by: Rob Herring <robh@kernel.org> Fixes: 1b5a78e69c1f ("dt-binding: can: mcp25xxfd: document device tree bindings") Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-09-30  can: mcp25xxfd: mcp25xxfd_irq(): add missing initialization of variable set_normal_mode (Marc Kleine-Budde)
This patch fixes the following warning:
drivers/net/can/spi/mcp25xxfd/mcp25xxfd-core.c:2155 mcp25xxfd_irq() error: uninitialized symbol 'set_normal_mode'.
by adding the missing initialization. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Fixes: 55e5b97f003e ("can: mcp25xxfd: add driver for Microchip MCP25xxFD SPI CAN") Link: https://lore.kernel.org/r/20200923114726.2704426-1-mkl@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-09-30  can: mcp25xxfd: mcp25xxfd_ring_free(): fix memory leak during cleanup (Dan Carpenter)
This loop doesn't free the first element of the array. The "i > 0" has to be changed to "i >= 0". Fixes: 55e5b97f003e ("can: mcp25xxfd: add driver for Microchip MCP25xxFD SPI CAN") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20200923112752.GA1473821@mwanda Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
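The pattern being fixed looks like the fragment below; the struct and array names are hypothetical, only the loop bound matters:

      /* "i > 0" stops before element 0 and leaks it; "i >= 0" frees it too. */
      for (i = ARRAY_SIZE(priv->rx_ring) - 1; i >= 0; i--) {
          kfree(priv->rx_ring[i]);    /* hypothetical ring array */
          priv->rx_ring[i] = NULL;
      }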
2020-09-30  can: mcp25xxfd: mcp25xxfd_probe(): add SPI clk limit related errata information (Thomas Kopp)
This patch adds a reference to the recently released MCP2517FD and MCP2518FD errata sheets and pastes the explanation. The driver already implements the proposed fix. Signed-off-by: Thomas Kopp <thomas.kopp@microchip.com> Link: https://lore.kernel.org/r/20200925065606.358-1-thomas.kopp@microchip.com [mkl: split into two patches, adjust subject and commit message] Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-09-30  can: mcp25xxfd: mcp25xxfd_handle_eccif(): add ECC related errata and update log messages (Thomas Kopp)
This patch adds a reference to the recently released MCP2517FD and MCP2518FD errata sheets and pastes the explanation. The single error correction does not always work, so always indicate that a single error occurred. If the location of the ECC error is outside of the TX-RAM, always use netdev_notice() to log the problem. For ECC errors in the TX-RAM, there is a recovery procedure. Signed-off-by: Thomas Kopp <thomas.kopp@microchip.com> Link: https://lore.kernel.org/r/20200925065606.358-1-thomas.kopp@microchip.com [mkl: split into two patches, adjust subject and commit message] Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-09-29  Merge branch 'HW-support-for-VCAP-IS1-and-ES0-in-mscc_ocelot' (David S. Miller)
Vladimir Oltean says: ==================== HW support for VCAP IS1 and ES0 in mscc_ocelot The patches from RFC series "Offload tc-flower to mscc_ocelot switch using VCAP chains" have been split into 2: https://patchwork.ozlabs.org/project/netdev/list/?series=204810&state=* This is the boring part, that deals with the prerequisites, and not with tc-flower integration. Apart from the initialization of some hardware blocks, which at this point still don't do anything, no new functionality is introduced.
- Key and action field offsets are defined for the supported switches.
- VCAP properties are added to the driver for the new TCAM blocks. But instead of adding them manually as was done for IS2, which is error prone, the driver is refactored to read these parameters from hardware, which is possible.
- Some improvements regarding the processing of struct ocelot_vcap_filter.
- Extending the code to be compatible with full and quarter keys.
This series was tested, along with other patches not yet submitted, on the Felix and Seville switches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mscc: ocelot: look up the filters in flower_stats() and flower_destroy() (Vladimir Oltean)
Currently a new filter is created, containing just enough correct information to be able to call ocelot_vcap_block_find_filter_by_index() on it. This will be limiting us in the future, when we'll have more metadata associated with a filter, which will matter in the stats() and destroy() callbacks, and which we can't make up on the spot. For example, we'll start "offloading" some dummy tc filter entries for the TCAM skeleton, but we won't actually be adding them to the hardware, or to block->rules. So, it makes sense to avoid deleting those rules too. That's the kind of thing which is difficult to determine unless we look up the real filter. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mscc: ocelot: add a new ocelot_vcap_block_find_filter_by_id function (Vladimir Oltean)
And rename the existing find to ocelot_vcap_block_find_filter_by_index. The index is the position in the TCAM, and the id is the flow cookie given by tc. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mscc: ocelot: rename variable 'cnt' in vcap_data_offset_get() (Vladimir Oltean)
The 'cnt' variable is actually used for 2 purposes, to hold the number of sub-words per VCAP entry, and the number of sub-words per VCAP action. In fact, I'm pretty sure these 2 numbers can never be different from one another. By hardware definition, the entry (key) TCAM rows are divided into the same number of sub-words as its associated action RAM rows. But nonetheless, let's at least rename the variables such that observations like this one are easier to make in the future. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mscc: ocelot: rename variable 'count' in vcap_data_offset_get() (Vladimir Oltean)
This gets rid of one of the 2 variables named, very generically, "count". Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mscc: ocelot: calculate vcap offsets correctly for full and quarter entries (Xiaoliang Yang)
When calculating the offsets for the current entry within the row and placing them inside struct vcap_data, the function assumes a half key entry (2 keys per row). This patch modifies the vcap_data_offset_get() function to calculate a correct data offset when setting the VCAP Type-Group of a key to VCAP_TG_FULL or VCAP_TG_QUARTER. This is needed because, for example, VCAP ES0 only supports full keys. Also rename the 'count' variable to 'num_entries_per_row' to make the function just one tiny bit easier to follow. Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mscc: ocelot: parse flower action before key (Vladimir Oltean)
When we'll make the switch to multiple chain offloading, we'll want to know first what VCAP block the rule is offloaded to. This impacts what keys are available. Since the VCAP block is determined by what actions are used, parse the action first. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mscc: ocelot: remove unneeded VCAP parameters for IS2 (Vladimir Oltean)
Now that we are deriving these from the constants exposed by the hardware, we can delete the static info we're keeping in the driver. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mscc: ocelot: automatically detect VCAP constants (Vladimir Oltean)
The numbers in struct vcap_props are not intuitive to derive, because they are not a straightforward copy-and-paste from the reference manual but instead rely on a fairly detailed level of understanding of the layout of an entry in the TCAM and in the action RAM. For this reason, bugs are very easy to introduce here. Ease the work of hardware porters and read from hardware the constants that were exported for this particular purpose. Note that this implies that struct vcap_props can no longer be const. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mscc: ocelot: add definitions for VCAP ES0 keys, actions and target (Vladimir Oltean)
As a preparation step for the offloading to ES0, let's create the infrastructure for talking with this hardware block. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mscc: ocelot: add definitions for VCAP IS1 keys, actions and target (Vladimir Oltean)
As a preparation step for the offloading to IS1, let's create the infrastructure for talking with this hardware block. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mscc: ocelot: generalize existing code for VCAP (Vladimir Oltean)
In the Ocelot switches there are 3 TCAMs: VCAP ES0, IS1 and IS2, which have the same configuration interface, but different sets of keys and actions. The driver currently only supports VCAP IS2. In preparation of VCAP IS1 and ES0 support, the existing code must be generalized to work with any VCAP. In that direction, we should move the structures that depend upon VCAP instantiation, like vcap_is2_keys and vcap_is2_actions, out of struct ocelot and into struct vcap_props .keys and .actions, a structure that is replicated 3 times, once per VCAP. We'll pass that structure as an argument to each function that does the key and action packing - only the control logic needs to distinguish between ocelot->vcap[VCAP_IS2] or IS1 or ES0. Another change is to make use of the newly introduced ocelot_target_read and ocelot_target_write API, since the 3 VCAPs have the same registers but put at different addresses. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mscc: ocelot: return error if VCAP filter is not found (Xiaoliang Yang)
Although it doesn't look like it is possible to hit these conditions from user space, there are 2 separate, but related, issues. First, the ocelot_vcap_block_get_filter_index function, née ocelot_ace_rule_get_index_id prior to the aae4e500e106 ("net: mscc: ocelot: generalize the "ACE/ACL" names") rename, does not do what the author probably intended. If the desired filter entry is not present in the ACL block, this function returns an index equal to the total number of filters, instead of -1, which is maybe what was intended, judging from the curious initialization with -1, and the "++index" idioms. Either way, none of the callers seems to expect this behavior. Second issue, the callers don't actually check the return value at all. So in case the filter is not found in the rule list, propagate the return code. So update the callers and also take the opportunity to get rid of the odd coding idioms that appear to work but don't. Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
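A sketch of the corrected lookup idiom is below; the function body is paraphrased rather than quoted from the patch, and the list head/member names are assumptions:

  static int ocelot_vcap_block_get_filter_index(struct ocelot_vcap_block *block,
                                                struct ocelot_vcap_filter *filter)
  {
      struct ocelot_vcap_filter *tmp;
      int index = 0;

      list_for_each_entry(tmp, &block->rules, list) {
          if (filter == tmp)
              return index;
          index++;
      }

      /* Not in the block: report it instead of returning the rule count. */
      return -ENOENT;
  }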
2020-09-29  net: mscc: ocelot: introduce a new ocelot_target_{read,write} API (Vladimir Oltean)
There are some targets (register blocks) in the Ocelot switch that are instantiated more than once. For example, the VCAP IS1, IS2 and ES0 blocks all share the same register layout for interacting with the cache for the TCAM and the action RAM. For the VCAPs, the procedure for servicing them is actually common. We just need an API specifying which VCAP we are talking to, and we do that via these raw ocelot_target_read and ocelot_target_write accessors. In plain ocelot_read, the target is encoded into the register enum itself: u16 target = reg >> TARGET_OFFSET; For the VCAPs, the registers are currently defined like this: enum ocelot_reg { [...] S2_CORE_UPDATE_CTRL = S2 << TARGET_OFFSET, S2_CORE_MV_CFG, S2_CACHE_ENTRY_DAT, S2_CACHE_MASK_DAT, S2_CACHE_ACTION_DAT, S2_CACHE_CNT_DAT, S2_CACHE_TG_DAT, [...] }; which is precisely what we want to avoid, because we'd have to duplicate the same register map for S1 and for S0, and then figure out how to pass VCAP instance-specific registers to the ocelot_read calls (basically another lookup table that undoes the effect of shifting with TARGET_OFFSET). So for some targets, propose a more raw API, similar to what is currently done with ocelot_port_readl and ocelot_port_writel. Those targets can only be accessed with ocelot_target_{read,write} and not with ocelot_{read,write} after the conversion, which is fine. The VCAP registers are not actually modified to use this new API as of this patch. They will be modified in the next one. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mvneta: avoid possible cache misses in mvneta_rx_swbm (Lorenzo Bianconi)
Do not use rx_desc pointers if possible, since rx descriptors are stored in uncached memory and dereferencing rx_desc pointers generates extra loads. This patch improves XDP_DROP performance by ~110 Kpps (700 Kpps vs 590 Kpps) on Marvell Espressobin. Analyzed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  libbpf: Compile in PIC mode only for shared library case (Andrii Nakryiko)
Libbpf compiles .o's for static and shared library modes separately, so no need to specify -fPIC for both. Keep it only for shared library mode. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200929220604.833631-3-andriin@fb.com
2020-09-29  libbpf: Compile libbpf under -O2 level by default and catch extra warnings (Andrii Nakryiko)
For some reason the compiler doesn't complain about the uninitialized variable, fixed in the previous patch, if libbpf is compiled without the -O2 optimization level. So do compile it with -O2 and never let a similar issue slip by again. -Wall is added unconditionally, so no need to specify it again. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200929220604.833631-2-andriin@fb.com
2020-09-29  libbpf: Fix uninitialized variable in btf_parse_type_sec (Andrii Nakryiko)
Fix an obvious uninitialized variable use that wasn't reported by the compiler. The libbpf Makefile changes to catch such errors are added separately. Fixes: 3289959b97ca ("libbpf: Support BTF loading and raw data output in both endianness") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200929220604.833631-1-andriin@fb.com
2020-09-29  Merge branch 'bpf, x64: optimize JIT's pro/epilogue' (Alexei Starovoitov)
Maciej Fijalkowski says: ==================== Hi! This small set can be considered a follow-up to the recent addition of support for tail calls in bpf subprograms and is focused on optimizing the x64 JIT prologue and epilogue sections. Turns out that popping the tail call counter is not needed anymore, and %rsp handling when stack depth is 0 can be skipped. For longer explanations, please see the commit messages. Thank you, Maciej ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-09-29  bpf: x64: Do not emit sub/add 0, %rsp when !stack_depth (Maciej Fijalkowski)
There is no particular reason for keeping the "sub 0, %rsp" insn within the BPF's x64 JIT prologue. When tail call code was skipping the whole prologue section these 7 bytes that represent the rsp subtraction could not be simply discarded as the jump target address would be broken. An option to address that would be to substitute it with nop7. Right now tail call is skipping only first 11 bytes of target program's prologue and "sub X, %rsp" is the first insn that is processed, so if stack depth is zero then this insn could be omitted without the need for nop7 swap. Therefore, do not emit the "sub 0, %rsp" in prologue when program is not making use of R10 register. Also, make the emission of "add X, %rsp" conditional in tail call code logic and take into account the presence of mentioned insn when calculating the jump offsets. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200929204653.4325-3-maciej.fijalkowski@intel.com
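In JIT terms, the change boils down to guarding the stack pointer adjustments, roughly as in the fragment below (EMIT macro usage as in arch/x86/net/bpf_jit_comp.c, reproduced from memory rather than from the exact patch):

      /* prologue: emit "sub $stack_depth, %rsp" only when the program
       * actually uses the stack (i.e. R10-relative accesses exist) */
      if (stack_depth)
          EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));

      /* tail call path: the matching "add $stack_depth, %rsp" becomes
       * conditional too, and the offset used to skip the target program's
       * prologue must account for whether the 7-byte sub was emitted */
      if (stack_depth)
          EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));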
2020-09-29  bpf, x64: Drop "pop %rcx" instruction on BPF JIT epilogue (Maciej Fijalkowski)
Back when all of the callee-saved registers were always pushed to stack in the x64 JIT prologue, the tail call counter was placed at the bottom of the BPF program's stack frame, which had the following layout:

 +-------------+
 |  ret addr   |
 +-------------+
 |     rbp     | <- rbp
 +-------------+
 |             |
 | free space  |
 | from:       |
 | sub $x,%rsp |
 |             |
 +-------------+
 |     rbx     |
 +-------------+
 |     r13     |
 +-------------+
 |     r14     |
 +-------------+
 |     r15     |
 +-------------+
 |  tail call  | <- rsp
 |   counter   |
 +-------------+

In order to restore the callee saved registers, the epilogue needed to explicitly toss away the tail call counter via a "pop %rbx" insn, so that %rsp would be back at the place where %r15 was stored. Currently, the tail call counter is placed on stack *before* the callee saved registers (brackets on rbx through r15 mean that they are now pushed to stack only if they are used):

 +-------------+
 |  ret addr   |
 +-------------+
 |     rbp     | <- rbp
 +-------------+
 |             |
 | free space  |
 | from:       |
 | sub $x,%rsp |
 |             |
 +-------------+
 |  tail call  |
 |   counter   |
 +-------------+
 (     rbx     )
 +-------------+
 (     r13     )
 +-------------+
 (     r14     )
 +-------------+
 (     r15     ) <- rsp
 +-------------+

For the record, the epilogue insns consist of (assuming all of the callee saved registers are used by the program):

 pop %r15
 pop %r14
 pop %r13
 pop %rbx
 pop %rcx
 leaveq
 retq

"pop %rbx" for getting rid of the tail call counter was not an option anymore, as it would overwrite the restored value of the %rbx register, so it was changed to use the %rcx register. Since the epilogue can start popping the callee saved registers right away without any additional work, the "pop %rcx" can be dropped altogether, as the "leave" insn will simply move %rbp to %rsp. IOW, the tail call counter does not need the explicit handling. With the explanation above and the actual reason for it in mind, let's piggyback on the "leave" insn for discarding the tail call counter from stack and remove the "pop %rcx" from the epilogue. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200929204653.4325-2-maciej.fijalkowski@intel.com
2020-09-29  selftests/bpf: Fix endianness issues in sk_lookup/ctx_narrow_access (Ilya Leoshkevich)
This test makes a lot of narrow load checks while assuming little endian architecture, and therefore fails on s390. Fix by introducing LSB and LSW macros and using them to perform narrow loads. Fixes: 0ab5539f8584 ("selftests/bpf: Tests for BPF_SK_LOOKUP attach point") Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200929201814.44360-1-iii@linux.ibm.com
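The macros are along the lines of the sketch below (paraphrased; the real test may differ in detail): they index a value's bytes or half-words from the least-significant end regardless of the host byte order, so the narrow loads check the same logical bits on big- and little-endian machines:

  #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  #define LSB(value, index) (((__u8 *)&(value))[index])
  #define LSW(value, index) (((__u16 *)&(value))[index])
  #else
  #define LSB(value, index) (((__u8 *)&(value))[sizeof(value) - (index) - 1])
  #define LSW(value, index) (((__u16 *)&(value))[sizeof(value) / 2 - (index) - 1])
  #endif

  /* e.g. LSB(ctx->local_port, 0) is the lowest byte on either endianness */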
2020-09-29  lib8390: Replace panic() call with BUILD_BUG_ON (Armin Wolf)
Replace panic() call in lib8390.c with BUILD_BUG_ON() since checking the size of struct e8390_pkt_hdr should happen at compile-time. Signed-off-by: Armin Wolf <W_Armin@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
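The shape of the change is roughly the following (the exact condition and panic message in lib8390.c may differ slightly):

      /* Before: a runtime check that could only ever fire while probing. */
      if (sizeof(struct e8390_pkt_hdr) != 4)
          panic("8390: header struct mispacked\n");

      /* After: reject a mispacked layout already at compile time. */
      BUILD_BUG_ON(sizeof(struct e8390_pkt_hdr) != 4);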
2020-09-29  Merge branch 'net-in_interrupt-cleanup-and-fixes' (David S. Miller)
Thomas Gleixner says: ==================== net: in_interrupt() cleanup and fixes In the discussion about preempt count consistency across kernel configurations: https://lore.kernel.org/r/20200914204209.256266093@linutronix.de/ Linus clearly requested that code in drivers and libraries which changes behaviour based on execution context should either be split up so that e.g. task context invocations and BH invocations have different interfaces, or, if that's not possible, the context information has to be provided by the caller which knows in which context it is executing. This includes conditional locking, allocation mode (GFP_*) decisions and avoidance of code paths which might sleep. In the long run, usage of 'preemptible, in_*irq etc.' should be banned from driver code completely. This is the second version of the first batch of related changes. V1 can be found here: https://lore.kernel.org/r/20200927194846.045411263@linutronix.de Changes vs. V1:
- Rebased to net-next
- Fixed the half-done rename silliness in the ENIC patch.
- Fixed the IONIC driver fallout.
- Picked up the SFC fix from Edward and adjusted the GFP_KERNEL change accordingly.
- Addressed the review comments vs. BRCMFMAC.
- Collected Reviewed/Acked-by tags as appropriate.
==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: rtlwifi: Replace in_interrupt() for context detection (Sebastian Andrzej Siewior)
rtl_lps_enter() and rtl_lps_leave() are using in_interrupt() to detect whether it is safe to acquire a mutex or if it is required to defer to a workqueue. The usage of in_interrupt() in drivers is phased out and Linus clearly requested that code which changes behaviour depending on context should either be separated or the context be conveyed in an argument passed by the caller, which usually knows the context. in_interrupt() also is only partially correct because it fails to choose the correct code path when just preemption or interrupts are disabled. Add an argument 'may_block' to both functions and adjust the callers to pass the context information. The following call chains were analyzed to be safe to block:

  rtl_watchdog_wq_callback()
    rtl_lps_leave/enter()

  rtl_op_suspend()
    rtl_lps_leave()

  rtl_op_bss_info_changed()
    rtl_lps_leave()

  rtl_op_sw_scan_start()
    rtl_lps_leave()

The following call chains were analyzed to be unsafe to block:

  _rtl_pci_interrupt()
    _rtl_pci_rx_interrupt()
      rtl_lps_leave()

  _rtl_pci_interrupt()
    _rtl_pci_rx_interrupt()
      rtl_is_special_data()
        rtl_lps_leave()

  _rtl_pci_interrupt()
    _rtl_pci_rx_interrupt()
      rtl_is_special_data()
        setup_special_tx()
          rtl_lps_leave()

  _rtl_pci_interrupt()
    _rtl_pci_tx_isr()
      rtl_lps_leave()

  halbtc_leave_lps()
    rtl_lps_leave()

This leaves four callers of rtl_lps_enter/leave() where the analysis stopped dead in the maze of several nested pointer-based callchains and the lack of rtlwifi hardware to debug this via tracing: halbtc_leave_lps(), halbtc_enter_lps(), halbtc_normal_lps(), halbtc_pre_normal_lps(). These four have been cautiously marked as unable to block, which is the safe option, but the rtlwifi wizards should be able to clarify that. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: rtlwifi: Remove in_interrupt() from debug macro (Sebastian Andrzej Siewior)
The usage of in_interrupt() in drivers is being phased out. rtl_dbg(), a printk-based debug aid, uses in_interrupt() in the underlying C function _rtl_dbg_out(), which is almost identical to _rtl_dbg_print(). The only difference is the printout of in_interrupt(). Decoding in_interrupt() as a hex value is non-trivial and, aside from being phased out for driver usage, the return value is just by chance the masked preempt count value and not a boolean. These home-brewed printk debug aids are tedious to work with and provide only minimal context. They should be replaced by trace_printk() or a debug tracepoint which automatically records all context information. To make progress on the in_interrupt() cleanup, make rtl_dbg() use _rtl_dbg_print() and remove _rtl_dbg_out(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: rtlwifi: Remove void* casts related to delayed work (Sebastian Andrzej Siewior)
INIT_DELAYED_WORK() takes two arguments: a pointer to the delayed work and a function reference for the callback. The rtl code casts all function references to (void *) because the callbacks in use do not match the required function signature. That's error-prone and bad practice. Some of the callback functions are also global, but only used in a single file. Clean the mess up by:
- Adding the proper arguments to the callback functions and using them in the container_of() constructs correctly, which removes the hideous container_of_dwork_rtl() macro as well.
- Removing the type cast at the initializers.
- Making the unnecessary global functions static.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: libertas: Use netif_rx_any_context() (Sebastian Andrzej Siewior)
The usage of in_interrupt() in non-core code is phased out. Ideally the information of the calling context should be passed by the callers or the functions be split as appropriate. libertas uses in_interrupt() to select the netif_rx*() variant which matches the calling context. The attempt to consolidate the code by passing an argument or by disentangling it failed due to lack of knowledge about this driver and because the call chains are hard to follow. As a stop gap, use netif_rx_any_context(), which invokes the correct code path depending on context and confines the in_interrupt() usage to core code. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
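For reference, netif_rx_any_context() is roughly the following core wrapper, so the in_interrupt() check lives in one place instead of being open-coded in drivers (sketch, not a verbatim quote of the kernel source):

  int netif_rx_any_context(struct sk_buff *skb)
  {
      /* Task context will not run softirqs on return, so use netif_rx_ni()
       * there; from hard/soft interrupt context plain netif_rx() is fine. */
      if (in_interrupt())
          return netif_rx(skb);
      else
          return netif_rx_ni(skb);
  }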
2020-09-29  net: libertas libertas_tf: Remove in_interrupt() from debug macro (Sebastian Andrzej Siewior)
The debug macro prints (INT) when in_interrupt() returns true. The value of this information is dubious as it does not distinguish between the various contexts which are covered by in_interrupt(). As the usage of in_interrupt() in drivers is phased out and the same information can be more precisely obtained with tracing, remove the in_interrupt() conditional from this debug printk. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: mwifiex: Use netif_rx_any_context() (Sebastian Andrzej Siewior)
The usage of in_interrupt() in non-core code is phased out. Ideally the information of the calling context should be passed by the callers or the functions be split as appropriate. mwifiex uses in_interrupt() to select the netif_rx*() variant which matches the calling context. The attempt to consolidate the code by passing an argument or by disentangling it failed due to lack of knowledge about this driver and because the call chains are hard to follow. As a stop gap, use netif_rx_any_context(), which invokes the correct code path depending on context and confines the in_interrupt() usage to core code. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: hostap: Remove in_interrupt() usage (Sebastian Andrzej Siewior)
in_interrupt() is ill defined and does not provide what the name suggests. The usage especially in driver code is deprecated and a tree wide effort to clean up and consolidate the (ab)usage of in_interrupt() and related checks is happening. hfa384x_cmd() and prism2_hw_reset() check in_interrupt() at function entry and if true emit a printk at debug loglevel and return. This is clearly debug code. Both functions invoke functions which can sleep. These functions already have appropriate debug checks which cover all invalid contexts, while in_interrupt() fails to detect context which just has preemption or interrupts disabled. Remove both checks as they are incomplete, debug only and already covered by the subsequently invoked functions properly. If called from invalid context the resulting back trace is definitely more helpful to analyze the problem than a printk at debug loglevel. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: iwlwifi: Remove in_interrupt() from tracing macro (Sebastian Andrzej Siewior)
The usage of in_interrupt() in driver code is phased out. The iwlwifi_dbg tracepoint records in_interrupt() separately, but that's superfluous because the trace header already records all kinds of state and context information like hardirq status, softirq status, preemption count etc. Aside from that, recording in_interrupt() as a boolean does not allow distinguishing between the possible contexts (hard interrupt, soft interrupt, bottom half disabled) while the trace header gives precise information. Remove the duplicate information from the tracepoint and fix up the caller. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Luca Coelho <luca@coelho.fi> Acked-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: ipw2x00,iwlegacy,iwlwifi: Remove in_interrupt() from debug macros (Sebastian Andrzej Siewior)
The usage of in_interrupt() in non-core code is phased out. The debugging macros in these drivers use in_interrupt() to print 'I' or 'U' depending on the return value of in_interrupt(). While 'U' is confusing at best and 'I' does not really describe the actual context (hard interrupt, soft interrupt, bottom half disabled section), these debug macros originate from the pre-ftrace kernel era and their value today is questionable. They probably should be removed completely. The macros were added initially for ipw2100 and then spread when the driver was forked. Remove the in_interrupt() usage at least. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: brcmfmac: Convey allocation mode as argument (Sebastian Andrzej Siewior)
The usage of in_interrupt() in drivers is phased out and Linus clearly requested that code which changes behaviour depending on context should either be separated or the context be conveyed in an argument passed by the caller, which usually knows the context. brcmf_fweh_process_event() uses in_interrupt() to select the allocation mode GFP_KERNEL/GFP_ATOMIC. Aside from the above reasons, this check is incomplete as it cannot detect contexts which just have preemption or interrupts disabled. All callchains leading to brcmf_fweh_process_event() can clearly identify the calling context. Convey a 'gfp' argument through the callchains and let the callers hand in the appropriate GFP mode. This also has the advantage that any change of execution context or preemption/interrupt state in these callchains will be detected by the memory allocator for all GFP_KERNEL allocations. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: brcmfmac: Convey execution context via argument to brcmf_netif_rx() (Thomas Gleixner)
brcmf_netif_rx() uses in_interrupt() to choose between netif_rx() and netif_rx_ni(). in_interrupt() usage in drivers is phased out. Convey the execution mode via an 'inirq' argument through the various callchains leading to brcmf_netif_rx():

  brcmf_pcie_isr_thread()                <- Task context
    brcmf_proto_msgbuf_rx_trigger()
      brcmf_msgbuf_process_rx()
        brcmf_msgbuf_process_msgtype()
          brcmf_msgbuf_process_rx_complete()
            brcmf_netif_mon_rx()
              brcmf_netif_rx(inirq = false)
            brcmf_netif_rx(inirq = false)

  brcmf_sdio_readframes()                <- Task context, sdio_claim_host() might sleep
    brcmf_rx_frame(inirq = false)

  brcmf_sdio_rxglom()                    <- Task context, sdio_claim_host() might sleep
    brcmf_rx_frame(inirq = false)

  brcmf_usb_rx_complete()                <- Interrupt context
    brcmf_rx_frame(inirq = true)

  brcmf_rx_frame()
    brcmf_proto_rxreorder()
      brcmf_proto_bcdc_rxreorder()
        brcmf_fws_rxreorder()
          brcmf_netif_rx()
    brcmf_netif_rx()

Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arend van Spriel <arend.vanspriel@broadcom.com> Cc: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: brcmfmac: Replace in_interrupt() (Sebastian Andrzej Siewior)
brcmf_sdio_isr() is using in_interrupt() to distinguish whether it is called from an interrupt service routine or from a worker thread. Passing such information from the calling context is preferred and requested by Linus, so add an argument 'in_isr' to brcmf_sdio_isr() and let the callers pass the information about the calling context. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Arend van Spriel <arend.vanspriel@broadcom.com> Acked-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: wan/lmc: Remove lmc_trace() (Sebastian Andrzej Siewior)
lmc_trace() was first introduced in commit e7a392d5158af ("Import 2.3.99pre6-5") and has not been touched ever since. The reason for looking at this was to get rid of the in_interrupt() usage, but while looking at it the following observations were made:
- At least lmc_get_stats() (->ndo_get_stats()) is invoked with disabled preemption, which is not detected by the in_interrupt() check and would cause schedule() to be called from invalid context.
- The code is hidden behind #ifdef LMC_TRACE, which is not defined within the kernel and wasn't at the time it was introduced.
- Three jiffies don't match 50ms. msleep() would be a better match, which would also avoid the schedule() invocation. But why have it to begin with?
- Nobody would do something like this today. Either netdev_dbg() or trace_printk() or a trace event would be used. If only the functions related to this driver are interesting then ftrace can be used with filtering.
As it has obviously been broken for years, simply remove it. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29  net: usb: net1080: Remove in_interrupt() comment (Sebastian Andrzej Siewior)
The comment above nc_vendor_write() suggests that the function could become async so that it is usable in `in_interrupt()' context, or that it already is safe to be called from such a context. Either way: the function did not become async since v2.4.9.2 (2002) and it must not be called from `in_interrupt()' context because it sleeps on multiple occasions. Remove the misleading comment. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>