summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2021-06-11ice: add low level PTP clock access functionsJacob Keller
Add the ice_ptp_hw.c file and some associated definitions to the ice driver folder. This file contains basic low level definitions for functions that interact with the device hardware. For now, only E810-based devices are supported. The ice hardware supports 2 major variants which have different PHYs with different procedures necessary for interacting with the device clock. Because the device captures timestamps in the PHY, each PHY has its own internal timer. The timers are synchronized in hardware by first preparing the source timer and the PHY timer shadow registers, and then issuing a synchronization command. This ensures that both the source timer and PHY timers are programmed simultaneously. The timers themselves are all driven from the same oscillator source. The functions in ice_ptp_hw.c abstract over the differences between how the PHYs in E810 are programmed vs how the PHYs in E822 devices are programmed. This series only implements E810 support, but E822 support will be added in a future change. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-06-11xfrm: remove hdr_offset indirectionFlorian Westphal
After previous patches all remaining users set the function pointer to the same function: xfrm6_find_1stfragopt. So remove this function pointer and call ip6_find_1stfragopt directly. Reduces size of xfrm_type to 64 bytes on 64bit platforms. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-11perf: Add EVENT_ATTR_ID to simplify event attributesQi Liu
Similar EVENT_ATTR macros are defined in many PMU drivers, like Arm PMU driver, Arm SMMU PMU driver. So add a generic macro to simplify code. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Qi Liu <liuqi115@huawei.com> Link: https://lore.kernel.org/r/1623220863-58233-2-git-send-email-liuqi115@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
2021-06-10audit: remove trailing spaces and tabsZhen Lei
Run the following command to find and remove the trailing spaces and tabs: sed -r -i 's/[ \t]+$//' <audit_files> The files to be checked are as follows: kernel/audit* include/linux/audit.h include/uapi/linux/audit.h Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-06-10io_uring: add feature flag for rsrc tagsPavel Begunkov
Add IORING_FEAT_RSRC_TAGS indicating that io_uring supports a bunch of new IORING_REGISTER operations, in particular IORING_REGISTER_[FILES[,UPDATE]2,BUFFERS[2,UPDATE]] that support rsrc tagging, and also indicating implemented dynamic fixed buffer updates. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9b995d4045b6c6b4ab7510ca124fd25ac2203af7.1623339162.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-10io_uring: change registration/upd/rsrc tagging ABIPavel Begunkov
There are ABI moments about recently added rsrc registration/update and tagging that might become a nuisance in the future. First, IORING_REGISTER_RSRC[_UPD] hide different types of resources under it, so breaks fine control over them by restrictions. It works for now, but once those are wanted under restrictions it would require a rework. It was also inconvenient trying to fit a new resource not supporting all the features (e.g. dynamic update) into the interface, so better to return to IORING_REGISTER_* top level dispatching. Second, register/update were considered to accept a type of resource, however that's not a good idea because there might be several ways of registration of a single resource type, e.g. we may want to add non-contig buffers or anything more exquisite as dma mapped memory. So, remove IORING_RSRC_[FILE,BUFFER] out of the ABI, and place them internally for now to limit changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9b554897a7c17ad6e3becc48dfed2f7af9f423d5.1623339162.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-10inet: annotate date races around sk->sk_txhashEric Dumazet
UDP sendmsg() path can be lockless, it is possible for another thread to re-connect an change sk->sk_txhash under us. There is no serious impact, but we can use READ_ONCE()/WRITE_ONCE() pair to document the race. BUG: KCSAN: data-race in __ip4_datagram_connect / skb_set_owner_w write to 0xffff88813397920c of 4 bytes by task 30997 on cpu 1: sk_set_txhash include/net/sock.h:1937 [inline] __ip4_datagram_connect+0x69e/0x710 net/ipv4/datagram.c:75 __ip6_datagram_connect+0x551/0x840 net/ipv6/datagram.c:189 ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272 inet_dgram_connect+0xfd/0x180 net/ipv4/af_inet.c:580 __sys_connect_file net/socket.c:1837 [inline] __sys_connect+0x245/0x280 net/socket.c:1854 __do_sys_connect net/socket.c:1864 [inline] __se_sys_connect net/socket.c:1861 [inline] __x64_sys_connect+0x3d/0x50 net/socket.c:1861 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff88813397920c of 4 bytes by task 31039 on cpu 0: skb_set_hash_from_sk include/net/sock.h:2211 [inline] skb_set_owner_w+0x118/0x220 net/core/sock.c:2101 sock_alloc_send_pskb+0x452/0x4e0 net/core/sock.c:2359 sock_alloc_send_skb+0x2d/0x40 net/core/sock.c:2373 __ip6_append_data+0x1743/0x21a0 net/ipv6/ip6_output.c:1621 ip6_make_skb+0x258/0x420 net/ipv6/ip6_output.c:1983 udpv6_sendmsg+0x160a/0x16b0 net/ipv6/udp.c:1527 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:642 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350 ___sys_sendmsg net/socket.c:2404 [inline] __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490 __do_sys_sendmmsg net/socket.c:2519 [inline] __se_sys_sendmmsg net/socket.c:2516 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0xbca3c43d -> 0xfdb309e0 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 31039 Comm: syz-executor.2 Not tainted 5.13.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10net: annotate data race in sock_error()Eric Dumazet
sock_error() is known to be racy. The code avoids an atomic operation is sk_err is zero, and this field could be changed under us, this is fine. Sysbot reported: BUG: KCSAN: data-race in sock_alloc_send_pskb / unix_release_sock write to 0xffff888131855630 of 4 bytes by task 9365 on cpu 1: unix_release_sock+0x2e9/0x6e0 net/unix/af_unix.c:550 unix_release+0x2f/0x50 net/unix/af_unix.c:859 __sock_release net/socket.c:599 [inline] sock_close+0x6c/0x150 net/socket.c:1258 __fput+0x25b/0x4e0 fs/file_table.c:280 ____fput+0x11/0x20 fs/file_table.c:313 task_work_run+0xae/0x130 kernel/task_work.c:164 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:174 [inline] exit_to_user_mode_prepare+0x156/0x190 kernel/entry/common.c:208 __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline] syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:301 do_syscall_64+0x56/0x90 arch/x86/entry/common.c:57 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888131855630 of 4 bytes by task 9385 on cpu 0: sock_error include/net/sock.h:2269 [inline] sock_alloc_send_pskb+0xe4/0x4e0 net/core/sock.c:2336 unix_dgram_sendmsg+0x478/0x1610 net/unix/af_unix.c:1671 unix_seqpacket_sendmsg+0xc2/0x100 net/unix/af_unix.c:2055 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350 __sys_sendmsg_sock+0x25/0x30 net/socket.c:2416 io_sendmsg fs/io_uring.c:4367 [inline] io_issue_sqe+0x231a/0x6750 fs/io_uring.c:6135 __io_queue_sqe+0xe9/0x360 fs/io_uring.c:6414 __io_req_task_submit fs/io_uring.c:2039 [inline] io_async_task_func+0x312/0x590 fs/io_uring.c:5074 __tctx_task_work fs/io_uring.c:1910 [inline] tctx_task_work+0x1d4/0x3d0 fs/io_uring.c:1924 task_work_run+0xae/0x130 kernel/task_work.c:164 tracehook_notify_signal include/linux/tracehook.h:212 [inline] handle_signal_work kernel/entry/common.c:145 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0xf8/0x190 kernel/entry/common.c:208 __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline] syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:301 do_syscall_64+0x56/0x90 arch/x86/entry/common.c:57 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0x00000068 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 9385 Comm: syz-executor.3 Not tainted 5.13.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10Merge tag 'mlx5-updates-2021-06-09' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2021-06-09 Introduce steering header insert/remove and switchdev bridge offloads 1) From Yevgeny, Steering header insert/remove support ConnectX supports offloading of various encapsulations and decapsulations (e.g. VXLAN), which are performed by 'Packet Reformat' action. Starting with ConnectX-6 DX, a new reformat type is supported - INSERT_HEADER. This reformat allows inserting an arbitrary size buffer at a selected location in the packet on RX flows. The insert/remove header support are needed as a prerequisite for the bridge offloads vlan pop/push supprt, see below. 2) From Vlad, Support for bridge offloads for switchdev mode This change implements bridge offloads with VLAN-support that works on top of mlx5 representors in switchdev mode. HIGH-LEVEL OVERVIEW Hardware supported by mlx5 driver doesn't provide dynamic learning or aging functionality and requires the driver to emulate all switch-like behavior in software. As such, all packets by default go through miss path, appear on representor and get to software bridge, if it is the upper device of the representor. This causes bridge to process packet in software, learn the MAC address to FDB and send SWITCHDEV_FDB_ADD_TO_DEVICE event to all subscribers. Upon reception of SWITCHDEV_FDB_ADD_TO_DEVICE notification mlx5 bridge offloads the FDB to hardware and sends back SWITCHDEV_FDB_ADD_TO_BRIDGE notification to prevent such entries from being aged out by kernel bridge. Leaving aging to kernel bridge would result deletion of offloaded dynamic FDB entries every aging_time period due to packets being processed by hardware and, consecutively, 'used' timestamp for FDB entry not being updated. Hardware aging is emulated in driver by running periodic workqueue task that manually updates the rules according to their hardware counter: - If hardware counter has changed since last update, the handler updates 'used' timestamp in kernel bridge dynamic entry by sending SWITCHDEV_FDB_ADD_TO_BRIDGE notification for the entry. - If FDB entry wasn't updated for user-controllable aging_time period, then the FDB entry is unoffloaded from hardware and corresponding SWITCHDEV_FDB_DEL_TO_BRIDGE notification is sent to kernel bridge. The mlx5 bridge offload implementation fully supports port VLAN objects, including PVID (vlan push) and "Egress Untagged" (vlan pop). SOFTWARE ARCHITECTURE Mlx5_eswitch is extended with pointer to new mlx5_esw_bridge_offloads structure which has a linked list of mlx5_esw_bridge objects. Struct mlx5_esw_bridge is the main switch object in mlx5 that holds all data for offloaded FDB entries and metadata for bridge ports and their vlans. The mlx5_esw_bridge object is created when first representor of eswitch vport is added to bridge and deleted when the last representor is detached from it. Bridge FDB entries are saved in linked list (to iterate over all FDB entries in aging workqueue task) and also in hashtable for quick lookup by MAC+VLAN tuple. Bridge FDB entries are saved in linked list (to iterate over all FDB entries in aging workqueue task) and in hashtable for quick lookup by MAC+VLAN tuple. Port metadata is stored in struct mlx5_esw_bridge_port that is saved in xarray to allow quick lookup by vport number. Part of the port metadata is the set of port vlans that are represented by mlx5_esw_bridge_vlan structure. The vlan structure points to all FDBs on vlan/port via fdb_list linked list. Simplified diagram of mlx5 bridge objects: +------------------+ | mxl5_eswitch | | | | br_offloads | +--------+---------+ | +--------v-------------------+ | mlx5_esw_bridge_offloads | | | +--> bridges | | +-------+--------------------+ | | | | | +---v---------------+ | | mlx5_esw_bridge | | | | | | vports | | | | | | fdb_ht | | +---+---------------+ | | | +---v---------------+ +------+ mlx5_esw_bridge | | | +-------------------------+ vports | | | | | | fdb_ht +------------------------------------------+ | +-------------------+ | | | | | | +----------------------+ +---------------------------+ | +-> mlx5_esw_bridge_port | +--> mlx5_esw_bridge_fdb_entry <-+ | | | +----------------------+ | +--+------------------------+ | | | vlans +--+-> mlx5_esw_bridge_vlan | | | | | | | | | | | +--v------------------------+ | | +----------------------+ | | fdb_list +--+ | mlx5_esw_bridge_fdb_entry <-+ | | +-------^--------------+ +--+------------------------+ | | +----------------------+ | | | | +-> mlx5_esw_bridge_port | | +-----------------------+ | | | | | | vlans | | -----------------------+ | | | +-> mlx5_esw_bridge_vlan | | +----------------------+ | | +---------------------------+ | | fdb_list +-----> mlx5_esw_bridge_fdb_entry <-+ +-------^--------------+ +--+------------------------+ | | +-----------------------+ HARDWARE REPRESENTATION In order to adhere to kernel software datapath model bridge offloads must come after TC and NF FDBs. However, since netfilter offload in mlx5 is implemented with unmanaged tables, its miss path is not automatically connected to next priority and requires the code to manually connect with slow table. To keep bridge offloads encapsulated and not mix it with eswitch offloads new FDB_TC_MISS priority is created between FDB_FT_OFFLOAD and FDB_SLOW_PATH which allows bridge offloads to be created without exposing its internal tables to any other modules since miss path of managed TC-miss table is automatically wired to next priority. The bridge tables are created with new priority FDB_BR_OFFLOAD in FDB namespace. The new priority is between tc-miss and slow path priorities. Priority consist of two levels: the ingress table that is global per eswitch and matches incoming packets by src_mac/vid and redirects them to next level (egress table) that is chosen according to ingress port bridge membership and matches on dst_mac/vid in order to redirect packet to vport according to the following diagram: + | +---------v----------+ | | | FDB_TC_OFFLOAD | | | +---------+----------+ | | +---------v----------+ | | | FDB_FT_OFFLOAD | | | +---------+----------+ | | +---------v----------+ | | | FDB_TC_MISS | | | +---------+----------+ | +--------------------------------------+ | | | | +------+ | | | | | +------v--------+ FDB_BR_OFFLOAD | | | INGRESS_TABLE | | | +------+---+----+ | | | | match | | | +---------+ | | | | | +-------+ | | +-------v-------+ match | | | | | | EGRESS_TABLE +------------> vport | | | +-------+-------+ | | | | | | | +-------+ | | miss | | | +------+------+ | | | | +--------------------------------------+ | | +---------v----------+ | | | FDB_SLOW_PATH | | | +---------+----------+ | v PATCHES OVERVIEW 1-3 - Miscellaneous refactorings and infrastructure changes. 4 - Mlx5 bridge offload infrastructure and dedicated fs_core namespace/tables implementation. 5 - FDB entry offload. 6 - Dynamic FDB entry aging. 7-10 - VLAN filtering offload. 11 - Tracepoints for main mlx5 bridge offload events (FDB entry offload/unoffload, VLAN add/delete, etc.) ==================== Signed-off-by: David S. Miller <davem@davemloft.net> --
2021-06-10netlink: simplify NLMSG_DATA with NLMSG_HDRLENChen Li
The NLMSG_LENGTH(0) may confuse the API users, NLMSG_HDRLEN is much more clear. Besides, some code style problems are also fixed. Signed-off-by: Chen Li <chenli@uniontech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma fixes from Jason Gunthorpe: "A mixture of small bug fixes and a small security issue: - WARN_ON when IPoIB is automatically moved between namespaces - Long standing bug where mlx5 would use the wrong page for the doorbell recovery memory if fork is used - Security fix for mlx4 that disables the timestamp feature - Several crashers for mlx5 - Plug a recent mlx5 memory leak for the sig_mr" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: IB/mlx5: Fix initializing CQ fragments buffer RDMA/mlx5: Delete right entry from MR signature database RDMA: Verify port when creating flow rule RDMA/mlx5: Block FDB rules when not in switchdev mode RDMA/mlx4: Do not map the core_clock page to user space unless enabled RDMA/mlx5: Use different doorbell memory for different processes RDMA/ipoib: Fix warning caused by destroying non-initial netns
2021-06-10ACPI: Add \_SB._OSC bit for PRMErik Kaneda
Signed-off-by: Erik Kaneda <erik.kaneda@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-10ACPI: PRM: implement OperationRegion handler for the PlatformRtMechanism subtypeErik Kaneda
Platform Runtime Mechanism (PRM) is a firmware interface that exposes a set of binary executables that can either be called from the AML interpreter or device drivers by bypassing the AML interpreter. This change implements the AML interpreter path. According to the specification [1], PRM services are listed in an ACPI table called the PRMT. This patch parses module and handler information listed in the PRMT and registers the PlatformRtMechanism OpRegion handler before ACPI tables are loaded. Each service is defined by a 16-byte GUID and called from writing a 26-byte ASL buffer containing the identifier to a FieldUnit object defined inside a PlatformRtMechanism OperationRegion. OperationRegion (PRMR, PlatformRtMechanism, 0, 26) Field (PRMR, BufferAcc, NoLock, Preserve) { PRMF, 208 // Write to this field to invoke the OperationRegion Handler } The 26-byte ASL buffer is defined as the following: Byte Offset Byte Length Description ============================================================= 0 1 PRM OperationRegion handler status 1 8 PRM service status 9 1 PRM command 10 16 PRM handler GUID The ASL caller fills out a 26-byte buffer containing the PRM command and the PRM handler GUID like so: /* Local0 is the PRM data buffer */ Local0 = buffer (26){} /* Create byte fields over the buffer */ CreateByteField (Local0, 0x9, CMD) CreateField (Local0, 0x50, 0x80, GUID) /* Fill in the command and data fields of the data buffer */ CMD = 0 // run command GUID = ToUUID("xxxx-xx-xxx-xxxx") /* * Invoke PRM service with an ID that matches GUID and save the * result. */ Local0 = (\_SB.PRMT.PRMF = Local0) Byte offset 0 - 8 are written by the handler as a status passed back to AML and used by ASL like so: /* Create byte fields over the buffer */ CreateByteField (Local0, 0x0, PSTA) CreateQWordField (Local0, 0x1, USTA) In this ASL code, PSTA contains a status from the OperationRegion and USTA contains a status from the PRM service. The 26-byte buffer is recieved by acpi_platformrt_space_handler. This handler will look at the command value and the handler guid and take the approperiate actions. Command value Action ===================================================================== 0 Run the PRM service indicated by the PRM handler GUID (bytes 10-26) 1 Prevent PRM runtime updates from happening to the service's parent module 2 Allow PRM updates from happening to the service's parent module This patch enables command value 0. Link: https://uefi.org/sites/default/files/resources/Platform%20Runtime%20Mechanism%20-%20with%20legal%20notice.pdf # [1] Signed-off-by: Erik Kaneda <erik.kaneda@intel.com> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-10ACPICA: Add PRMT module header to facilitate parsingErik Kaneda
ACPICA commit bd46cb07e614fd85ea69e54c1f6f0ae0a5fb20ab This structure is used in to parse PRMT in other Operating Systems that relies on using subtable headers in order to parse ACPI tables. Although the PRMT doesn't have "subtables" it has a list of module information structures that act as subtables. Link: https://github.com/acpica/acpica/commit/bd46cb07 Signed-off-by: Erik Kaneda <erik.kaneda@intel.com> Signed-off-by: Bob Moore <robert.moore@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-10genirq: Move non-irqdomain handle_domain_irq() handling into ARM's handle_IRQ()Marc Zyngier
Despite the name, handle_domain_irq() deals with non-irqdomain handling for the sake of a handful of legacy ARM platforms. Move such handling into ARM's handle_IRQ(), allowing for better code generation for everyone else. This allows us get rid of some complexity, and to rearrange the guards on the various helpers in a more logical way. Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10genirq: Add generic_handle_domain_irq() helperMarc Zyngier
Provide generic_handle_domain_irq() as a pendent to handle_domain_irq() for non-root interrupt controllers Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10irqdesc: Fix __handle_domain_irq() commentMarc Zyngier
It appears that the comment about a NULL domain meaning anything has always been wrong. Fix it. Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10genirq: Use irq_resolve_mapping() to implement __handle_domain_irq() and coMarc Zyngier
In order to start reaping the benefits of irq_resolve_mapping(), start using it in __handle_domain_irq() and handle_domain_nmi(). This involves splitting generic_handle_irq() to be able to directly provide the irq_desc. Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10irqdomain: Introduce irq_resolve_mapping()Marc Zyngier
Rework irq_find_mapping() to return an both an irq_desc pointer, optionally the virtual irq number, and rename the result to __irq_resolve_mapping(). a new helper called irq_resolve_mapping() is provided for code that doesn't need the virtual irq number. irq_find_mapping() is also rewritten in terms of __irq_resolve_mapping(). Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10irqdomain: Protect the linear revmap with RCUMarc Zyngier
It is pretty odd that the radix tree uses RCU while the linear portion doesn't, leading to potential surprises for the users, depending on how the irqdomain has been created. Fix this by moving the update of the linear revmap under the mutex, and the lookup under the RCU read-side lock. The mutex name is updated to reflect that it doesn't only cover the radix-tree anymore. Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10irqdomain: Cache irq_data instead of a virq number in the revmapMarc Zyngier
Caching a virq number in the revmap is pretty inefficient, as it means we will need to convert it back to either an irq_data or irq_desc to do anything with it. It is also a bit odd, as the radix tree does cache irq_data pointers. Change the revmap type to be an irq_data pointer instead of an unsigned int, and preserve the current API for now. Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10irqdomain: Make normal and nomap irqdomains exclusiveMarc Zyngier
Direct mappings are completely exclusive of normal mappings, meaning that we can refactor the code slightly so that we can get rid of the revmap_direct_max_irq field and use the revmap_size field instead, reducing the size of the irqdomain structure. Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10powerpc: Move the use of irq_domain_add_nomap() behind a config optionMarc Zyngier
Only a handful of old PPC systems are still using the old 'nomap' variant of the irqdomain library. Move the associated definitions behind a configuration option, which will allow us to make some more radical changes. Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10irqdomain: Reimplement irq_linear_revmap() with irq_find_mapping()Marc Zyngier
irq_linear_revmap() is supposed to be a fast path for domain lookups, but it only exposes low-level details of the irqdomain implementation, details which are better kept private. The *overhead* between the two is only a function call and a couple of tests, so it is likely that noone can show any meaningful difference compared to the cost of taking an interrupt. Reimplement irq_linear_revmap() with irq_find_mapping() in order to preserve source code compatibility, and rename the internal field for a measure. Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10irqdomain: Kill irq_domain_add_legacy_isaMarc Zyngier
This helper doesn't have a user anymore, let's remove it. Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-09net/mlx5: Bridge, add offload infrastructureVlad Buslov
Create new files bridge.{c|h} in en/rep directory that implement bridge interaction with representor netdevices and handle required events/notifications, bridge.{c|h} in esw directory that implement all necessary eswitch offloading infrastructure and works on vport/eswitch level. Provide new kconfig MLX5_BRIDGE which is automatically selected when both kernel bridge and mlx5 eswitch configs are enabled. Provide basic infrastructure for bridge offloads: - struct mlx5_esw_bridge_offloads - per-eswitch bridge offload structure that encapsulates generic bridge-offloads data (notifier blocks, ingress flow table/group, etc.) that is created/deleted on enable/disable eswitch offloads. - struct mlx5_esw_bridge - per-bridge structure that encapsulates per-bridge data (reference counter, FDB, egress flow table/group, etc.) that is created when first eswitch represetor is attached to new bridge and deleted when last representor is removed from the bridge as a result of NETDEV_CHANGEUPPER event. The bridge tables are created with new priority FDB_BR_OFFLOAD in FDB namespace. The new priority is between tc-miss and slow path priorities. Priority consist of two levels: the ingress table that is global per eswitch and matches incoming packets by src_mac/vid and redirects them to next level (egress table) that is chosen according to ingress port bridge membership and matches on dst_mac/vid in order to redirect packet to vport according to the following diagram: + | +---------v----------+ | | | FDB_TC_OFFLOAD | | | +---------+----------+ | | +---------v----------+ | | | FDB_FT_OFFLOAD | | | +---------+----------+ | | +---------v----------+ | | | FDB_TC_MISS | | | +---------+----------+ | +--------------------------------------+ | | | | +------+ | | | | | +------v--------+ FDB_BR_OFFLOAD | | | INGRESS_TABLE | | | +------+---+----+ | | | | match | | | +---------+ | | | | | +-------+ | | +-------v-------+ match | | | | | | EGRESS_TABLE +------------> vport | | | +-------+-------+ | | | | | | | +-------+ | | miss | | | +------+------+ | | | | +--------------------------------------+ | | +---------v----------+ | | | FDB_SLOW_PATH | | | +---------+----------+ | v Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09net/mlx5: Create TC-miss priority and tableVlad Buslov
In order to adhere to kernel software datapath model bridge offloads must come after TC and NF FDBs. Following patches in this series add new FDB priority for bridge after FDB_FT_OFFLOAD. However, since netfilter offload is implemented with unmanaged tables, its miss path is not automatically connected to next priority and requires the code to manually connect with slow table. To keep bridge offloads encapsulated and not mix it with eswitch offloads, create a new FDB_TC_MISS priority between FDB_FT_OFFLOAD and FDB_SLOW_PATH: + | +---------v----------+ | | | FDB_TC_OFFLOAD | | | +---------+----------+ | | | +---------v----------+ | | | FDB_FT_OFFLOAD | | | +---------+----------+ | | | +---------v----------+ | | | FDB_TC_MISS | | | +---------+----------+ | | | +---------v----------+ | | | FDB_SLOW_PATH | | | +---------+----------+ | v Initialize the new priority with single default empty managed table and use the table as TC/NF miss patch instead of slow table. This approach allows bridge offloads to be created as new FDB namespace priority between FDB_TC_MISS and FDB_SLOW_PATH without exposing its internal tables to any other modules since miss path of managed TC-miss table is automatically wired to next priority. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09net/mlx5: Added new parameters to reformat contextYevgeny Kliteynik
Adding new reformat context type (INSERT_HEADER) requires adding two new parameters to reformat context - reformat_param_0 and reformat_param_1. As defined by HW spec, these parameters have different meaning for different reformat context type. The first parameter (reformat_param_0) is not new to HW spec, but it wasn't used by any of the supported reformats. The second parameter (reformat_param_1) is new to the HW spec - it was added to allow supporting INSERT_HEADER. For NSERT_HEADER, reformat_param_0 indicates the header used to reference the location of the inserted header, and reformat_param_1 indicates the offset of the inserted header from the reference point defined by reformat_param_0. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09net/mlx5: mlx5_ifc support for header insert/removeYevgeny Kliteynik
Add support for HCA caps 2 that contains capabilities for the new insert/remove header actions. Added the required definitions for supporting the new reformat type: added packet reformat parameters, reformat anchors and definitions to allow copy/set into the inserted EMD (Embedded MetaData) tag. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09net/mlx5e: Fix page reclaim for dead peer hairpinDima Chumak
When adding a hairpin flow, a firmware-side send queue is created for the peer net device, which claims some host memory pages for its internal ring buffer. If the peer net device is removed/unbound before the hairpin flow is deleted, then the send queue is not destroyed which leads to a stack trace on pci device remove: [ 748.005230] mlx5_core 0000:08:00.2: wait_func:1094:(pid 12985): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource [ 748.005231] mlx5_core 0000:08:00.2: reclaim_pages:514:(pid 12985): failed reclaiming pages: err -110 [ 748.001835] mlx5_core 0000:08:00.2: mlx5_reclaim_root_pages:653:(pid 12985): failed reclaiming pages (-110) for func id 0x0 [ 748.002171] ------------[ cut here ]------------ [ 748.001177] FW pages counter is 4 after reclaiming all pages [ 748.001186] WARNING: CPU: 1 PID: 12985 at drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:685 mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core] [ +0.002771] Modules linked in: cls_flower mlx5_ib mlx5_core ptp pps_core act_mirred sch_ingress openvswitch nsh xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_umad ib_ipoib iw_cm ib_cm ib_uverbs ib_core overlay fuse [last unloaded: pps_core] [ 748.007225] CPU: 1 PID: 12985 Comm: tee Not tainted 5.12.0+ #1 [ 748.001376] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 748.002315] RIP: 0010:mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core] [ 748.001679] Code: 28 00 00 00 0f 85 22 01 00 00 48 81 c4 b0 00 00 00 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c7 40 cc 19 a1 e8 9f 71 0e e2 <0f> 0b e9 30 ff ff ff 48 c7 c7 a0 cc 19 a1 e8 8c 71 0e e2 0f 0b e9 [ 748.003781] RSP: 0018:ffff88815220faf8 EFLAGS: 00010286 [ 748.001149] RAX: 0000000000000000 RBX: ffff8881b4900280 RCX: 0000000000000000 [ 748.001445] RDX: 0000000000000027 RSI: 0000000000000004 RDI: ffffed102a441f51 [ 748.001614] RBP: 00000000000032b9 R08: 0000000000000001 R09: ffffed1054a15ee8 [ 748.001446] R10: ffff8882a50af73b R11: ffffed1054a15ee7 R12: fffffbfff07c1e30 [ 748.001447] R13: dffffc0000000000 R14: ffff8881b492cba8 R15: 0000000000000000 [ 748.001429] FS: 00007f58bd08b580(0000) GS:ffff8882a5080000(0000) knlGS:0000000000000000 [ 748.001695] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 748.001309] CR2: 000055a026351740 CR3: 00000001d3b48006 CR4: 0000000000370ea0 [ 748.001506] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 748.001483] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 748.001654] Call Trace: [ 748.000576] ? mlx5_satisfy_startup_pages+0x290/0x290 [mlx5_core] [ 748.001416] ? mlx5_cmd_teardown_hca+0xa2/0xd0 [mlx5_core] [ 748.001354] ? mlx5_cmd_init_hca+0x280/0x280 [mlx5_core] [ 748.001203] mlx5_function_teardown+0x30/0x60 [mlx5_core] [ 748.001275] mlx5_uninit_one+0xa7/0xc0 [mlx5_core] [ 748.001200] remove_one+0x5f/0xc0 [mlx5_core] [ 748.001075] pci_device_remove+0x9f/0x1d0 [ 748.000833] device_release_driver_internal+0x1e0/0x490 [ 748.001207] unbind_store+0x19f/0x200 [ 748.000942] ? sysfs_file_ops+0x170/0x170 [ 748.001000] kernfs_fop_write_iter+0x2bc/0x450 [ 748.000970] new_sync_write+0x373/0x610 [ 748.001124] ? new_sync_read+0x600/0x600 [ 748.001057] ? lock_acquire+0x4d6/0x700 [ 748.000908] ? lockdep_hardirqs_on_prepare+0x400/0x400 [ 748.001126] ? fd_install+0x1c9/0x4d0 [ 748.000951] vfs_write+0x4d0/0x800 [ 748.000804] ksys_write+0xf9/0x1d0 [ 748.000868] ? __x64_sys_read+0xb0/0xb0 [ 748.000811] ? filp_open+0x50/0x50 [ 748.000919] ? syscall_enter_from_user_mode+0x1d/0x50 [ 748.001223] do_syscall_64+0x3f/0x80 [ 748.000892] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 748.001026] RIP: 0033:0x7f58bcfb22f7 [ 748.000944] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 [ 748.003925] RSP: 002b:00007fffd7f2aaa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 748.001732] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007f58bcfb22f7 [ 748.001426] RDX: 000000000000000d RSI: 00007fffd7f2abc0 RDI: 0000000000000003 [ 748.001746] RBP: 00007fffd7f2abc0 R08: 0000000000000000 R09: 0000000000000001 [ 748.001631] R10: 00000000000001b6 R11: 0000000000000246 R12: 000000000000000d [ 748.001537] R13: 00005597ac2c24a0 R14: 000000000000000d R15: 00007f58bd084700 [ 748.001564] irq event stamp: 0 [ 748.000787] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [ 748.001399] hardirqs last disabled at (0): [<ffffffff813132cf>] copy_process+0x146f/0x5eb0 [ 748.001854] softirqs last enabled at (0): [<ffffffff8131330e>] copy_process+0x14ae/0x5eb0 [ 748.013431] softirqs last disabled at (0): [<0000000000000000>] 0x0 [ 748.001492] ---[ end trace a6fabd773d1c51ae ]--- Fix by destroying the send queue of a hairpin peer net device that is being removed/unbound, which returns the allocated ring buffer pages to the host. Fixes: 4d8fcf216c90 ("net/mlx5e: Avoid unbounded peer devices when unpairing TC hairpin rules") Signed-off-by: Dima Chumak <dchumak@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Add nfgenmsg field to nfnetlink's struct nfnl_info and use it. 2) Remove nft_ctx_init_from_elemattr() and nft_ctx_init_from_setattr() helper functions. 3) Add the nf_ct_pernet() helper function to fetch the conntrack pernetns data area. 4) Expose TCP and UDP flowtable offload timeouts through sysctl, from Oz Shlomo. 5) Add nfnetlink_hook subsystem to fetch the netfilter hook pipeline configuration, from Florian Westphal. This also includes a new field to annotate the hook type as metadata. 6) Fix unsafe memory access to non-linear skbuff in the new SCTP chunk support for nft_exthdr, from Phil Sutter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09Merge tag 'compiler-attributes-for-linus-v5.13-rc6' of ↵Linus Torvalds
git://github.com/ojeda/linux Pull compiler attribute update from Miguel Ojeda: "A trivial update to the compiler attributes: Add 'continue' keyword to documentation in comment (from Wei Ming Chen)" * tag 'compiler-attributes-for-linus-v5.13-rc6' of git://github.com/ojeda/linux: Compiler Attributes: Add continue in comment
2021-06-09Merge tag 'mac80211-for-net-2021-06-09' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes berg says: ==================== A fair number of fixes: * fix more fallout from RTNL locking changes * fixes for some of the bugs found by syzbot * drop multicast fragments in mac80211 to align with the spec and what drivers are doing now * fix NULL-ptr deref in radiotap injection ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "Bugfixes, including a TLB flush fix that affects processors without nested page tables" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvm: fix previous commit for 32-bit builds kvm: avoid speculation-based attacks from out-of-range memslot accesses KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message selftests: kvm: Add support for customized slot0 memory size KVM: selftests: introduce P47V64 for s390x KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior KVM: X86: MMU: Use the correct inherited permissions to get shadow page KVM: LAPIC: Write 0 to TMICT should also cancel vmx-preemption timer KVM: SVM: Fix SEV SEND_START session length & SEND_UPDATE_DATA query length after commit 238eca821cee
2021-06-09misc: rtsx: separate aspm mode into MODE_REG and MODE_CFGRicky Wu
aspm (Active State Power Management) rtsx_comm_set_aspm: this function is for driver to make sure not enter power saving when processing of init and card_detcct ASPM_MODE_CFG: 8411 5209 5227 5229 5249 5250 Change back to use original way to control aspm ASPM_MODE_REG: 5227A 524A 5250A 5260 5261 5228 Keep the new way to control aspm Fixes: 121e9c6b5c4c ("misc: rtsx: modify and fix init_hw function") Reported-by: Chris Chiu <chris.chiu@canonical.com> Tested-by: Gordon Lack <gordon.lack@dsl.pipex.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Ricky Wu <ricky_wu@realtek.com> Link: https://lore.kernel.org/r/20210607101634.4948-1-ricky_wu@realtek.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-09spi: remove spi_set_cs_timing()Greg Kroah-Hartman
No one seems to be using this global and exported function, so remove it as it is no longer needed. Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20210609071918.2852069-1-gregkh@linuxfoundation.org Signed-off-by: Mark Brown <broonie@kernel.org>
2021-06-09xfrm: remove description from xfrm_type structFlorian Westphal
Its set but never read. Reduces size of xfrm_type to 64 bytes on 64bit. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-09kvm: fix previous commit for 32-bit buildsPaolo Bonzini
array_index_nospec does not work for uint64_t on 32-bit builds. However, the size of a memory slot must be less than 20 bits wide on those system, since the memory slot must fit in the user address space. So just store it in an unsigned long. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-08net: stmmac: explicitly deassert GMAC_AHB_RESETMatthew Hagan
We are currently assuming that GMAC_AHB_RESET will already be deasserted by the bootloader. However if this has not been done, probing of the GMAC will fail. To remedy this we must ensure GMAC_AHB_RESET has been deasserted prior to probing. v2 changes: - remove NULL condition check for stmmac_ahb_rst in stmmac_main.c - unwrap dev_err() message in stmmac_main.c - add PTR_ERR() around plat->stmmac_ahb_rst in stmmac_platform.c v3 changes: - add error pointer to dev_err() output - add reset_control_assert(stmmac_ahb_rst) in stmmac_dvr_remove - revert PTR_ERR() around plat->stmmac_ahb_rst since this is performed on the returned value of ret by the calling function Signed-off-by: Matthew Hagan <mnhagan88@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08net: wwan: make WWAN_PORT_MAX meaning less surprisedSergey Ryazanov
It is quite unusual when some value can not be equal to a defined range max value. Also most subsystems defines FOO_TYPE_MAX as a maximum valid value. So turn the WAN_PORT_MAX meaning from the number of supported port types to the maximum valid port type. Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Reviewed-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08net: stmmac: enable Intel mGbE 2.5Gbps link speedVoon Weifeng
The Intel mGbE supports 2.5Gbps link speed by increasing the clock rate by 2.5 times of the original rate. In this mode, the serdes/PHY operates at a serial baud rate of 3.125 Gbps and the PCS data path and GMII interface of the MAC operate at 312.5 MHz instead of 125 MHz. For Intel mGbE, the overclocking of 2.5 times clock rate to support 2.5G is only able to be configured in the BIOS during boot time. Kernel driver has no access to modify the clock rate for 1Gbps/2.5G mode. The way to determined the current 1G/2.5G mode is by reading a dedicated adhoc register through mdio bus. In short, after the system boot up, it is either in 1G mode or 2.5G mode which not able to be changed on the fly. Compared to 1G mode, the 2.5G mode selects the 2500BASEX as PHY interface and disables the xpcs_an_inband. This is to cater for some PHYs that only supports 2500BASEX PHY interface with no autonegotiation. v2: remove MAC supported link speed masking v3: Restructure to introduce intel_speed_mode_2500() to read serdes registers for max speed supported and select the appropritate configuration. Use max_speed to determine the supported link speed mask. Signed-off-by: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: Michael Sit Wei Hong <michael.wei.hong.sit@intel.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08net: pcs: add 2500BASEX support for Intel mGbE controllerVoon Weifeng
XPCS IP supports 2500BASEX as PHY interface. It is configured as autonegotiation disable to cater for PHYs that does not supports 2500BASEX autonegotiation. v2: Add supported link speed masking. v3: Restructure to introduce xpcs_config_2500basex() used to configure the xpcs for 2.5G speeds. Added 2500BASEX specific information for configuration. v4: Fix indentation error Signed-off-by: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: Michael Sit Wei Hong <michael.wei.hong.sit@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08rq-qos: fix missed wake-ups in rq_qos_throttle try twoJan Kara
Commit 545fbd0775ba ("rq-qos: fix missed wake-ups in rq_qos_throttle") tried to fix a problem that a process could be sleeping in rq_qos_wait() without anyone to wake it up. However the fix is not complete and the following can still happen: CPU1 (waiter1) CPU2 (waiter2) CPU3 (waker) rq_qos_wait() rq_qos_wait() acquire_inflight_cb() -> fails acquire_inflight_cb() -> fails completes IOs, inflight decreased prepare_to_wait_exclusive() prepare_to_wait_exclusive() has_sleeper = !wq_has_single_sleeper() -> true as there are two sleepers has_sleeper = !wq_has_single_sleeper() -> true io_schedule() io_schedule() Deadlock as now there's nobody to wakeup the two waiters. The logic automatically blocking when there are already sleepers is really subtle and the only way to make it work reliably is that we check whether there are some waiters in the queue when adding ourselves there. That way, we are guaranteed that at least the first process to enter the wait queue will recheck the waiting condition before going to sleep and thus guarantee forward progress. Fixes: 545fbd0775ba ("rq-qos: fix missed wake-ups in rq_qos_throttle") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210607112613.25344-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-08kvm: avoid speculation-based attacks from out-of-range memslot accessesPaolo Bonzini
KVM's mechanism for accessing guest memory translates a guest physical address (gpa) to a host virtual address using the right-shifted gpa (also known as gfn) and a struct kvm_memory_slot. The translation is performed in __gfn_to_hva_memslot using the following formula: hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE It is expected that gfn falls within the boundaries of the guest's physical memory. However, a guest can access invalid physical addresses in such a way that the gfn is invalid. __gfn_to_hva_memslot is called from kvm_vcpu_gfn_to_hva_prot, which first retrieves a memslot through __gfn_to_memslot. While __gfn_to_memslot does check that the gfn falls within the boundaries of the guest's physical memory or not, a CPU can speculate the result of the check and continue execution speculatively using an illegal gfn. The speculation can result in calculating an out-of-bounds hva. If the resulting host virtual address is used to load another guest physical address, this is effectively a Spectre gadget consisting of two consecutive reads, the second of which is data dependent on the first. Right now it's not clear if there are any cases in which this is exploitable. One interesting case was reported by the original author of this patch, and involves visiting guest page tables on x86. Right now these are not vulnerable because the hva read goes through get_user(), which contains an LFENCE speculation barrier. However, there are patches in progress for x86 uaccess.h to mask kernel addresses instead of using LFENCE; once these land, a guest could use speculation to read from the VMM's ring 3 address space. Other architectures such as ARM already use the address masking method, and would be susceptible to this same kind of data-dependent access gadgets. Therefore, this patch proactively protects from these attacks by masking out-of-bounds gfns in __gfn_to_hva_memslot, which blocks speculation of invalid hvas. Sean Christopherson noted that this patch does not cover kvm_read_guest_offset_cached. This however is limited to a few bytes past the end of the cache, and therefore it is unlikely to be useful in the context of building a chain of data dependent accesses. Reported-by: Artemiy Margaritov <artemiy.margaritov@gmail.com> Co-developed-by: Artemiy Margaritov <artemiy.margaritov@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-08block: return the correct bvec when checking for gapsLong Li
After commit 07173c3ec276 ("block: enable multipage bvecs"), a bvec can have multiple pages. But bio_will_gap() still assumes one page bvec while checking for merging. If the pages in the bvec go across the seg_boundary_mask, this check for merging can potentially succeed if only the 1st page is tested, and can fail if all the pages are tested. Later, when SCSI builds the SG list the same check for merging is done in __blk_segment_map_sg_merge() with all the pages in the bvec tested. This time the check may fail if the pages in bvec go across the seg_boundary_mask (but tested okay in bio_will_gap() earlier, so those BIOs were merged). If this check fails, we end up with a broken SG list for drivers assuming the SG list not having offsets in intermediate pages. This results in incorrect pages written to the disk. Fix this by returning the multi-page bvec when testing gaps for merging. Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 07173c3ec276 ("block: enable multipage bvecs") Signed-off-by: Long Li <longli@microsoft.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/1623094445-22332-1-git-send-email-longli@linuxonhyperv.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-08seqlock: Remove trailing semicolon in macrosHuilong Deng
Macros should not use a trailing semicolon. Signed-off-by: Huilong Deng <denghuilong@cdjrlc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210605045302.37154-1-denghuilong@cdjrlc.com
2021-06-08Merge tag 'orphans-v5.13-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull orphan section fixes from Kees Cook: "These two corner case fixes have been in -next for about a week: - Avoid orphan section in ARM cpuidle (Arnd Bergmann) - Avoid orphan section with !SMP (Nathan Chancellor)" * tag 'orphans-v5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: vmlinux.lds.h: Avoid orphan section with !SMP ARM: cpuidle: Avoid orphan section warning
2021-06-08Merge tag 'regulator-fix-v5.13-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fixes from Mark Brown: "A collection of fixes for the regulator API that have come up since the merge window, including a big batch of fixes from Axel Lin's usual careful and detailed review. The one stand out fix here is Dmitry Baryshkov's fix for an issue where we fail to power on the parents of always on regulators during system startup if they weren't already powered on" * tag 'regulator-fix-v5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (21 commits) regulator: rt4801: Fix NULL pointer dereference if priv->enable_gpios is NULL regulator: hi6421v600: Fix .vsel_mask setting regulator: bd718x7: Fix the BUCK7 voltage setting on BD71837 regulator: atc260x: Fix n_voltages and min_sel for pickable linear ranges regulator: rtmv20: Fix to make regcache value first reading back from HW regulator: mt6315: Fix function prototype for mt6315_map_mode regulator: rtmv20: Add Richtek to Kconfig text regulator: rtmv20: Fix .set_current_limit/.get_current_limit callbacks regulator: hisilicon: use the correct HiSilicon copyright regulator: bd71828: Fix .n_voltages settings regulator: bd70528: Fix off-by-one for buck123 .n_voltages setting regulator: max77620: Silence deferred probe error regulator: max77620: Use device_set_of_node_from_dev() regulator: scmi: Fix off-by-one for linear regulators .n_voltages setting regulator: core: resolve supply for boot-on/always-on regulators regulator: fixed: Ensure enable_counter is correct if reg_domain_disable fails regulator: Check ramp_delay_table for regulator_set_ramp_delay_regmap regulator: fan53880: Fix missing n_voltages setting regulator: da9121: Return REGULATOR_MODE_INVALID for invalid mode regulator: fan53555: fix TCS4525 voltage calulation ...
2021-06-08media: uapi: Add a control for HANTRO driverBenjamin Gaignard
The HEVC HANTRO driver needs to know the number of bits to skip at the beginning of the slice header. That is a hardware specific requirement so create a dedicated control for this purpose. Signed-off-by: Benjamin Gaignard <benjamin.gaignard@collabora.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2021-06-08media: hevc: Add decode params controlBenjamin Gaignard
Add decode params control and the associated structure to group all the information that are needed to decode a reference frame as is described in ITU-T Rec. H.265 section "8.3.2 Decoding process for reference picture set". Adapt Cedrus driver to these changes. Signed-off-by: Benjamin Gaignard <benjamin.gaignard@collabora.com> Reviewed-by: Ezequiel Garcia <ezequiel@collabora.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>