summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-11-25Merge branch 'x86/core' into perf/core, to resolve conflicts and to pick up ↵Ingo Molnar
completed topic tree Conflicts: tools/perf/check-headers.sh Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-25Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-25Merge branch 'x86/build' into x86/asm, to pick up completed topic branchIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-25x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make ↵Ingo Molnar
the CPU_ENTRY_AREA_PAGES assert precise When two recent commits that increased the size of the 'struct cpu_entry_area' were merged in -tip, the 32-bit defconfig build started failing on the following build time assert: ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_189’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE arch/x86/mm/cpu_entry_area.c:189:2: note: in expansion of macro ‘BUILD_BUG_ON’ In function ‘setup_cpu_entry_area_ptes’, Which corresponds to the following build time assert: BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE); The purpose of this assert is to sanity check the fixed-value definition of CPU_ENTRY_AREA_PAGES arch/x86/include/asm/pgtable_32_types.h: #define CPU_ENTRY_AREA_PAGES (NR_CPUS * 41) The '41' is supposed to match sizeof(struct cpu_entry_area)/PAGE_SIZE, which value we didn't want to define in such a low level header, because it would cause dependency hell. Every time the size of cpu_entry_area is changed, we have to adjust CPU_ENTRY_AREA_PAGES accordingly - and this assert is checking that constraint. But the assert is both imprecise and buggy, primarily because it doesn't include the single readonly IDT page that is mapped at CPU_ENTRY_AREA_BASE (which begins at a PMD boundary). This bug was hidden by the fact that by accident CPU_ENTRY_AREA_PAGES is defined too large upstream (v5.4-rc8): #define CPU_ENTRY_AREA_PAGES (NR_CPUS * 40) While 'struct cpu_entry_area' is 155648 bytes, or 38 pages. So we had two extra pages, which hid the bug. The following commit (not yet upstream) increased the size to 40 pages: x86/iopl: ("Restrict iopl() permission scope") ... but increased CPU_ENTRY_AREA_PAGES only 41 - i.e. shortening the gap to just 1 extra page. Then another not-yet-upstream commit changed the size again: 880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") Which increased the cpu_entry_area size from 38 to 39 pages, but didn't change CPU_ENTRY_AREA_PAGES (kept it at 40). This worked fine, because we still had a page left from the accidental 'reserve'. But when these two commits were merged into the same tree, the combined size of cpu_entry_area grew from 38 to 40 pages, while CPU_ENTRY_AREA_PAGES finally caught up to 40 as well. Which is fine in terms of functionality, but the assert broke: BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE); because CPU_ENTRY_AREA_MAP_SIZE is the total size of the area, which is 1 page larger due to the IDT page. To fix all this, change the assert to two precise asserts: BUILD_BUG_ON((CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE); BUILD_BUG_ON(CPU_ENTRY_AREA_TOTAL_SIZE != CPU_ENTRY_AREA_MAP_SIZE); This takes the IDT page into account, and also connects the size-based define of CPU_ENTRY_AREA_TOTAL_SIZE with the address-subtraction based define of CPU_ENTRY_AREA_MAP_SIZE. Also clean up some of the names which made it rather confusing: - 'CPU_ENTRY_AREA_TOT_SIZE' wasn't actually the 'total' size of the cpu-entry-area, but the per-cpu array size, so rename this to CPU_ENTRY_AREA_ARRAY_SIZE. - Introduce CPU_ENTRY_AREA_TOTAL_SIZE that _is_ the total mapping size, with the IDT included. - Add comments where '+1' denotes the IDT mapping - it wasn't obvious and took me about 3 hours to decode... Finally, because this particular commit is actually applied after this patch: 880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") Fix the CPU_ENTRY_AREA_PAGES value from 40 pages to the correct 39 pages. All future commits that change cpu_entry_area will have to adjust this value precisely. As a side note, we should probably attempt to remove CPU_ENTRY_AREA_PAGES and derive its value directly from the structure, without causing header hell - but that is an adventure for another day! :-) Fixes: 880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: stable@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-25cifs: dump channel info in DebugDataAurelien Aptel
* show server&TCP states for extra channels * mention if an interface has a channel connected to it In this version three of the patch, fixed minor printk format issue pointed out by the kbuild robot. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25smb3: dump in_send and num_waiters stats counters by defaultSteve French
Number of requests in_send and the number of waiters on sendRecv are useful counters in various cases, move them from CONFIG_CIFS_STATS2 to be on by default especially with multichannel Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2019-11-25cifs: try harder to open new channelsAurelien Aptel
Previously we would only loop over the iface list once. This patch tries to loop over multiple times until all channels are opened. It will also try to reuse RSS ifaces. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25CIFS: Properly process SMB3 lease breaksPavel Shilovsky
Currenly we doesn't assume that a server may break a lease from RWH to RW which causes us setting a wrong lease state on a file and thus mistakenly flushing data and byte-range locks and purging cached data on the client. This leads to performance degradation because subsequent IOs go directly to the server. Fix this by propagating new lease state and epoch values to the oplock break handler through cifsFileInfo structure and removing the use of cifsInodeInfo flags for that. It allows to avoid some races of several lease/oplock breaks using those flags in parallel. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: move cifsFileInfo_put logic into a work-queueRonnie Sahlberg
This patch moves the final part of the cifsFileInfo_put() logic where we need a write lock on lock_sem to be processed in a separate thread that holds no other locks. This is to prevent deadlocks like the one below: > there are 6 processes looping to while trying to down_write > cinode->lock_sem, 5 of them from _cifsFileInfo_put, and one from > cifs_new_fileinfo > > and there are 5 other processes which are blocked, several of them > waiting on either PG_writeback or PG_locked (which are both set), all > for the same page of the file > > 2 inode_lock() (inode->i_rwsem) for the file > 1 wait_on_page_writeback() for the page > 1 down_read(inode->i_rwsem) for the inode of the directory > 1 inode_lock()(inode->i_rwsem) for the inode of the directory > 1 __lock_page > > > so processes are blocked waiting on: > page flags PG_locked and PG_writeback for one specific page > inode->i_rwsem for the directory > inode->i_rwsem for the file > cifsInodeInflock_sem > > > > here are the more gory details (let me know if I need to provide > anything more/better): > > [0 00:48:22.765] [UN] PID: 8863 TASK: ffff8c691547c5c0 CPU: 3 > COMMAND: "reopen_file" > #0 [ffff9965007e3ba8] __schedule at ffffffff9b6e6095 > #1 [ffff9965007e3c38] schedule at ffffffff9b6e64df > #2 [ffff9965007e3c48] rwsem_down_write_slowpath at ffffffff9af283d7 > #3 [ffff9965007e3cb8] legitimize_path at ffffffff9b0f975d > #4 [ffff9965007e3d08] path_openat at ffffffff9b0fe55d > #5 [ffff9965007e3dd8] do_filp_open at ffffffff9b100a33 > #6 [ffff9965007e3ee0] do_sys_open at ffffffff9b0eb2d6 > #7 [ffff9965007e3f38] do_syscall_64 at ffffffff9ae04315 > * (I think legitimize_path is bogus) > > in path_openat > } else { > const char *s = path_init(nd, flags); > while (!(error = link_path_walk(s, nd)) && > (error = do_last(nd, file, op)) > 0) { <<<< > > do_last: > if (open_flag & O_CREAT) > inode_lock(dir->d_inode); <<<< > else > so it's trying to take inode->i_rwsem for the directory > > DENTRY INODE SUPERBLK TYPE PATH > ffff8c68bb8e79c0 ffff8c691158ef20 ffff8c6915bf9000 DIR /mnt/vm1_smb/ > inode.i_rwsem is ffff8c691158efc0 > > <struct rw_semaphore 0xffff8c691158efc0>: > owner: <struct task_struct 0xffff8c6914275d00> (UN - 8856 - > reopen_file), counter: 0x0000000000000003 > waitlist: 2 > 0xffff9965007e3c90 8863 reopen_file UN 0 1:29:22.926 > RWSEM_WAITING_FOR_WRITE > 0xffff996500393e00 9802 ls UN 0 1:17:26.700 > RWSEM_WAITING_FOR_READ > > > the owner of the inode.i_rwsem of the directory is: > > [0 00:00:00.109] [UN] PID: 8856 TASK: ffff8c6914275d00 CPU: 3 > COMMAND: "reopen_file" > #0 [ffff99650065b828] __schedule at ffffffff9b6e6095 > #1 [ffff99650065b8b8] schedule at ffffffff9b6e64df > #2 [ffff99650065b8c8] schedule_timeout at ffffffff9b6e9f89 > #3 [ffff99650065b940] msleep at ffffffff9af573a9 > #4 [ffff99650065b948] _cifsFileInfo_put.cold.63 at ffffffffc0a42dd6 [cifs] > #5 [ffff99650065ba38] cifs_writepage_locked at ffffffffc0a0b8f3 [cifs] > #6 [ffff99650065bab0] cifs_launder_page at ffffffffc0a0bb72 [cifs] > #7 [ffff99650065bb30] invalidate_inode_pages2_range at ffffffff9b04d4bd > #8 [ffff99650065bcb8] cifs_invalidate_mapping at ffffffffc0a11339 [cifs] > #9 [ffff99650065bcd0] cifs_revalidate_mapping at ffffffffc0a1139a [cifs] > #10 [ffff99650065bcf0] cifs_d_revalidate at ffffffffc0a014f6 [cifs] > #11 [ffff99650065bd08] path_openat at ffffffff9b0fe7f7 > #12 [ffff99650065bdd8] do_filp_open at ffffffff9b100a33 > #13 [ffff99650065bee0] do_sys_open at ffffffff9b0eb2d6 > #14 [ffff99650065bf38] do_syscall_64 at ffffffff9ae04315 > > cifs_launder_page is for page 0xffffd1e2c07d2480 > > crash> page.index,mapping,flags 0xffffd1e2c07d2480 > index = 0x8 > mapping = 0xffff8c68f3cd0db0 > flags = 0xfffffc0008095 > > PAGE-FLAG BIT VALUE > PG_locked 0 0000001 > PG_uptodate 2 0000004 > PG_lru 4 0000010 > PG_waiters 7 0000080 > PG_writeback 15 0008000 > > > inode is ffff8c68f3cd0c40 > inode.i_rwsem is ffff8c68f3cd0ce0 > DENTRY INODE SUPERBLK TYPE PATH > ffff8c68a1f1b480 ffff8c68f3cd0c40 ffff8c6915bf9000 REG > /mnt/vm1_smb/testfile.8853 > > > this process holds the inode->i_rwsem for the parent directory, is > laundering a page attached to the inode of the file it's opening, and in > _cifsFileInfo_put is trying to down_write the cifsInodeInflock_sem > for the file itself. > > > <struct rw_semaphore 0xffff8c68f3cd0ce0>: > owner: <struct task_struct 0xffff8c6914272e80> (UN - 8854 - > reopen_file), counter: 0x0000000000000003 > waitlist: 1 > 0xffff9965005dfd80 8855 reopen_file UN 0 1:29:22.912 > RWSEM_WAITING_FOR_WRITE > > this is the inode.i_rwsem for the file > > the owner: > > [0 00:48:22.739] [UN] PID: 8854 TASK: ffff8c6914272e80 CPU: 2 > COMMAND: "reopen_file" > #0 [ffff99650054fb38] __schedule at ffffffff9b6e6095 > #1 [ffff99650054fbc8] schedule at ffffffff9b6e64df > #2 [ffff99650054fbd8] io_schedule at ffffffff9b6e68e2 > #3 [ffff99650054fbe8] __lock_page at ffffffff9b03c56f > #4 [ffff99650054fc80] pagecache_get_page at ffffffff9b03dcdf > #5 [ffff99650054fcc0] grab_cache_page_write_begin at ffffffff9b03ef4c > #6 [ffff99650054fcd0] cifs_write_begin at ffffffffc0a064ec [cifs] > #7 [ffff99650054fd30] generic_perform_write at ffffffff9b03bba4 > #8 [ffff99650054fda8] __generic_file_write_iter at ffffffff9b04060a > #9 [ffff99650054fdf0] cifs_strict_writev.cold.70 at ffffffffc0a4469b [cifs] > #10 [ffff99650054fe48] new_sync_write at ffffffff9b0ec1dd > #11 [ffff99650054fed0] vfs_write at ffffffff9b0eed35 > #12 [ffff99650054ff00] ksys_write at ffffffff9b0eefd9 > #13 [ffff99650054ff38] do_syscall_64 at ffffffff9ae04315 > > the process holds the inode->i_rwsem for the file to which it's writing, > and is trying to __lock_page for the same page as in the other processes > > > the other tasks: > [0 00:00:00.028] [UN] PID: 8859 TASK: ffff8c6915479740 CPU: 2 > COMMAND: "reopen_file" > #0 [ffff9965007b39d8] __schedule at ffffffff9b6e6095 > #1 [ffff9965007b3a68] schedule at ffffffff9b6e64df > #2 [ffff9965007b3a78] schedule_timeout at ffffffff9b6e9f89 > #3 [ffff9965007b3af0] msleep at ffffffff9af573a9 > #4 [ffff9965007b3af8] cifs_new_fileinfo.cold.61 at ffffffffc0a42a07 [cifs] > #5 [ffff9965007b3b78] cifs_open at ffffffffc0a0709d [cifs] > #6 [ffff9965007b3cd8] do_dentry_open at ffffffff9b0e9b7a > #7 [ffff9965007b3d08] path_openat at ffffffff9b0fe34f > #8 [ffff9965007b3dd8] do_filp_open at ffffffff9b100a33 > #9 [ffff9965007b3ee0] do_sys_open at ffffffff9b0eb2d6 > #10 [ffff9965007b3f38] do_syscall_64 at ffffffff9ae04315 > > this is opening the file, and is trying to down_write cinode->lock_sem > > > [0 00:00:00.041] [UN] PID: 8860 TASK: ffff8c691547ae80 CPU: 2 > COMMAND: "reopen_file" > [0 00:00:00.057] [UN] PID: 8861 TASK: ffff8c6915478000 CPU: 3 > COMMAND: "reopen_file" > [0 00:00:00.059] [UN] PID: 8858 TASK: ffff8c6914271740 CPU: 2 > COMMAND: "reopen_file" > [0 00:00:00.109] [UN] PID: 8862 TASK: ffff8c691547dd00 CPU: 6 > COMMAND: "reopen_file" > #0 [ffff9965007c3c78] __schedule at ffffffff9b6e6095 > #1 [ffff9965007c3d08] schedule at ffffffff9b6e64df > #2 [ffff9965007c3d18] schedule_timeout at ffffffff9b6e9f89 > #3 [ffff9965007c3d90] msleep at ffffffff9af573a9 > #4 [ffff9965007c3d98] _cifsFileInfo_put.cold.63 at ffffffffc0a42dd6 [cifs] > #5 [ffff9965007c3e88] cifs_close at ffffffffc0a07aaf [cifs] > #6 [ffff9965007c3ea0] __fput at ffffffff9b0efa6e > #7 [ffff9965007c3ee8] task_work_run at ffffffff9aef1614 > #8 [ffff9965007c3f20] exit_to_usermode_loop at ffffffff9ae03d6f > #9 [ffff9965007c3f38] do_syscall_64 at ffffffff9ae0444c > > closing the file, and trying to down_write cifsi->lock_sem > > > [0 00:48:22.839] [UN] PID: 8857 TASK: ffff8c6914270000 CPU: 7 > COMMAND: "reopen_file" > #0 [ffff9965006a7cc8] __schedule at ffffffff9b6e6095 > #1 [ffff9965006a7d58] schedule at ffffffff9b6e64df > #2 [ffff9965006a7d68] io_schedule at ffffffff9b6e68e2 > #3 [ffff9965006a7d78] wait_on_page_bit at ffffffff9b03cac6 > #4 [ffff9965006a7e10] __filemap_fdatawait_range at ffffffff9b03b028 > #5 [ffff9965006a7ed8] filemap_write_and_wait at ffffffff9b040165 > #6 [ffff9965006a7ef0] cifs_flush at ffffffffc0a0c2fa [cifs] > #7 [ffff9965006a7f10] filp_close at ffffffff9b0e93f1 > #8 [ffff9965006a7f30] __x64_sys_close at ffffffff9b0e9a0e > #9 [ffff9965006a7f38] do_syscall_64 at ffffffff9ae04315 > > in __filemap_fdatawait_range > wait_on_page_writeback(page); > for the same page of the file > > > > [0 00:48:22.718] [UN] PID: 8855 TASK: ffff8c69142745c0 CPU: 7 > COMMAND: "reopen_file" > #0 [ffff9965005dfc98] __schedule at ffffffff9b6e6095 > #1 [ffff9965005dfd28] schedule at ffffffff9b6e64df > #2 [ffff9965005dfd38] rwsem_down_write_slowpath at ffffffff9af283d7 > #3 [ffff9965005dfdf0] cifs_strict_writev at ffffffffc0a0c40a [cifs] > #4 [ffff9965005dfe48] new_sync_write at ffffffff9b0ec1dd > #5 [ffff9965005dfed0] vfs_write at ffffffff9b0eed35 > #6 [ffff9965005dff00] ksys_write at ffffffff9b0eefd9 > #7 [ffff9965005dff38] do_syscall_64 at ffffffff9ae04315 > > inode_lock(inode); > > > and one 'ls' later on, to see whether the rest of the mount is available > (the test file is in the root, so we get blocked up on the directory > ->i_rwsem), so the entire mount is unavailable > > [0 00:36:26.473] [UN] PID: 9802 TASK: ffff8c691436ae80 CPU: 4 > COMMAND: "ls" > #0 [ffff996500393d28] __schedule at ffffffff9b6e6095 > #1 [ffff996500393db8] schedule at ffffffff9b6e64df > #2 [ffff996500393dc8] rwsem_down_read_slowpath at ffffffff9b6e9421 > #3 [ffff996500393e78] down_read_killable at ffffffff9b6e95e2 > #4 [ffff996500393e88] iterate_dir at ffffffff9b103c56 > #5 [ffff996500393ec8] ksys_getdents64 at ffffffff9b104b0c > #6 [ffff996500393f30] __x64_sys_getdents64 at ffffffff9b104bb6 > #7 [ffff996500393f38] do_syscall_64 at ffffffff9ae04315 > > in iterate_dir: > if (shared) > res = down_read_killable(&inode->i_rwsem); <<<< > else > res = down_write_killable(&inode->i_rwsem); > Reported-by: Frank Sorenson <sorenson@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: try opening channels after mountingAurelien Aptel
After doing mount() successfully we call cifs_try_adding_channels() which will open as many channels as it can. Channels are closed when the master session is closed. The master connection becomes the first channel. ,-------------> global cifs_tcp_ses_list <-------------------------. | | '- TCP_Server_Info <--> TCP_Server_Info <--> TCP_Server_Info <-' (master con) (chan#1 con) (chan#2 con) | ^ ^ ^ v '--------------------|--------------------' cifs_ses | - chan_count = 3 | - chans[] ---------------------' - smb3signingkey[] (master signing key) Note how channel connections don't have sessions. That's because cifs_ses can only be part of one linked list (list_head are internal to the elements). For signing keys, each channel has its own signing key which must be used only after the channel has been bound. While it's binding it must use the master session signing key. For encryption keys, since channel connections do not have sessions attached we must now find matching session by looping over all sessions in smb2_get_enc_key(). Each channel is opened like a regular server connection but at the session setup request step it must set the SMB2_SESSION_REQ_FLAG_BINDING flag and use the session id to bind to. Finally, while sending in compound_send_recv() for requests that aren't negprot, ses-setup or binding related, use a channel by cycling through the available ones (round-robin). Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25CIFS: refactor cifs_get_inode_info()Aurelien Aptel
Make logic of cifs_get_inode() much clearer by moving code to sub functions and adding comments. Document the steps this function does. cifs_get_inode_info() gets and updates a file inode metadata from its file path. * If caller already has raw info data from server they can pass it. * If inode already exists (just need to update) caller can pass it. Step 1: get raw data from server if none was passed Step 2: parse raw data into intermediate internal cifs_fattr struct Step 3: set fattr uniqueid which is later used for inode number. This can sometime be done from raw data Step 4: tweak fattr according to mount options (file_mode, acl to mode bits, uid, gid, etc) Step 5: update or create inode from final fattr struct * add is_smb1_server() helper * add is_inode_cache_good() helper * move SMB1-backupcreds-getinfo-retry to separate func cifs_backup_query_path_info(). * move set-uniqueid code to separate func cifs_set_fattr_ino() * don't clobber uniqueid from backup cred retry * fix some probable corner cases memleaks Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: switch servers depending on binding stateAurelien Aptel
Currently a lot of the code to initialize a connection & session uses the cifs_ses as input. But depending on if we are opening a new session or a new channel we need to use different server pointers. Add a "binding" flag in cifs_ses and a helper function that returns the server ptr a session should use (only in the sess establishment code path). Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: add server paramAurelien Aptel
As we get down to the transport layer, plenty of functions are passed the session pointer and assume the transport to use is ses->server. Instead we modify those functions to pass (ses, server) so that we can decouple the session from the server. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: add multichannel mount options and data structsAurelien Aptel
adds: - [no]multichannel to enable/disable multichannel - max_channels=N to control how many channels to create these options are then stored in the volume struct. - store channels and max_channels in cifs_ses Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: sort interface list by speedAurelien Aptel
New channels are going to be opened by walking the list sequentially, so by sorting it we will connect to the fastest interfaces first. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25CIFS: Fix SMB2 oplock break processingPavel Shilovsky
Even when mounting modern protocol version the server may be configured without supporting SMB2.1 leases and the client uses SMB2 oplock to optimize IO performance through local caching. However there is a problem in oplock break handling that leads to missing a break notification on the client who has a file opened. It latter causes big latencies to other clients that are trying to open the same file. The problem reproduces when there are multiple shares from the same server mounted on the client. The processing code tries to match persistent and volatile file ids from the break notification with an open file but it skips all share besides the first one. Fix this by looking up in all shares belonging to the server that issued the oplock break. Cc: Stable <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: don't use 'pre:' for MODULE_SOFTDEPRonnie Sahlberg
It can cause to fail with modprobe: FATAL: Module <module> is builtin. RHBZ: 1767094 Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: smbd: Return -EAGAIN when transport is reconnectingLong Li
During reconnecting, the transport may have already been destroyed and is in the process being reconnected. In this case, return -EAGAIN to not fail and to retry this I/O. Signed-off-by: Long Li <longli@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: smbd: Only queue work for error recovery on memory registrationLong Li
It's not necessary to queue invalidated memory registration to work queue, as all we need to do is to unmap the SG and make it usable again. This can save CPU cycles in normal data paths as memory registration errors are rare and normally only happens during reconnection. Signed-off-by: Long Li <longli@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25smb3: add debug messages for closing unmatched openRonnie Sahlberg
Helps distinguish between an interrupted close and a truly unmatched open. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25CIFS: Do not miss cancelled OPEN responsesPavel Shilovsky
When an OPEN command is cancelled we mark a mid as cancelled and let the demultiplex thread process it by closing an open handle. The problem is there is a race between a system call thread and the demultiplex thread and there may be a situation when the mid has been already processed before it is set as cancelled. Fix this by processing cancelled requests when mids are being destroyed which means that there is only one thread referencing a particular mid. Also set mids as cancelled unconditionally on their state. Cc: Stable <stable@vger.kernel.org> Tested-by: Frank Sorenson <sorenson@redhat.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25CIFS: Fix NULL pointer dereference in mid callbackPavel Shilovsky
There is a race between a system call processing thread and the demultiplex thread when mid->resp_buf becomes NULL and later is being accessed to get credits. It happens when the 1st thread wakes up before a mid callback is called in the 2nd one but the mid state has already been set to MID_RESPONSE_RECEIVED. This causes NULL pointer dereference in mid callback. Fix this by saving credits from the response before we update the mid state and then use this value in the mid callback rather then accessing a response buffer. Cc: Stable <stable@vger.kernel.org> Fixes: ee258d79159afed5 ("CIFS: Move credit processing to mid callbacks for SMB3") Tested-by: Frank Sorenson <sorenson@redhat.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25CIFS: Close open handle after interrupted closePavel Shilovsky
If Close command is interrupted before sending a request to the server the client ends up leaking an open file handle. This wastes server resources and can potentially block applications that try to remove the file or any directory containing this file. Fix this by putting the close command into a worker queue, so another thread retries it later. Cc: Stable <stable@vger.kernel.org> Tested-by: Frank Sorenson <sorenson@redhat.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25CIFS: Respect O_SYNC and O_DIRECT flags during reconnectPavel Shilovsky
Currently the client translates O_SYNC and O_DIRECT flags into corresponding SMB create options when openning a file. The problem is that on reconnect when the file is being re-opened the client doesn't set those flags and it causes a server to reject re-open requests because create options don't match. The latter means that any subsequent system call against that open file fail until a share is re-mounted. Fix this by properly setting SMB create options when re-openning files after reconnects. Fixes: 1013e760d10e6: ("SMB3: Don't ignore O_SYNC/O_DSYNC and O_DIRECT flags") Cc: Stable <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25smb3: remove confusing dmesg when mounting with encryption ("seal")Steve French
The smb2/smb3 message checking code was logging to dmesg when mounting with encryption ("seal") for compounded SMB3 requests. When encrypted the whole frame (including potentially multiple compounds) is read so the length field is longer than in the case of non-encrypted case (where length field will match the the calculated length for the particular SMB3 request in the compound being validated). Avoids the warning on mount (with "seal"): "srv rsp padded more than expected. Length 384 not ..." Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: close the shared root handle on tree disconnectRonnie Sahlberg
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25CIFS: Return directly after a failed build_path_from_dentry() in ↵Markus Elfring
cifs_do_create() Return directly after a call of the function "build_path_from_dentry" failed at the beginning. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25CIFS: Use common error handling code in smb2_ioctl_query_info()Markus Elfring
Move the same error code assignments so that such exception handling can be better reused at the end of this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25CIFS: Use memdup_user() rather than duplicating its implementationMarkus Elfring
Reuse existing functionality from memdup_user() instead of keeping duplicate source code. Generated by: scripts/coccinelle/api/memdup_user.cocci Fixes: f5b05d622a3e99e6a97a189fe500414be802a05c ("cifs: add IOCTL for QUERY_INFO passthrough to userspace") Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: smbd: Return -ECONNABORTED when trasnport is not in connected stateLong Li
The transport should return this error so the upper layer will reconnect. Signed-off-by: Long Li <longli@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: smbd: Add messages on RDMA session destroy and reconnectionLong Li
Log these activities to help production support. Signed-off-by: Long Li <longli@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: smbd: Return -EINVAL when the number of iovs exceeds SMBDIRECT_MAX_SGELong Li
While it's not friendly to fail user processes that issue more iovs than we support, at least we should return the correct error code so the user process gets a chance to retry with smaller number of iovs. Signed-off-by: Long Li <longli@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: smbd: Invalidate and deregister memory registration on re-send for ↵Long Li
direct I/O On re-send, there might be a reconnect and all prevoius memory registrations need to be invalidated and deregistered. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: Don't display RDMA transport on reconnectLong Li
On reconnect, the transport data structure is NULL and its information is not available. Signed-off-by: Long Li <longli@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25CIFS: remove set but not used variables 'cinode' and 'netfid'YueHaibing
Fixes gcc '-Wunused-but-set-variable' warning: fs/cifs/file.c: In function 'cifs_flock': fs/cifs/file.c:1704:8: warning: variable 'netfid' set but not used [-Wunused-but-set-variable] fs/cifs/file.c:1702:24: warning: variable 'cinode' set but not used [-Wunused-but-set-variable] Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: add support for flockSteve French
The flock system call locks the whole file rather than a byte range and so is currently emulated by various other file systems by simply sending a byte range lock for the whole file. Add flock handling for cifs.ko in similar way. xfstest generic/504 passes with this as well Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2019-11-25cifs: remove unused variable 'sid_user'YueHaibing
fs/cifs/cifsacl.c:43:30: warning: sid_user defined but not used [-Wunused-const-variable=] It is never used, so remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-25cifs: rename a variable in SendReceive()Dan Carpenter
Smatch gets confused because we sometimes refer to "server->srv_mutex" and sometimes to "sess->server->srv_mutex". They refer to the same lock so let's just make this consistent. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-11-24 The following pull-request contains BPF updates for your *net-next* tree. We've added 27 non-merge commits during the last 4 day(s) which contain a total of 50 files changed, 2031 insertions(+), 548 deletions(-). The main changes are: 1) Optimize bpf_tail_call() from retpoline-ed indirect jump to direct jump, from Daniel. 2) Support global variables in libbpf, from Andrii. 3) Cleanup selftests with BPF_TRACE_x() macro, from Martin. 4) Fix devmap hash, from Toke. 5) Fix register bounds after 32-bit conditional jumps, from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-24Merge branch 'for-upstream' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2019-11-24 Here's one last bluetooth-next pull request for the 5.5 kernel: - Fix BDADDR_PROPERTY & INVALID_BDADDR quirk handling - Added support for BCM4334B0 and BCM4335A0 controllers - A few other smaller fixes related to locking and memory leaks ==================== Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-24ax88179_178a: add ethtool_op_get_ts_info()Andreas K. Besslein
This enables the use of SW timestamping. ax88179_178a uses the usbnet transmit function usbnet_start_xmit() which implements software timestamping. ax88179_178a overrides ethtool_ops but missed to set .get_ts_info. This caused SOF_TIMESTAMPING_TX_SOFTWARE capability to be not available. Signed-off-by: Andreas K. Besslein <besslein.andreas@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-24Merge branch 'mlxsw-Two-small-updates'Jakub Kicinski
Ido Schimmel says: ==================== mlxsw: Two small updates Patch #1 from Petr handles a corner case in GRE tunnel offload. Patch #2 from Amit fixes a recent issue where the driver was programming the device to use an adjacency index (for a nexthop) that was not properly initialized. ==================== Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-24mlxsw: spectrum_router: Fix use of uninitialized adjacency indexAmit Cohen
When mlxsw_sp_adj_discard_write() is called for the first time, the value stored in 'mlxsw_sp->router->adj_discard_index' is invalid, as indicated by 'mlxsw_sp->router->adj_discard_index_valid' being set to 'false'. In this case, we should not use the value initially stored in 'mlxsw_sp->router->adj_discard_index' (0) and instead use the value allocated later in the function. Fixes: 983db6198f0d ("mlxsw: spectrum_router: Allocate discard adjacency entry when needed") Signed-off-by: Amit Cohen <amitc@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-24mlxsw: spectrum_router: After underlay moves, demote conflicting tunnelsPetr Machata
When a GRE tunnel is bound to an underlay netdevice and that netdevice is moved to a different VRF, that could cause two tunnels to have the same underlay local address in the same VRF. Linux in this situation dispatches the traffic according to the tunnel key (or lack thereof), but that cannot be offloaded to Spectrum devices. Detect this situation and unoffload the two impacted tunnels when it happens. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-24bpf: Simplify __bpf_arch_text_poke poke type handlingDaniel Borkmann
Given that we have BPF_MOD_NOP_TO_{CALL,JUMP}, BPF_MOD_{CALL,JUMP}_TO_NOP and BPF_MOD_{CALL,JUMP}_TO_{CALL,JUMP} poke types and that we also pass in old_addr as well as new_addr, it's a bit redundant and unnecessarily complicates __bpf_arch_text_poke() itself since we can derive the same from the *_addr that were passed in. Hence simplify and use BPF_MOD_{CALL,JUMP} as types which also allows to clean up call-sites. In addition to that, __bpf_arch_text_poke() currently verifies that text matches expected old_insn before we invoke text_poke_bp(). Also add a check on new_insn and skip rewrite if it already matches. Reason why this is rather useful is that it avoids making any special casing in prog_array_map_poke_run() when old and new prog were NULL and has the benefit that also for this case we perform a check on text whether it really matches our expectations. Suggested-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/fcb00a2b0b288d6c73de4ef58116a821c8fe8f2f.1574555798.git.daniel@iogearbox.net
2019-11-24bpf: Introduce BPF_TRACE_x helper for the tracing testsMartin KaFai Lau
For BPF_PROG_TYPE_TRACING, the bpf_prog's ctx is an array of u64. This patch borrows the idea from BPF_CALL_x in filter.h to convert a u64 to the arg type of the traced function. The new BPF_TRACE_x has an arg to specify the return type of a bpf_prog. It will be used in the future TCP-ops bpf_prog that may return "void". The new macros are defined in the new header file "bpf_trace_helpers.h". It is under selftests/bpf/ for now. It could be moved to libbpf later after seeing more upcoming non-tracing use cases. The tests are changed to use these new macros also. Hence, the k[s]u8/16/32/64 are no longer needed and they are removed from the bpf_helpers.h. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191123202504.1502696-1-kafai@fb.com
2019-11-24bpf: Add bpf_jit_blinding_enabled for !CONFIG_BPF_JITDaniel Borkmann
Add a definition of bpf_jit_blinding_enabled() when CONFIG_BPF_JIT is not set in order to fix a recent build regression: [...] CC kernel/bpf/verifier.o CC kernel/bpf/inode.o kernel/bpf/verifier.c: In function ‘fixup_bpf_calls’: kernel/bpf/verifier.c:9132:25: error: implicit declaration of function ‘bpf_jit_blinding_enabled’; did you mean ‘bpf_jit_kallsyms_enabled’? [-Werror=implicit-function-declaration] 9132 | bool expect_blinding = bpf_jit_blinding_enabled(prog); | ^~~~~~~~~~~~~~~~~~~~~~~~ | bpf_jit_kallsyms_enabled CC kernel/bpf/helpers.o CC kernel/bpf/hashtab.o [...] Fixes: d2e4c1e6c294 ("bpf: Constant map key tracking for prog array pokes") Reported-by: Jakub Sitnicki <jakub@cloudflare.com> Reported-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/40baf8f3507cac4851a310578edfb98ce73b5605.1574541375.git.daniel@iogearbox.net
2019-11-24Merge branch 'optimize-bpf_tail_call'Alexei Starovoitov
Daniel Borkmann says: ==================== This gets rid of indirect jumps for BPF tail calls whenever possible. The series adds emission for *direct* jumps for tail call maps in order to avoid the retpoline overhead from a493a87f38cf ("bpf, x64: implement retpoline for tail call") for situations that allow for it, meaning, for known constant keys at verification time which are used as index into the tail call map. See patch 7/8 for more general details. Thanks! v1 -> v2: - added more test cases - u8 ip_stable -> bool (Andrii) - removed bpf_map_poke_{un,}lock and simplified the code (Andrii) - added break into prog_array_map_poke_untrack since there's just one prog (Andrii) - fixed typo: for for in commit msg (Andrii) - reworked __bpf_arch_text_poke (Andrii) - added subtests, and comment on tests themselves, NULL-NULL transistion (Andrii) - in constant map key tracking I've moved the map_poke_track callback to once we've finished creating the poke tab as otherwise concurrent access from tail call map would blow up (since we realloc the table) rfc -> v1: - Applied Alexei's and Andrii's feeback from https://lore.kernel.org/bpf/cover.1573779287.git.daniel@iogearbox.net/T/#t ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-11-24bpf, testing: Add various tail call test casesDaniel Borkmann
Add several BPF kselftest cases for tail calls which test the various patch directions, and that multiple locations are patched in same and different programs. # ./test_progs -n 45 #45/1 tailcall_1:OK #45/2 tailcall_2:OK #45/3 tailcall_3:OK #45/4 tailcall_4:OK #45/5 tailcall_5:OK #45 tailcalls:OK Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED I've also verified the JITed dump after each of the rewrite cases that it matches expectations. Also regular test_verifier suite passes fine which contains further tail call tests: # ./test_verifier [...] Summary: 1563 PASSED, 0 SKIPPED, 0 FAILED Checked under JIT, interpreter and JIT + hardening. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/3d6cbecbeb171117dccfe153306e479798fb608d.1574452833.git.daniel@iogearbox.net
2019-11-24bpf, x86: Emit patchable direct jump as tail callDaniel Borkmann
Add initial code emission for *direct* jumps for tail call maps in order to avoid the retpoline overhead from a493a87f38cf ("bpf, x64: implement retpoline for tail call") for situations that allow for it, meaning, for known constant keys at verification time which are used as index into the tail call map. In case of Cilium which makes heavy use of tail calls, constant keys are used in the vast majority, only for a single occurrence we use a dynamic key. High level outline is that if the target prog is NULL in the map, we emit a 5-byte nop for the fall-through case and if not, we emit a 5-byte direct relative jmp to the target bpf_func + skipped prologue offset. Later during runtime, we patch these 5-byte nop/jmps upon tail call map update or deletions dynamically. Note that on x86-64 the direct jmp works as we reuse the same stack frame and skip prologue (as opposed to some other JIT implementations). One of the issues is that the tail call map slots can change at any given time even during JITing. Therefore, we have two passes: i) emit nops for all patchable locations during main JITing phase until we declare prog->jited = 1 eventually. At this point the image is stable, not public yet and with all jmps disabled. While JITing, we collect additional info like poke->ip in order to remember the patch location for later modifications. In ii) bpf_tail_call_direct_fixup() walks over the progs poke_tab, locks the tail call maps poke_mutex to prevent from parallel updates and patches in the right locations via __bpf_arch_text_poke(). Note, the main bpf_arch_text_poke() cannot be used at this point since we're not yet exposed to kallsyms. For the update we use plain memcpy() since the image is not public and still in read-write mode. After patching, we activate that poke entry through poke->ip_stable. Meaning, at this point any tail call map updates/deletions are not going to ignore that poke entry anymore. Then, bpf_arch_text_poke() might still occur on the read-write image until we finally locked it as read-only. Both modifications on the given image are under text_mutex to avoid interference with each other when update requests come in in parallel for different tail call maps (current one we have locked in JIT and different one where poke->ip_stable was already set). Example prog: # ./bpftool p d x i 1655 0: (b7) r3 = 0 1: (18) r2 = map[id:526] 3: (85) call bpf_tail_call#12 4: (b7) r0 = 1 5: (95) exit Before: # ./bpftool p d j i 1655 0xffffffffc076e55c: 0: nopl 0x0(%rax,%rax,1) 5: push %rbp 6: mov %rsp,%rbp 9: sub $0x200,%rsp 10: push %rbx 11: push %r13 13: push %r14 15: push %r15 17: pushq $0x0 _ 19: xor %edx,%edx |_ index (arg 3) 1b: movabs $0xffff88d95cc82600,%rsi |_ map (arg 2) 25: mov %edx,%edx | index >= array->map.max_entries 27: cmp %edx,0x24(%rsi) | 2a: jbe 0x0000000000000066 |_ 2c: mov -0x224(%rbp),%eax | tail call limit check 32: cmp $0x20,%eax | 35: ja 0x0000000000000066 | 37: add $0x1,%eax | 3a: mov %eax,-0x224(%rbp) |_ 40: mov 0xd0(%rsi,%rdx,8),%rax |_ prog = array->ptrs[index] 48: test %rax,%rax | prog == NULL check 4b: je 0x0000000000000066 |_ 4d: mov 0x30(%rax),%rax | goto *(prog->bpf_func + prologue_size) 51: add $0x19,%rax | 55: callq 0x0000000000000061 | retpoline for indirect jump 5a: pause | 5c: lfence | 5f: jmp 0x000000000000005a | 61: mov %rax,(%rsp) | 65: retq |_ 66: mov $0x1,%eax 6b: pop %rbx 6c: pop %r15 6e: pop %r14 70: pop %r13 72: pop %rbx 73: leaveq 74: retq After; state after JIT: # ./bpftool p d j i 1655 0xffffffffc08e8930: 0: nopl 0x0(%rax,%rax,1) 5: push %rbp 6: mov %rsp,%rbp 9: sub $0x200,%rsp 10: push %rbx 11: push %r13 13: push %r14 15: push %r15 17: pushq $0x0 _ 19: xor %edx,%edx |_ index (arg 3) 1b: movabs $0xffff9d8afd74c000,%rsi |_ map (arg 2) 25: mov -0x224(%rbp),%eax | tail call limit check 2b: cmp $0x20,%eax | 2e: ja 0x000000000000003e | 30: add $0x1,%eax | 33: mov %eax,-0x224(%rbp) |_ 39: jmpq 0xfffffffffffd1785 |_ [direct] goto *(prog->bpf_func + prologue_size) 3e: mov $0x1,%eax 43: pop %rbx 44: pop %r15 46: pop %r14 48: pop %r13 4a: pop %rbx 4b: leaveq 4c: retq After; state after map update (target prog): # ./bpftool p d j i 1655 0xffffffffc08e8930: 0: nopl 0x0(%rax,%rax,1) 5: push %rbp 6: mov %rsp,%rbp 9: sub $0x200,%rsp 10: push %rbx 11: push %r13 13: push %r14 15: push %r15 17: pushq $0x0 19: xor %edx,%edx 1b: movabs $0xffff9d8afd74c000,%rsi 25: mov -0x224(%rbp),%eax 2b: cmp $0x20,%eax . 2e: ja 0x000000000000003e . 30: add $0x1,%eax . 33: mov %eax,-0x224(%rbp) |_ 39: jmpq 0xffffffffffb09f55 |_ goto *(prog->bpf_func + prologue_size) 3e: mov $0x1,%eax 43: pop %rbx 44: pop %r15 46: pop %r14 48: pop %r13 4a: pop %rbx 4b: leaveq 4c: retq After; state after map update (no prog): # ./bpftool p d j i 1655 0xffffffffc08e8930: 0: nopl 0x0(%rax,%rax,1) 5: push %rbp 6: mov %rsp,%rbp 9: sub $0x200,%rsp 10: push %rbx 11: push %r13 13: push %r14 15: push %r15 17: pushq $0x0 19: xor %edx,%edx 1b: movabs $0xffff9d8afd74c000,%rsi 25: mov -0x224(%rbp),%eax 2b: cmp $0x20,%eax . 2e: ja 0x000000000000003e . 30: add $0x1,%eax . 33: mov %eax,-0x224(%rbp) |_ 39: nopl 0x0(%rax,%rax,1) |_ fall-through nop 3e: mov $0x1,%eax 43: pop %rbx 44: pop %r15 46: pop %r14 48: pop %r13 4a: pop %rbx 4b: leaveq 4c: retq Nice bonus is that this also shrinks the code emission quite a bit for every tail call invocation. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/6ada4c1c9d35eeb5f4ecfab94593dafa6b5c4b09.1574452833.git.daniel@iogearbox.net