2015-11-02Merge branch 'bpf-persistent'David S. Miller
Daniel Borkmann says: ==================== BPF updates This set adds support for persistent maps/progs. Please see individual patches for further details. A man-page update to bpf(2) will be sent later on, as well as an iproute2 patch for support in tc. v1 -> v2: - Reworked most of patches 4 and 5 - Rebased to latest net-next ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02bpf: add sample usages for persistent maps/progsDaniel Borkmann
This patch adds a couple of stand-alone examples on how BPF_OBJ_PIN and BPF_OBJ_GET commands can be used.

Example with maps:

  # ./fds_example -F /sys/fs/bpf/m -P -m -k 1 -v 42
  bpf: map fd:3 (Success)
  bpf: pin ret:(0,Success)
  bpf: fd:3 u->(1:42) ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m -G -m -k 1
  bpf: get fd:3 (Success)
  bpf: fd:3 l->(1):42 ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m -G -m -k 1 -v 24
  bpf: get fd:3 (Success)
  bpf: fd:3 u->(1:24) ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m -G -m -k 1
  bpf: get fd:3 (Success)
  bpf: fd:3 l->(1):24 ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m2 -P -m
  bpf: map fd:3 (Success)
  bpf: pin ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m2 -G -m -k 1
  bpf: get fd:3 (Success)
  bpf: fd:3 l->(1):0 ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m2 -G -m
  bpf: get fd:3 (Success)

Example with progs:

  # ./fds_example -F /sys/fs/bpf/p -P -p
  bpf: prog fd:3 (Success)
  bpf: pin ret:(0,Success)
  bpf: sock:4 <- fd:3 attached ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/p -G -p
  bpf: get fd:3 (Success)
  bpf: sock:4 <- fd:3 attached ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/p2 -P -p -o ./sockex1_kern.o
  bpf: prog fd:5 (Success)
  bpf: pin ret:(0,Success)
  bpf: sock:3 <- fd:5 attached ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/p2 -G -p
  bpf: get fd:3 (Success)
  bpf: sock:4 <- fd:3 attached ret:(0,Success)

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02bpf: add support for persistent maps/progsDaniel Borkmann
This work adds support for "persistent" eBPF maps/programs. The term "persistent" means that maps/programs have a facility that lets them survive process termination. This is desired by various eBPF subsystem users. Just to name one example: tc classifier/action. Whenever tc parses the ELF object, extracts and loads maps/progs into the kernel, these file descriptors will be out of reach after the tc instance exits. So a subsequent tc invocation won't be able to access/relocate on this resource, and therefore maps cannot easily be shared, e.g. between the ingress and egress networking data path. The current workaround is that Unix domain sockets (UDS) need to be instrumented in order to pass the created eBPF map/program file descriptors to a third party management daemon through UDS' socket passing facility. This makes it a bit complicated to deploy shared eBPF maps or programs (programs e.g. for tail calls) among various processes. We've been brainstorming on how we could tackle this issue and various approaches have been tried out so far, which are described further in the reference below. The architecture we eventually ended up with is a minimal file system that can hold map/prog objects. The file system is a per mount namespace singleton, and the default mount point is /sys/fs/bpf/. Any subsequent mounts within a given namespace will point to the same instance. The file system allows for creating a user-defined directory structure. The objects for maps/progs are created/fetched through bpf(2) with two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor along with a pathname is being passed to bpf(2) that in turn creates (we call it eBPF object pinning) the file system nodes. Only the pathname is being passed to bpf(2) for getting a new BPF file descriptor to an existing node. The user can use that to access maps and progs later on, through bpf(2).
Removal of file system nodes is being managed through normal VFS functions such as unlink(2), etc. The file system code is kept to a very minimum and can be further extended later on. The next step I'm working on is to add dump eBPF map/prog commands to bpf(2), so that a specification from a given file descriptor can be retrieved. This can be used by things like CRIU but also applications can inspect the meta data after calling BPF_OBJ_GET. Big thanks also to Alexei and Hannes who significantly contributed in the design discussion that eventually let us end up with this architecture here. Reference: https://lkml.org/lkml/2015/10/15/925 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
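The two commands described in this commit can be exercised from user space roughly as follows. This is a minimal sketch, assuming installed kernel headers that already define BPF_OBJ_PIN and BPF_OBJ_GET in <linux/bpf.h>; the helper names sys_bpf(), bpf_obj_pin_path() and bpf_obj_get_path() are made up for illustration and are not part of the patch.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw bpf(2) syscall (hypothetical helper name). */
static int sys_bpf(int cmd, union bpf_attr *attr)
{
    return (int)syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/* Pin an existing map/prog fd at 'path', e.g. somewhere under /sys/fs/bpf/.
 * Returns 0 on success, -1 with errno set on failure. */
int bpf_obj_pin_path(int fd, const char *path)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)path;
    attr.bpf_fd = (uint32_t)fd;
    return sys_bpf(BPF_OBJ_PIN, &attr);
}

/* Fetch a new fd for the object pinned at 'path'; only the pathname is passed.
 * Returns the new fd, or -1 with errno set on failure. */
int bpf_obj_get_path(const char *path)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)path;
    return sys_bpf(BPF_OBJ_GET, &attr);
}
```

A process that created a map would call bpf_obj_pin_path(map_fd, "/sys/fs/bpf/m"), and a later, unrelated process would call bpf_obj_get_path("/sys/fs/bpf/m") to obtain its own descriptor. Removing the node is then a plain unlink(2) on the same path.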
2015-11-02bpf: consolidate bpf_prog_put{, _rcu} dismantle pathsDaniel Borkmann
We currently have duplicated cleanup code in bpf_prog_put() and bpf_prog_put_rcu() cleanup paths. Back then we decided that it was not worth it to make it a common helper called by both, but with the recent addition of resource charging, we could have avoided the fix in commit ac00737f4e81 ("bpf: Need to call bpf_prog_uncharge_memlock from bpf_prog_put") if we would have had only a single, common path. We can simplify it further by assigning aux->prog only once during allocation time. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02bpf: align and clean bpf_{map,prog}_get helpersDaniel Borkmann
Add a bpf_map_get() function that we're going to use later on and align/clean the remaining helpers a bit so that we have them a bit more consistent: - __bpf_map_get() and __bpf_prog_get() that both work on the fd struct, check whether the descriptor is eBPF and return the pointer to the map/prog stored in the private data. Also, we can return f.file->private_data directly, the function signature is enough of a documentation already. - bpf_map_get() and bpf_prog_get() that both work on u32 user fd, call their respective __bpf_map_get()/__bpf_prog_get() variants, and take a reference. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02bpf: abstract anon_inode_getfd invocationsDaniel Borkmann
Since we're going to use anon_inode_getfd() invocations in more than just the current places, make a helper function for both, so that we only need to pass a map/prog pointer to the helper itself in order to get a fd. The new helpers are called bpf_map_new_fd() and bpf_prog_new_fd(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02net: fix percpu memory leaksEric Dumazet
This patch fixes the following problems: 1) percpu_counter_init() can return an error, therefore init_frag_mem_limit() must propagate this error so that inet_frags_init_net() can do the same up to its callers. 2) If ip[46]_frags_ns_ctl_register() fails, we must unwind properly and free the percpu_counter. Without this fix, we leave a freed object in the percpu_counters global list (if CONFIG_HOTPLUG_CPU), leading to crashes. This bug was detected by KASAN and the syzkaller tool (http://github.com/google/syzkaller) Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
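The two points above (propagate the init error, unwind on a later failure) follow a common pattern that can be modelled in user space. The sketch below is not the kernel code: struct counter, struct frag_ns and every function name are hypothetical stand-ins, and the fail_* parameters exist only to simulate allocation failures.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the per-cpu counter and the per-netns state. */
struct counter { long *slots; };
struct frag_ns { struct counter mem; int ctl_registered; };

static int counter_init(struct counter *c, int fail)
{
    if (fail)
        return -ENOMEM;
    c->slots = calloc(4, sizeof(*c->slots));
    return c->slots ? 0 : -ENOMEM;
}

static void counter_destroy(struct counter *c)
{
    free(c->slots);
    c->slots = NULL;
}

static int ctl_register(struct frag_ns *ns, int fail)
{
    if (fail)
        return -ENOMEM;
    ns->ctl_registered = 1;
    return 0;
}

/* Initialise the namespace state; on any failure nothing stays allocated. */
int frag_ns_init(struct frag_ns *ns, int fail_counter, int fail_ctl)
{
    int err;

    ns->ctl_registered = 0;
    ns->mem.slots = NULL;

    err = counter_init(&ns->mem, fail_counter);
    if (err)
        return err;                 /* 1) propagate the init error */

    err = ctl_register(ns, fail_ctl);
    if (err)
        counter_destroy(&ns->mem);  /* 2) unwind: free the counter */
    return err;
}
```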
2015-11-02net: avoid NULL deref in inet_ctl_sock_destroy()Eric Dumazet
Under low memory conditions, tcp_sk_init() and icmp_sk_init() can both iterate on all possible cpus and call inet_ctl_sock_destroy(), possibly with a NULL pointer. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-03drm/exynos/gem: remove DMA-mapping hacks used for constructing page arrayMarek Szyprowski
Exynos GEM objects contain an array of pointers to the pages that make up the allocated buffer. Until now the code used some hacks (like relying on DMA-mapping internal structures or using the ARM-specific dma_to_pfn helper) to build this array. This patch fixes this by adding a proper call to dma_get_sgtable_attrs() and using the acquired scatter-list to construct the needed array. This approach is more portable (it also works on ARM64) and finally fixes the layering violation that was present in this code. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03ARM: exynos_defconfig: enable Exynos DRM Mixer driverAndrzej Hajda
The Mixer driver is selected by the CONFIG_DRM_EXYNOS_HDMI option. Since Exynos5433 HDMI does not require Mixer, there will be separate options to select Mixer and HDMI. Adding the new option to defconfig before Kconfig preserves bisectability. Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Reviewed-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Acked-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03drm/exynos: simplify Kconfig component namesAndrzej Hajda
Many Exynos DRM sub-options mention Exynos DRM in their titles. This is redundant and can be safely shortened. The patch additionally makes some entries more descriptive. Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03drm/exynos: re-arrange Kconfig entriesAndrzej Hajda
The Exynos DRM driver has quite a large number of components and options. The patch re-arranges them into three logical groups: - CRTCs, - Encoders and Bridges, - Sub-drivers. This should make the driver options clearer. Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03drm/exynos: abstract out common dependencyAndrzej Hajda
All options depend on DRM_EXYNOS, so the dependency can be moved to an enclosing if clause. Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03drm/exynos: separate Mixer and HDMI driversAndrzej Hajda
The latest Exynos SoCs do not have Mixer IP, but they still have HDMI IP. Their drivers should be configurable separately. Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03drm/exynos/mixer: replace direct cross-driver call with drm mode validationAndrzej Hajda
The HDMI driver directly called a function from the MIXER driver to invalidate modes not supported by MIXER. The patch replaces the hack with a proper .atomic_check callback. Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03drm/exynos: add atomic_check callback to exynos_crtcAndrzej Hajda
Some CRTCs need mode validation; this patch adds the necessary callback to the Exynos DRM framework. It is called from DRM core via the atomic_check helper for drm_crtc. Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03drm/exynos/decon5433: add support for DECON-TVAndrzej Hajda
DECON-TV IP is responsible for generating the video stream that is transferred to HDMI IP. It is almost fully compatible with DECON IP. The patch is based on the initial work of Hyungwon Hwang. Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03drm/exynos/decon5433: remove duplicated initializationAndrzej Hajda
Field .commit is already initialized a few lines above. Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03drm/exynos/decon5433: merge different flag fieldsAndrzej Hajda
Driver uses four different fields for internal flags. They can be merged into one. Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03drm/exynos/decon5433: add function to set particular register bitsAndrzej Hajda
The driver often sets only particular bits of configuration registers. Using a separate function for this simplifies the code. Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
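A helper of the kind described here is typically a read-modify-write on the register. The following is only a sketch of that idea: the names set_bits() and reg_set_bits() are hypothetical, and the register is modelled as a plain variable instead of memory-mapped I/O.

```c
#include <stdint.h>

/* Replace only the bits selected by 'mask' with the matching bits of 'val'. */
static inline uint32_t set_bits(uint32_t reg, uint32_t mask, uint32_t val)
{
    return (reg & ~mask) | (val & mask);
}

/* Read-modify-write on a register; here 'reg' points at ordinary memory,
 * a real driver would go through its MMIO read/write accessors instead. */
void reg_set_bits(volatile uint32_t *reg, uint32_t mask, uint32_t val)
{
    *reg = set_bits(*reg, mask, val);
}
```

Callers then state only which bits they care about, and the bits outside the mask are preserved.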
2015-11-03drm/exynos/decon5433: fix timing registers writesAndrzej Hajda
All timing registers should contain values decreased by one. Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
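As an illustration of the "decreased by one" convention, the sketch below packs two timing values into one register word. The field layout (width in the low 16 bits, height in the high 16 bits) and the name encode_timing() are assumptions made for this example and do not describe the actual DECON register map.

```c
#include <stdint.h>

/* Hypothetical layout: bits 0-15 hold (width - 1), bits 16-31 hold (height - 1).
 * Each field is masked to 16 bits so a zero input cannot spill into the other. */
uint32_t encode_timing(uint16_t width, uint16_t height)
{
    uint32_t w = (uint32_t)(width - 1) & 0xFFFFu;
    uint32_t h = (uint32_t)(height - 1) & 0xFFFFu;

    return (h << 16) | w;
}
```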
2015-11-03drm/exynos/decon5433: add PCLK clockAndrzej Hajda
PCLK clock is used by DECON IP. The patch also replaces magic number with number of clocks in array definition. Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03Merge branch 'xfs-dax-updates' into for-nextDave Chinner
2015-11-03Merge branch 'xfs-misc-fixes-for-4.4-2' into for-nextDave Chinner
2015-11-03xfs: optimise away log forces on timestamp updates for fdatasyncDave Chinner
xfs: timestamp updates cause excessive fdatasync log traffic Sage Weil reported that a ceph test workload was writing to the log on every fdatasync during an overwrite workload. Event tracing showed that the only metadata modification being made was the timestamp updates during the write(2) syscall, but fdatasync(2) is supposed to ignore them. The key observation was that the transactions in the log all looked like this: INODE: #regs: 4 ino: 0x8b flags: 0x45 dsize: 32 They contained a flags field of 0x45 or 0x85, and had data and attribute forks following the inode core. This means that the timestamp updates were triggering dirty relogging of previously logged parts of the inode that hadn't yet been flushed back to disk. There are two parts to this problem. The first is that XFS relogs dirty regions in subsequent transactions, so it carries around the fields that have been dirtied since the last time the inode was written back to disk, not since the last time the inode was forced into the log. The second part is that on v5 filesystems, the inode change count update during inode dirtying also sets the XFS_ILOG_CORE flag, so on v5 filesystems this makes a timestamp update dirty the entire inode. As a result when fdatasync is run, it looks at the dirty fields in the inode, and sees more than just the timestamp flag, even though the only metadata change since the last fdatasync was just the timestamps. Hence we force the log on every subsequent fdatasync even though it is not needed. To fix this, add a new field to the inode log item that tracks changes since the last time fsync/fdatasync forced the log to flush the changes to the journal. This field is updated when we dirty the inode, but we do it before updating the change count so it does not carry the "core dirty" flag from timestamp updates. The field is zeroed when the inode is marked clean (due to writeback/freeing) or when an fsync/fdatasync forces the log.
Hence if we only dirty the timestamps on the inode between fsync/fdatasync calls, the fdatasync will not trigger another log force.

Over 100 runs of the test program:

Ext4 baseline:
  runtime: 1.63s +/- 0.24s
  avg lat: 1.59ms +/- 0.24ms
  iops: ~2000

XFS, vanilla kernel:
  runtime: 2.45s +/- 0.18s
  avg lat: 2.39ms +/- 0.18ms
  log forces: ~400/s
  iops: ~1000

XFS, patched kernel:
  runtime: 1.49s +/- 0.26s
  avg lat: 1.46ms +/- 0.25ms
  log forces: ~30/s
  iops: ~1500

Reported-by: Sage Weil <sage@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
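The bookkeeping described in this commit can be modelled in a few lines of user-space C. This is a simplified sketch under stated assumptions, not the XFS implementation: the flag values, struct inode_log_item, log_inode() and fdatasync_needs_log_force() are all hypothetical names, and the v5 change-count behaviour is reduced to a boolean.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values; the real XFS_ILOG_* constants differ. */
#define ILOG_CORE      0x1u
#define ILOG_TIMESTAMP 0x2u

struct inode_log_item {
    uint32_t fields;        /* dirtied since the inode was last written back */
    uint32_t fsync_fields;  /* dirtied since the last fsync/fdatasync log force */
};

/* Dirty the inode. The caller's flags are recorded for fsync tracking
 * before the change-count update adds the "core dirty" flag. */
void log_inode(struct inode_log_item *ili, uint32_t flags, bool v5_fs)
{
    ili->fsync_fields |= flags;
    if (v5_fs)
        flags |= ILOG_CORE;   /* change count update dirties the inode core */
    ili->fields |= flags;
}

/* Returns true when fdatasync has to force the log, i.e. when something
 * other than timestamps changed since the last forced flush. */
bool fdatasync_needs_log_force(struct inode_log_item *ili)
{
    bool force = (ili->fsync_fields & ~ILOG_TIMESTAMP) != 0;

    if (force)
        ili->fsync_fields = 0;  /* the log force flushes the tracked changes */
    return force;
}
```

In this model a timestamp-only update on a v5 filesystem still sets the core flag in fields, yet fsync_fields holds only the timestamp flag, so the following fdatasync skips the log force.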
2015-11-03xfs: don't leak uuid table on rmmodDarrick J. Wong
Don't leak the UUID table when the module is unloaded. (Found with kmemleak.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: invalidate cached acl if set via ioctlAndreas Gruenbacher
Setting or removing the "SGI_ACL_[FILE|DEFAULT]" attributes via the XFS_IOC_ATTRMULTI_BY_HANDLE ioctl completely bypasses the POSIX ACL infrastructure, like setting the "trusted.SGI_ACL_[FILE|DEFAULT]" xattrs did until commit 6caa1056. Similar to that commit, invalidate cached acls when setting/removing them via the ioctl as well. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: Plug memory leak in xfs_attrmulti_attr_setAndreas Gruenbacher
When setting attributes via XFS_IOC_ATTRMULTI_BY_HANDLE, the user-space buffer is copied into a new kernel-space buffer via memdup_user; that buffer then isn't freed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: Validate the length of on-disk ACLsAndreas Gruenbacher
In xfs_acl_from_disk, instead of trusting that xfs_acl.acl_cnt is correct, make sure that the length of the attributes is correct as well. Also, turn the aclp parameter into a const pointer. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
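The idea of checking the buffer length against the entry count, instead of trusting the count alone, can be sketched as follows. The structures and the function disk_acl_len_valid() are hypothetical simplifications; the real on-disk XFS ACL format and its endianness handling are not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk layout: a count header followed by 'count'
 * fixed-size entries. Byte-order conversion is omitted for brevity. */
struct disk_acl_entry {
    uint32_t tag;
    uint32_t id;
    uint16_t perm;
    uint16_t pad;
};

struct disk_acl {
    uint32_t count;
    struct disk_acl_entry entries[];
};

/* Accept the blob only if 'len' matches exactly what 'count' implies. */
bool disk_acl_len_valid(const struct disk_acl *acl, size_t len,
                        uint32_t max_entries)
{
    if (len < sizeof(*acl))
        return false;             /* too short to even hold the count */
    if (acl->count > max_entries)
        return false;             /* count exceeds the allowed maximum */
    return len == sizeof(*acl) +
                  (size_t)acl->count * sizeof(acl->entries[0]);
}
```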
2015-11-03xfs: invalidate cached acl if set directly via xattrBrian Foster
ACLs are stored as extended attributes of the inode to which they apply. XFS converts the standard "system.posix_acl_[access|default]" attribute names used to control ACLs to "trusted.SGI_ACL_[FILE|DEFAULT]" as stored on-disk. These xattrs are directly exposed in on-disk format via getxattr/setxattr, without any ACL aware code in the path to perform validation, etc. This is partly historical and supports backup/restore applications such as xfsdump to back up and restore the binary blob that represents ACLs as-is. Andreas reports that the ACLs observed via the getfacl interface are not consistent when ACLs are set directly via the setxattr path. This occurs because the ACLs are cached in-core against the inode and the xattr path has no knowledge that the operation relates to ACLs. Update the xattr set codepath to trap writes of the special XFS ACL attributes and invalidate the associated cached ACL when this occurs. This ensures that the correct ACLs are used on a subsequent operation through the actual ACL interface. Note that this does not update or add support for setting the ACL xattrs directly beyond the restore use case that requires a correctly formatted binary blob and to restore a consistent i_mode at the same time. It is still possible for a root user to set an invalid or inconsistent (with i_mode) ACL blob on-disk and potentially cause corruption. [ With fixes from Andreas Gruenbacher. ] Reported-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: xfs_filemap_pmd_fault treats read faults as write faultsDave Chinner
The code initially committed didn't have the same checks for write faults as the dax_pmd_fault code and hence treats all faults as write faults. We can get read faults through this path because there is no pmd_mkwrite path for write faults similar to the normal page fault path. Hence we need to ensure that we only do c/mtime updates on write faults, and freeze protection is unnecessary for read faults. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: add ->pfn_mkwrite support for DAXDave Chinner
->pfn_mkwrite support is needed so that when a page with allocated backing store takes a write fault we can check that the fault has not raced with a truncate and is pointing to a region beyond the current end of file. This also allows us to update the timestamp on the inode, too, which fixes a generic/080 failure. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: DAX does not use IO completion callbacksDave Chinner
For DAX, we are now doing block zeroing during allocation. This means we no longer need a special DAX fault IO completion callback to do unwritten extent conversion. Because mmap never extends the file size (it SEGVs the process) we don't need a callback to update the file size, either. Hence we can remove the completion callbacks from the __dax_fault and __dax_mkwrite calls. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: Don't use unwritten extents for DAXDave Chinner
DAX has a page fault serialisation problem with block allocation. Because it allows concurrent page faults and does not have a page lock to serialise faults to the same page, it can get two concurrent faults to the page that race. When two read faults race, this isn't a huge problem as the data underlying the page is not changing and so "detect and drop" works just fine. The issues are to do with write faults. When two write faults occur, we serialise block allocation in get_blocks() so only one fault will allocate the extent. It will, however, be marked as an unwritten extent, and that is where the problem lies - the DAX fault code cannot differentiate between a block that was just allocated and a block that was preallocated and needs zeroing. The result is that both write faults end up zeroing the block and attempting to convert it back to written. The problem is that the first fault can zero and convert before the second fault starts zeroing, resulting in the zeroing for the second fault overwriting the data that the first fault wrote with zeros. The second fault then attempts to convert the unwritten extent, which is then a no-op because it's already written. Data loss occurs as a result of this race. Because there is no sane locking construct in the page fault code that we can use for serialisation across the page faults, we need to ensure block allocation and zeroing occurs atomically in the filesystem. This means we can still take concurrent page faults and the only time they will serialise is in the filesystem mapping/allocation callback. The page fault code will always see written, initialised extents, so we will be able to remove the unwritten extent handling from the DAX code when all filesystems are converted. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: introduce BMAPI_ZERO for allocating zeroed extentsDave Chinner
To enable DAX to do atomic allocation of zeroed extents, we need to drive the block zeroing deep into the allocator. Because xfs_bmapi_write() can return merged extents on allocation that were only partially allocated (i.e. requested range spans allocated and hole regions, allocation into the hole was contiguous), we cannot zero the extent returned from xfs_bmapi_write() as that can overwrite existing data with zeros. Hence we have to drive the extent zeroing into the allocation code, prior to where we merge the extents into the BMBT and return the resultant map. This means we need to propagate this need down to the xfs_alloc_vextent() and issue the block zeroing at this point. While this functionality is being introduced for DAX, there is no reason why it is specific to DAX - we can pre-zero blocks during the allocation transaction on any type of device. It's just slow (and usually slower than unwritten allocation and conversion) on traditional block devices so doesn't tend to get used. We can, however, hook hardware zeroing optimisations via sb_issue_zeroout() to this operation, so it may be useful in future and hence the "allocate zeroed blocks" API needs to be implementation neutral. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: fix inode size update overflow in xfs_map_direct()Dave Chinner
Both direct IO and DAX pass an offset and count into get_blocks that will overflow an s64 variable when an IO goes into the last supported block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen from the tracing: xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096 xfs_gbmap_direct: [...] offset 0x7ffffffffffff000 count 4096 xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096 0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that overflows the s64 offset and we fail to detect the need for a filesize update and an ioend is not allocated. This is *mostly* avoided for direct IO because such extending IOs occur with full block allocation, and so the "IS_UNWRITTEN()" check still evaluates as true and we get an ioend that way. However, doing single sector extending IOs to this last block will expose the fact that file size updates will not occur after the first allocating direct IO as the overflow will then be exposed. There is one further complexity: the DAX page fault path also exposes the same issue in block allocation. However, page faults cannot extend the file size, so in this case we want to allocate the block but do not want to allocate an ioend to enable file size update at IO completion. Hence we now need to distinguish between the direct IO path allocation and dax fault path allocation to avoid leaking ioend structures. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
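The arithmetic in this commit message (0x7ffffffffffff000 + 4096 no longer fits in a signed 64-bit value) can be checked without performing the overflowing addition itself. The sketch below shows one generic way to do that; it is not necessarily how the patch resolves the problem, and end_offset_overflows() is a name invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if offset + count cannot be represented in a signed 64-bit value.
 * Assumes offset >= 0 and count >= 0, so INT64_MAX - offset cannot wrap. */
bool end_offset_overflows(int64_t offset, int64_t count)
{
    return count > INT64_MAX - offset;
}
```

For the traced IO, INT64_MAX - 0x7ffffffffffff000 is 4095, so a count of 4096 is reported as overflowing, while the same count one block earlier is not.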
2015-11-02Documentation: add new description of path-name lookup.Neil Brown
This document is based on three recent lwn.net articles. Some of the introductory material and linkage between articles has been removed, and some time-based descriptions have been revised. Also all links to code have been removed as the code is very close by. Contains corrections and improvements from Randy Dunlap <rdunlap@infradead.org>. Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2015-11-02Documentation/vm/slub.txt: document slabinfo-gnuplot.shSergey Senozhatsky
Add documentation on how to use slabinfo-gnuplot.sh script. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2015-11-02Doc: ABI/stable: Fix typo in ABI/stableMasanari Iida
This patch fixes some spelling typos in Documentation/ABI/stable. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2015-11-02Merge tag 'regmap-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmapLinus Torvalds
Pull regmap updates from Mark Brown: "Quite a few new features for regmap this time, mostly expanding things around the edges of the existing functionality to cover more devices rather than things with wide applicability: - Support for offload of the update_bits() operation to hardware where devices implement bit level access. - Support for a few extra operations that need scratch buffers on fast_io devices where we can't sleep. - Expanded the feature set of regmap_irq to cope with some extra register layouts. - Cleanups to the debugfs code" * tag 'regmap-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: regmap: Allow installing custom reg_update_bits function regmap: debugfs: simplify regmap_reg_ranges_read_file() slightly regmap: debugfs: use memcpy instead of snprintf regmap: debugfs: use snprintf return value in regmap_reg_ranges_read_file() regmap: Add generic macro to define regmap_irq regmap: debugfs: Remove scratch buffer for register length calculation regmap: irq: add ack_invert flag for chips using cleared bits as ack regmap: irq: add support for chips who have separate unmask registers regmap: Allocate buffers with GFP_ATOMIC when fast_io == true
2015-11-03rtc: rtctest: enabling UIE for a chip that doesn't support it returns EINVALUwe Kleine-König
Calling ioctl(..., RTC_UIE_ON, ...) without CONFIG_RTC_INTF_DEV_UIE_EMUL either ends in rtc_update_irq_enable if rtc->uie_unsupported is true or in __rtc_set_alarm in the if (!rtc->ops->set_alarm) branch. In both cases the return value is -EINVAL. So check for that one instead of ENOTTY. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
2015-11-03rtc: pcf2127: make module license match the file headerUwe Kleine-König
The header of the pcf2127 driver specifies GPL v2 only as license, so use "GPL v2" as module license specifier instead of "GPL" as the latter means "GNU Public License v2 or later". Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
2015-11-02tracepoints: Fix documentation of RCU lockdep checksMathieu Desnoyers
The documentation on top of __DECLARE_TRACE() does not match its implementation since the condition check has been added to the RCU lockdep checks. Update the documentation to match its implementation. Link: http://lkml.kernel.org/r/1446504164-21563-1-git-send-email-mathieu.desnoyers@efficios.com CC: Dave Hansen <dave@sr71.net> Fixes: a05d59a56733 "tracing: Add condition check to RCU lockdep checks" Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02libceph: clear msg->con in ceph_msg_release() onlyIlya Dryomov
The following bit in ceph_msg_revoke_incoming() is unsafe: struct ceph_connection *con = msg->con; if (!con) return; mutex_lock(&con->mutex); <more msg->con use> There is nothing preventing con from getting destroyed right after msg->con test. One easy way to reproduce this is to disable message signing only on the server side and try to map an image. The system will go into a libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are likely to end up with a random GP fault if the reset handler executes "within" ceph_msg_revoke_incoming(): <yet another reply w/o a signature> ... <Ctrl-C> rbd_obj_request_end ceph_osdc_cancel_request __unregister_request ceph_osdc_put_request ceph_msg_revoke_incoming ... osd_reset __kick_osd_requests __reset_osd remove_osd ceph_con_close reset_connection <clear con->in_msg->con> <put con ref> put_osd <free osd/con> <msg->con use> <-- !!! If ceph_msg_revoke_incoming() executes "before" the reset handler, osd/con will be leaked because ceph_msg_revoke_incoming() clears con->in_msg but doesn't put con ref, while reset_connection() only puts con ref if con->in_msg != NULL. The current msg->con scheme was introduced by commits 38941f8031bf ("libceph: have messages point to their connection") and 92ce034b5a74 ("libceph: have messages take a connection reference"), which defined when messages get associated with a connection and when that association goes away. Part of the problem is that this association is supposed to go away in much too many places; closing this race entirely requires either a rework of the existing or an addition of a new layer of synchronization. 
In lieu of that, we can make it *much* less likely to hit by disassociating messages only on their destruction and resend through a different connection. This makes the code simpler and is probably a good thing to do regardless - this patch adds a msg_con_set() helper which is called from only three places: ceph_con_send() and ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02libceph: add nocephx_sign_messages optionIlya Dryomov
Support for message signing was merged into 3.19, along with nocephx_require_signatures option. But, all that option does is allow the kernel client to talk to clusters that don't support MSG_AUTH feature bit. That's pretty useless, given that it's been supported since bobtail. Meanwhile, if one disables message signing on the server side with "cephx sign messages = false", it becomes impossible to use the kernel client since it expects messages to be signed if MSG_AUTH was negotiated. Add nocephx_sign_messages option to support this use case. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02libceph: stop duplicating client fields in messengerIlya Dryomov
supported_features and required_features serve no purpose at all, while nocrc and tcp_nodelay belong to ceph_options::flags. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02libceph: drop authorizer check from cephx msg signing routinesIlya Dryomov
I don't see a way for auth->authorizer to be NULL in ceph_x_sign_message() or ceph_x_check_message_signature(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02libceph: msg signing callouts don't need con argumentIlya Dryomov
We can use msg->con instead - at the point we sign an outgoing message or check the signature on the incoming one, msg->con is always set. We wouldn't know how to sign a message without an associated session (i.e. msg->con == NULL) and being able to sign a message using an explicitly provided authorizer is of no use. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02libceph: evaluate osd_req_op_data() arguments only onceIoana Ciornei
This patch changes the osd_req_op_data() macro to not evaluate arguments more than once in order to follow the kernel coding style. Signed-off-by: Ioana Ciornei <ciorneiioana@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> [idryomov@gmail.com: changelog, formatting] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
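Why multiple evaluation of macro arguments matters is easiest to see with a generic example. The sketch below does not reproduce osd_req_op_data(); MAX_UNSAFE, MAX_ONCE, calls and next_value() are illustrative names, and the single-evaluation variant relies on the GNU C statement-expression and __typeof__ extensions (GCC/Clang).

```c
/* Evaluates 'a' and 'b' twice: side effects in the arguments run twice. */
#define MAX_UNSAFE(a, b) ((a) > (b) ? (a) : (b))

/* Evaluates each argument exactly once by copying it into a local first. */
#define MAX_ONCE(a, b) ({           \
    __typeof__(a) _a = (a);         \
    __typeof__(b) _b = (b);         \
    _a > _b ? _a : _b;              \
})

/* Helper with a visible side effect, used to count evaluations. */
static int calls;

static int next_value(void)
{
    return ++calls;
}
```

With calls reset to zero, MAX_ONCE(next_value(), 0) calls next_value() once, whereas MAX_UNSAFE(next_value(), 0) calls it twice and returns the value of the second call.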
2015-11-02ceph: make fsync() wait unsafe requests that created/modified inodeYan, Zheng
If we get an unsafe reply for a request that created/modified an inode, add the unsafe request to a list in the newly created/modified inode, so that fsync() can wait for these unsafe requests. Signed-off-by: Yan, Zheng <zyan@redhat.com>