summaryrefslogtreecommitdiff
path: root/drivers/block/zram/zram_drv.h
AgeCommit message (Collapse)Author
2021-07-02Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: "190 patches. Subsystems affected by this patch series: mm (hugetlb, userfaultfd, vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock, migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap, zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc, core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs, signals, exec, kcov, selftests, compress/decompress, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) ipc/util.c: use binary search for max_idx ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock ipc: use kmalloc for msg_queue and shmid_kernel ipc sem: use kvmalloc for sem_undo allocation lib/decompressors: remove set but not used variabled 'level' selftests/vm/pkeys: exercise x86 XSAVE init state selftests/vm/pkeys: refill shadow register after implicit kernel write selftests/vm/pkeys: handle negative sys_pkey_alloc() return code selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random kcov: add __no_sanitize_coverage to fix noinstr for all architectures exec: remove checks in __register_bimfmt() x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned hfsplus: report create_date to kstat.btime hfsplus: remove unnecessary oom message nilfs2: remove redundant continue statement in a while-loop kprobes: remove duplicated strong free_insn_page in x86 and s390 init: print out unknown kernel parameters checkpatch: do not complain about positive return values starting with EPOLL checkpatch: improve the indented label test checkpatch: scripts/spdxcheck.py now requires python3 ...
2021-07-01zram: move backing_dev under macro CONFIG_ZRAM_WRITEBACKYue Hu
backing_dev is never used when not enable CONFIG_ZRAM_WRITEBACK and it's introduced from writeback feature. So it's needless also affect readability in that case. Link: https://lkml.kernel.org/r/20210521060544.2385-1-zbestahu@gmail.com Signed-off-by: Yue Hu <huyue2@yulong.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-01block: move bd_mutex to struct gendiskChristoph Hellwig
Replace the per-block device bd_mutex with a per-gendisk open_mutex, thus simplifying locking wherever we deal with partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-16Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "Another series of killing more code than what is being added, again thanks to Christoph's relentless cleanups and tech debt tackling. This contains: - blk-iocost improvements (Baolin Wang) - part0 iostat fix (Jeffle Xu) - Disable iopoll for split bios (Jeffle Xu) - block tracepoint cleanups (Christoph Hellwig) - Merging of struct block_device and hd_struct (Christoph Hellwig) - Rework/cleanup of how block device sizes are updated (Christoph Hellwig) - Simplification of gendisk lookup and removal of block device aliasing (Christoph Hellwig) - Block device ioctl cleanups (Christoph Hellwig) - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig) - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig) - sbitmap improvements (Pavel Begunkov) - Hybrid polling fix (Pavel Begunkov) - bvec iteration improvements (Pavel Begunkov) - Zone revalidation fixes (Damien Le Moal) - blk-throttle limit fix (Yu Kuai) - Various little fixes" * tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits) blk-mq: fix msec comment from micro to milli seconds blk-mq: update arg in comment of blk_mq_map_queue blk-mq: add helper allocating tagset->tags Revert "block: Fix a lockdep complaint triggered by request queue flushing" nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class blk-mq: add new API of blk_mq_hctx_set_fq_lock_class block: disable iopoll for split bio block: Improve blk_revalidate_disk_zones() checks sbitmap: simplify wrap check sbitmap: replace CAS with atomic and sbitmap: remove swap_lock sbitmap: optimise sbitmap_deferred_clear() blk-mq: skip hybrid polling if iopoll doesn't spin blk-iocost: Factor out the base vrate change into a separate function blk-iocost: Factor out the active iocgs' state check into a separate function blk-iocost: Move the usage ratio calculation to the correct place blk-iocost: Remove unnecessary advance declaration blk-iocost: Fix some typos in comments blktrace: fix up a kerneldoc comment block: remove the request_queue to argument request based tracepoints ...
2020-12-15zram: add stat to gather incompressible pages since zram set upMinchan Kim
Currently, zram supports the stat via /sys/block/zram/mm_stat to represent how many of incompressible pages are stored at the moment but it couldn't show how many times incompressible pages were wrote down since zram set up. It's also good indication to see how zram is effective in the system. Link: https://lkml.kernel.org/r/20201130201907.1284910-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-01zram: do not call set_blocksizeChristoph Hellwig
set_blocksize is used by file systems to use their preferred buffer cache block size. Block drivers should not set it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08zram: idle writeback fixes and cleanupMinchan Kim
This patch includes some fixes and cleanup for idle-page writeback. 1. writeback_limit interface Now writeback_limit interface is rather conusing. For example, once writeback limit budget is exausted, admin can see 0 from /sys/block/zramX/writeback_limit which is same semantic with disable writeback_limit at this moment. IOW, admin cannot tell that zero came from disable writeback limit or exausted writeback limit. To make the interface clear, let's sepatate enable of writeback limit to another knob - /sys/block/zram0/writeback_limit_enable * before: while true : # to re-enable writeback limit once previous one is used up echo 0 > /sys/block/zram0/writeback_limit echo $((200<<20)) > /sys/block/zram0/writeback_limit .. .. # used up the writeback limit budget * new # To enable writeback limit, from the beginning, admin should # enable it. echo $((200<<20)) > /sys/block/zram0/writeback_limit echo 1 > /sys/block/zram/0/writeback_limit_enable while true : echo $((200<<20)) > /sys/block/zram0/writeback_limit .. .. # used up the writeback limit budget It's much strightforward. 2. fix condition check idle/huge writeback mode check The mode in writeback_store is not bit opeartion any more so no need to use bit operations. Furthermore, current condition check is broken in that it does writeback every pages regardless of huge/idle. 3. clean up idle_store No need to use goto. [minchan@kernel.org: missed spin_lock_init] Link: http://lkml.kernel.org/r/20190103001601.GA255139@google.com Link: http://lkml.kernel.org/r/20181224033529.19450-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Suggested-by: John Dias <joaodias@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: John Dias <joaodias@google.com> Cc: Srinivas Paladugu <srnvs@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28zram: writeback throttleMinchan Kim
If there are lots of write IO with flash device, it could have a wearout problem of storage. To overcome the problem, admin needs to design write limitation to guarantee flash health for entire product life. This patch creates a new knob "writeback_limit" for zram. writeback_limit's default value is 0 so that it doesn't limit any writeback. If admin want to measure writeback count in a certain period, he could know it via /sys/block/zram0/bd_stat's 3rd column. If admin want to limit writeback as per-day 400M, he could do it like below. MB_SHIFT=20 4K_SHIFT=12 echo $((400<<MB_SHIFT>>4K_SHIFT)) > \ /sys/block/zram0/writeback_limit. If admin want to allow further write again, he could do it like below echo 0 > /sys/block/zram0/writeback_limit If admin want to see remaining writeback budget, cat /sys/block/zram0/writeback_limit The writeback_limit count will reset whenever you reset zram (e.g., system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of writeback happened until you reset the zram to allocate extra writeback budget in next setting is user's job. [minchan@kernel.org: v4] Link: http://lkml.kernel.org/r/20181203024045.153534-8-minchan@kernel.org Link: http://lkml.kernel.org/r/20181127055429.251614-8-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28zram: add bd_stat statisticsMinchan Kim
bd_stat represents things that happened in the backing device. Currently it supports bd_counts, bd_reads and bd_writes which are helpful to understand wearout of flash and memory saving. [minchan@kernel.org: v4] Link: http://lkml.kernel.org/r/20181203024045.153534-7-minchan@kernel.org Link: http://lkml.kernel.org/r/20181127055429.251614-7-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28zram: support idle/huge page writebackMinchan Kim
Add a new feature "zram idle/huge page writeback". In the zram-swap use case, zram usually has many idle/huge swap pages. It's pointless to keep them in memory (ie, zram). To solve this problem, this feature introduces idle/huge page writeback to the backing device so the goal is to save more memory space on embedded systems. Normal sequence to use idle/huge page writeback feature is as follows, while (1) { # mark allocated zram slot to idle echo all > /sys/block/zram0/idle # leave system working for several hours # Unless there is no access for some blocks on zram, # they are still IDLE marked pages. echo "idle" > /sys/block/zram0/writeback or/and echo "huge" > /sys/block/zram0/writeback # write the IDLE or/and huge marked slot into backing device # and free the memory. } Per the discussion at https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u, This patch removes direct incommpressibe page writeback feature (d2afd25114f4 ("zram: write incompressible pages to backing device")). Below concerns from Sergey: == &< == "IDLE writeback" is superior to "incompressible writeback". "incompressible writeback" is completely unpredictable and uncontrollable; it depens on data patterns and compression algorithms. While "IDLE writeback" is predictable. I even suspect, that, *ideally*, we can remove "incompressible writeback". "IDLE pages" is a super set which also includes "incompressible" pages. So, technically, we still can do "incompressible writeback" from "IDLE writeback" path; but a much more reasonable one, based on a page idling period. I understand that you want to keep "direct incompressible writeback" around. ZRAM is especially popular on devices which do suffer from flash wearout, so I can see "incompressible writeback" path becoming a dead code, long term. == &< == Below concerns from Minchan: == &< == My concern is if we enable CONFIG_ZRAM_WRITEBACK in this implementation, both hugepage/idlepage writeck will turn on. However someuser want to enable only idlepage writeback so we need to introduce turn on/off knob for hugepage or new CONFIG_ZRAM_IDLEPAGE_WRITEBACK for those usecase. I don't want to make it complicated *if possible*. Long term, I imagine we need to make VM aware of new swap hierarchy a little bit different with as-is. For example, first high priority swap can return -EIO or -ENOCOMP, swap try to fallback to next lower priority swap device. With that, hugepage writeback will work tranparently. So we could regard it as regression because incompressible pages doesn't go to backing storage automatically. Instead, user should do it via "echo huge" > /sys/block/zram/writeback" manually. == &< == Link: http://lkml.kernel.org/r/20181127055429.251614-6-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28zram: introduce ZRAM_IDLE flagMinchan Kim
To support idle page writeback with upcoming patches, this patch introduces a new ZRAM_IDLE flag. Userspace can mark zram slots as "idle" via "echo all > /sys/block/zramX/idle" which marks every allocated zram slot as ZRAM_IDLE. User could see it by /sys/kernel/debug/zram/zram0/block_state. 300 75.033841 ...i 301 63.806904 s..i 302 63.806919 ..hi Once there is IO for the slot, the mark will be disappeared. 300 75.033841 ... 301 63.806904 s..i 302 63.806919 ..hi Therefore, 300th block is idle zpage. With this feature, user can how many zram has idle pages which are waste of memory. Link: http://lkml.kernel.org/r/20181127055429.251614-5-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28zram: refactor flags and writeback stuffMinchan Kim
Rename some variables and restructure some code for better readability in writeback and zs_free_page. Link: http://lkml.kernel.org/r/20181127055429.251614-4-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28zram: fix lockdep warning of free block handlingMinchan Kim
Patch series "zram idle page writeback", v3. Inherently, swap device has many idle pages which are rare touched since it was allocated. It is never problem if we use storage device as swap. However, it's just waste for zram-swap. This patchset supports zram idle page writeback feature. * Admin can define what is idle page "no access since X time ago" * Admin can define when zram should writeback them * Admin can define when zram should stop writeback to prevent wearout Details are in each patch's description. This patch (of 7): ================================ WARNING: inconsistent lock state 4.19.0+ #390 Not tainted -------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes: 00000000b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: put_entry_bdev+0x1e/0x50 {SOFTIRQ-ON-W} state was registered at: _raw_spin_lock+0x2c/0x40 zram_make_request+0x755/0xdc9 generic_make_request+0x373/0x6a0 submit_bio+0x6c/0x140 __swap_writepage+0x3a8/0x480 shrink_page_list+0x1102/0x1a60 shrink_inactive_list+0x21b/0x3f0 shrink_node_memcg.constprop.99+0x4f8/0x7e0 shrink_node+0x7d/0x2f0 do_try_to_free_pages+0xe0/0x300 try_to_free_pages+0x116/0x2b0 __alloc_pages_slowpath+0x3f4/0xf80 __alloc_pages_nodemask+0x2a2/0x2f0 __handle_mm_fault+0x42e/0xb50 handle_mm_fault+0x55/0xb0 __do_page_fault+0x235/0x4b0 page_fault+0x1e/0x30 irq event stamp: 228412 hardirqs last enabled at (228412): [<ffffffff98245846>] __slab_free+0x3e6/0x600 hardirqs last disabled at (228411): [<ffffffff98245625>] __slab_free+0x1c5/0x600 softirqs last enabled at (228396): [<ffffffff98e0031e>] __do_softirq+0x31e/0x427 softirqs last disabled at (228403): [<ffffffff98072051>] irq_exit+0xd1/0xe0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&zram->bitmap_lock)->rlock); <Interrupt> lock(&(&zram->bitmap_lock)->rlock); *** DEADLOCK *** no locks held by zram_verify/2095. stack backtrace: CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: <IRQ> dump_stack+0x67/0x9b print_usage_bug+0x1bd/0x1d3 mark_lock+0x4aa/0x540 __lock_acquire+0x51d/0x1300 lock_acquire+0x90/0x180 _raw_spin_lock+0x2c/0x40 put_entry_bdev+0x1e/0x50 zram_free_page+0xf6/0x110 zram_slot_free_notify+0x42/0xa0 end_swap_bio_read+0x5b/0x170 blk_update_request+0x8f/0x340 scsi_end_request+0x2c/0x1e0 scsi_io_completion+0x98/0x650 blk_done_softirq+0x9e/0xd0 __do_softirq+0xcc/0x427 irq_exit+0xd1/0xe0 do_IRQ+0x93/0x120 common_interrupt+0xf/0xf </IRQ> With writeback feature, zram_slot_free_notify could be called in softirq context by end_swap_bio_read. However, bitmap_lock is not aware of that so lockdep yell out: get_entry_bdev spin_lock(bitmap->lock); irq softirq end_swap_bio_read zram_slot_free_notify zram_slot_lock <-- deadlock prone zram_free_page put_entry_bdev spin_lock(bitmap->lock); <-- deadlock prone With akpm's suggestion (i.e. bitmap operation is already atomic), we could remove bitmap lock. It might fail to find a empty slot if serious contention happens. However, it's not severe problem because huge page writeback has already possiblity to fail if there is severe memory pressure. Worst case is just keeping the incompressible in memory, not storage. The other problem is zram_slot_lock in zram_slot_slot_free_notify. To make it safe is this patch introduces zram_slot_trylock where zram_slot_free_notify uses it. Although it's rare to be contented, this patch adds new debug stat "miss_free" to keep monitoring how often it happens. Link: http://lkml.kernel.org/r/20181127055429.251614-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07zram: introduce zram memory trackingMinchan Kim
zRam as swap is useful for small memory device. However, swap means those pages on zram are mostly cold pages due to VM's LRU algorithm. Especially, once init data for application are touched for launching, they tend to be not accessed any more and finally swapped out. zRAM can store such cold pages as compressed form but it's pointless to keep in memory. Better idea is app developers free them directly rather than remaining them on heap. This patch tell us last access time of each block of zram via "cat /sys/kernel/debug/zram/zram0/block_state". The output is as follows, 300 75.033841 .wh 301 63.806904 s.. 302 63.806919 ..h First column is zram's block index and 3rh one represents symbol (s: same page w: written page to backing store h: huge page) of the block state. Second column represents usec time unit of the block was last accessed. So above example means the 300th block is accessed at 75.033851 second and it was huge so it was written to the backing store. Admin can leverage this information to catch cold|incompressible pages of process with *pagemap* once part of heaps are swapped out. I used the feature a few years ago to find memory hoggers in userspace to notify them what memory they have wasted without touch for a long time. With it, they could reduce unnecessary memory space. However, at that time, I hacked up zram for the feature but now I need the feature again so I decided it would be better to upstream rather than keeping it alone. I hope I submit the userspace tool to use the feature soon. [akpm@linux-foundation.org: fix i386 printk warning] [minchan@kernel.org: use ktime_get_boottime() instead of sched_clock()] Link: http://lkml.kernel.org/r/20180420063525.GA253739@rodete-desktop-imager.corp.google.com [akpm@linux-foundation.org: documentation tweak] [akpm@linux-foundation.org: fix i386 printk warning] [minchan@kernel.org: fix compile warning] Link: http://lkml.kernel.org/r/20180508104849.GA8209@rodete-desktop-imager.corp.google.com [rdunlap@infradead.org: fix printk formats] Link: http://lkml.kernel.org/r/3652ccb1-96ef-0b0b-05d1-f661d7733dcc@infradead.org Link: http://lkml.kernel.org/r/20180416090946.63057-5-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07zram: record accessed secondMinchan Kim
zRam as swap is useful for small memory device. However, swap means those pages on zram are mostly cold pages due to VM's LRU algorithm. Especially, once init data for application are touched for launching, they tend to be not accessed any more and finally swapped out. zRAM can store such cold pages as compressed form but it's pointless to keep in memory. Better idea is app developers free them directly rather than remaining them on heap. This patch records last access time of each block of zram so that With upcoming zram memory tracking, it could help userspace developers to reduce memory footprint. Link: http://lkml.kernel.org/r/20180416090946.63057-4-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07zram: mark incompressible page as ZRAM_HUGEMinchan Kim
Mark incompressible pages so that we could investigate who is the owner of the incompressible pages once the page is swapped out via using upcoming zram memory tracker feature. With it, we could prevent such pages to be swapped out by using mlock. Otherwise we might remove them. This patch exposes new stat for huge pages via mm_stat. Link: http://lkml.kernel.org/r/20180416090946.63057-3-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07zram: correct flag name of ZRAM_ACCESSMinchan Kim
Patch series "zram memory tracking", v5. zRam as swap is useful for small memory device. However, swap means those pages on zram are mostly cold pages due to VM's LRU algorithm. Especially, once init data for application are touched for launching, they tend to be not accessed any more and finally swapped out. zRAM can store such cold pages as compressed form but it's pointless to keep in memory. As well, it's pointless to store incompressible pages to zram so better idea is app developers manages them directly like free or mlock rather than remaining them on heap. This patch provides a debugfs /sys/kernel/debug/zram/zram0/block_state to represent each block's state so admin can investigate what memory is cold|incompressible|same page with using pagemap once the pages are swapped out. The output is as follows: 300 75.033841 .wh 301 63.806904 s.. 302 63.806919 ..h First column is zram's block index and 3rh one represents symbol (s: same page w: written page to backing store h: huge page) of the block state. Second column represents usec time unit of the block was last accessed. So above example means the 300th block is accessed at 75.033851 second and it was huge so it was written to the backing store. This patch (of 4): ZRAM_ACCESS is used for locking a slot of zram so correct the name. It is also not a common flag to indicate status of the block so move the declare position on top of the flag. Lastly, let's move the function to the top of source code to be able to use it easily without forward declaration. Link: http://lkml.kernel.org/r/20180416090946.63057-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge updates from Andrew Morton: - a few misc things - ocfs2 updates - the v9fs maintainers have been missing for a long time. I've taken over v9fs patch slinging. - most of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (116 commits) mm,oom_reaper: check for MMF_OOM_SKIP before complaining mm/ksm: fix interaction with THP mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t headers: untangle kmemleak.h from mm.h include/linux/mmdebug.h: make VM_WARN* non-rvals mm/page_isolation.c: make start_isolate_page_range() fail if already isolated mm: change return type to vm_fault_t mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory kernel/fork.c: detect early free of a live mm mm: make counting of list_lru_one::nr_items lockless mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static block_invalidatepage(): only release page if the full page was invalidated mm: kernel-doc: add missing parameter descriptions mm/swap.c: remove @cold parameter description for release_pages() mm/nommu: remove description of alloc_vm_area zram: drop max_zpage_size and use zs_huge_class_size() zsmalloc: introduce zs_huge_class_size() mm: fix races between swapoff and flush dcache fs/direct-io.c: minor cleanups in do_blockdev_direct_IO ...
2018-04-05zram: drop max_zpage_size and use zs_huge_class_size()Sergey Senozhatsky
Remove ZRAM's enforced "huge object" value and use zsmalloc huge-class watermark instead, which makes more sense. TEST - I used a 1G zram device, LZO compression back-end, original data set size was 444MB. Looking at zsmalloc classes stats the test ended up to be pretty fair. BASE ZRAM/ZSMALLOC ===================== zram mm_stat 498978816 191482495 199831552 0 199831552 15634 0 zsmalloc classes class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 151 2448 0 0 1240 1240 744 3 0 168 2720 0 0 4200 4200 2800 2 0 190 3072 0 0 10100 10100 7575 3 0 202 3264 0 0 380 380 304 4 0 254 4096 0 0 10620 10620 10620 1 0 Total 7 46 106982 106187 48787 0 PATCHED ZRAM/ZSMALLOC ===================== zram mm_stat 498978816 182579184 194248704 0 194248704 15628 0 zsmalloc classes class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 151 2448 0 0 1240 1240 744 3 0 168 2720 0 0 4200 4200 2800 2 0 190 3072 0 0 10100 10100 7575 3 0 202 3264 0 0 7180 7180 5744 4 0 254 4096 0 0 3820 3820 3820 1 0 Total 8 45 106959 106193 47424 0 As we can see, we reduced the number of objects stored in class-4096, because a huge number of objects which we previously forcibly stored in class-4096 now stored in non-huge class-3264. This results in lower memory consumption: - zsmalloc now uses 47424 physical pages, which is less than 48787 pages zsmalloc used before. - objects that we store in class-3264 share zspages. That's why overall the number of pages that both class-4096 and class-3264 consumed went down from 10924 to 9564. [sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()] Link: http://lkml.kernel.org/r/20180314081833.1096-3-sergey.senozhatsky@gmail.com Link: http://lkml.kernel.org/r/20180306070639.7389-3-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-17block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into <linux/blkdev.h>Bart Van Assche
It happens often while I'm preparing a patch for a block driver that I'm wondering: is a definition of SECTOR_SIZE and/or SECTOR_SHIFT available for this driver? Do I have to introduce definitions of these constants before I can use these constants? To avoid this confusion, move the existing definitions of SECTOR_SIZE and SECTOR_SHIFT into the <linux/blkdev.h> header file such that these become available for all block drivers. Make the SECTOR_SIZE definition in the uapi msdos_fs.h header file conditional to avoid that including that header file after <linux/blkdev.h> causes the compiler to complain about a SECTOR_SIZE redefinition. Note: the SECTOR_SIZE / SECTOR_SHIFT / SECTOR_BITS definitions have not been removed from uapi header files nor from NAND drivers in which these constants are used for another purpose than converting block layer offsets and sizes into a number of sectors. Cc: David S. Miller <davem@davemloft.net> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-06zram: write incompressible pages to backing deviceMinchan Kim
This patch enables write IO to transfer data to backing device. For that, it implements write_to_bdev function which creates new bio and chaining with parent bio to make the parent bio asynchrnous. For rw_page which don't have parent bio, it submit owned bio and handle IO completion by zram_page_end_io. Also, this patch defines new flag ZRAM_WB to mark written page for later read IO. [xieyisheng1@huawei.com: fix typo in comment] Link: http://lkml.kernel.org/r/1502707447-6944-2-git-send-email-xieyisheng1@huawei.com Link: http://lkml.kernel.org/r/1498459987-24562-8-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Cc: Juneho Choi <juno.choi@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06zram: add free space management in backing deviceMinchan Kim
With backing device, zram needs management of free space of backing device. This patch adds bitmap logic to manage free space which is very naive. However, it would be simple enough as considering uncompressible pages's frequenty in zram. Link: http://lkml.kernel.org/r/1498459987-24562-6-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Juneho Choi <juno.choi@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06zram: add interface to specif backing deviceMinchan Kim
For writeback feature, user should set up backing device before the zram working. This patch enables the interface via /sys/block/zramX/backing_dev. Currently, it supports block device only but it could be enhanced for file as well. Link: http://lkml.kernel.org/r/1498459987-24562-5-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Juneho Choi <juno.choi@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03zram: remove zram_meta structureMinchan Kim
It's redundant now. Instead, remove it and use zram structure directly. Link: http://lkml.kernel.org/r/1492052365-16169-5-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24zram: extend zero pages to same element pageszhouxianrong
The idea is that without doing more calculations we extend zero pages to same element pages for zram. zero page is special case of same element page with zero element. 1. the test is done under android 7.0 2. startup too many applications circularly 3. sample the zero pages, same pages (none-zero element) and total pages in function page_zero_filled the result is listed as below: ZERO SAME TOTAL 36214 17842 598196 ZERO/TOTAL SAME/TOTAL (ZERO+SAME)/TOTAL ZERO/SAME AVERAGE 0.060631909 0.024990816 0.085622726 2.663825038 STDEV 0.00674612 0.005887625 0.009707034 2.115881328 MAX 0.069698422 0.030046087 0.094975336 7.56043956 MIN 0.03959586 0.007332205 0.056055193 1.928985507 from the above data, the benefit is about 2.5% and up to 3% of total swapout pages. The defect of the patch is that when we recovery a page from non-zero element the operations are low efficient for partial read. This patch extends zero_page to same_page so if there is any user to have monitored zero_pages, he will be surprised if the number is increased but it's not harmful, I believe. [minchan@kernel.org: do not free same element pages in zram_meta_free] Link: http://lkml.kernel.org/r/20170207065741.GA2567@bbox Link: http://lkml.kernel.org/r/1483692145-75357-1-git-send-email-zhouxianrong@huawei.com Link: http://lkml.kernel.org/r/1486307804-27903-1-git-send-email-minchan@kernel.org Signed-off-by: zhouxianrong <zhouxianrong@huawei.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24zram: remove waitqueue for IO doneMinchan Kim
zram_reset_device() waits for ongoing writepage pages to be completed by zram->refcount logic. However, it's pointless because before the reset, we prevent further opening of zram by zram->claim and flush all of pending IO by fsync_bdev so there should be no pending IO at the zram_reset_device(). So let's remove that code which is even broken due to the lack of wake_up elsewhere. Link: http://lkml.kernel.org/r/1485145031-11661-1-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26zram: use crypto api to check alg availabilitySergey Senozhatsky
There is no way to get a string with all the crypto comp algorithms supported by the crypto comp engine, so we need to maintain our own backends list. At the same time we additionally need to use crypto_has_comp() to make sure that the user has requested a compression algorithm that is recognized by the crypto comp engine. Relying on /proc/crypto is not an options here, because it does not show not-yet-inserted compression modules. Example: modprobe zram cat /proc/crypto | grep -i lz4 modprobe lz4 cat /proc/crypto | grep -i lz4 name : lz4 driver : lz4-generic module : lz4 So the user can't tell exactly if the lz4 is really supported from /proc/crypto output, unless someone or something has loaded it. This patch also adds crypto_has_comp() to zcomp_available_show(). We store all the compression algorithms names in zcomp's `backends' array, regardless the CONFIG_CRYPTO_FOO configuration, but show only those that are also supported by crypto engine. This helps user to know the exact list of compression algorithms that can be used. Example: module lz4 is not loaded yet, but is supported by the crypto engine. /proc/crypto has no information on this module, while zram's `comp_algorithm' lists it: cat /proc/crypto | grep -i lz4 cat /sys/block/zram0/comp_algorithm [lzo] lz4 deflate lz4hc 842 We still use the `backends' array to determine if the requested compression backend is known to crypto api. This array, however, may not contain some entries, therefore as the last step we call crypto_has_comp() function which attempts to insmod the requested compression algorithm to determine if crypto api supports it. The advantage of this method is that now we permit the usage of out-of-tree crypto compression modules (implementing S/W or H/W compression). [sergey.senozhatsky@gmail.com: zram-use-crypto-api-to-check-alg-availability-v3] Link: http://lkml.kernel.org/r/20160604024902.11778-4-sergey.senozhatsky@gmail.com Link: http://lkml.kernel.org/r/20160531122017.2878-5-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20zram: introduce per-device debug_stat sysfs nodeSergey Senozhatsky
debug_stat sysfs is read-only and represents various debugging data that zram developers may need. This file is not meant to be used by anyone else: its content is not documented and will change any time w/o any notice. Therefore, the output of debug_stat file contains a version string. To avoid any confusion, we will increase the version number every time we modify the output. At the moment this file exports only one value -- the number of re-compressions, IOW, the number of times compression fast path has failed. This stat is temporary any will be useful in case if any per-cpu compression streams regressions will be reported. Link: http://lkml.kernel.org/r/20160513230834.GB26763@bbox Link: http://lkml.kernel.org/r/20160511134553.12655-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20zram: remove max_comp_streams internalsSergey Senozhatsky
Remove the internal part of max_comp_streams interface, since we switched to per-cpu streams. We will keep RW max_comp_streams attr around, because: a) we may (silently) switch back to idle compression streams list and don't want to disturb user space b) max_comp_streams attr must wait for the next 'lay off cycle'; we give user space 2 years to adjust before we remove/downgrade the attr, and there are already several attrs scheduled for removal in 4.11, so it's too late for max_comp_streams. This slightly change a user visible behaviour: - First, reading from max_comp_stream file now will always return the number of online CPUs. - Second, writing to max_comp_stream will not take any effect. Link: http://lkml.kernel.org/r/20160503165546.25201-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08zsmalloc/zram: introduce zs_pool_stats apiSergey Senozhatsky
`zs_compact_control' accounts the number of migrated objects but it has a limited lifespan -- we lose it as soon as zs_compaction() returns back to zram. It worked fine, because (a) zram had it's own counter of migrated objects and (b) only zram could trigger compaction. However, this does not work for automatic pool compaction (not issued by zram). To account objects migrated during auto-compaction (issued by the shrinker) we need to store this number in zs_pool. Define a new `struct zs_pool_stats' structure to keep zs_pool's stats there. It provides only `num_migrated', as of this writing, but it surely can be extended. A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to caller. Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Suggested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25zram: close race by open overridingSergey Senozhatsky
[ Original patch from Minchan Kim <minchan@kernel.org> ] Commit ba6b17d68c8e ("zram: fix umount-reset_store-mount race condition") introduced bdev->bd_mutex to protect a race between mount and reset. At that time, we don't have dynamic zram-add/remove feature so it was okay. However, as we introduce dynamic device feature, bd_mutex became trouble. CPU 0 echo 1 > /sys/block/zram<id>/reset -> kernfs->s_active(A) -> zram:reset_store->bd_mutex(B) CPU 1 echo <id> > /sys/class/zram/zram-remove ->zram:zram_remove: bd_mutex(B) -> sysfs_remove_group -> kernfs->s_active(A) IOW, AB -> BA deadlock The reason we are holding bd_mutex for zram_remove is to prevent any incoming open /dev/zram[0-9]. Otherwise, we could remove zram others already have opened. But it causes above deadlock problem. To fix the problem, this patch overrides block_device.open and it returns -EBUSY if zram asserts he claims zram to reset so any incoming open will be failed so we don't need to hold bd_mutex for zram_remove ayn more. This patch is to prepare for zram-add/remove feature. [sergey.senozhatsky@gmail.com: simplify reset_store()] Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25zram: remove max_num_devices limitationSergey Senozhatsky
Limiting the number of zram devices to 32 (default max_num_devices value) is confusing, let's drop it. A user with 2TB or 4TB of RAM, for example, can request as many devices as he can handle. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15zram: support compactionMinchan Kim
Now that zsmalloc supports compaction, zram can use it. For the first step, this patch exports compact knob via sysfs so user can do compaction via "echo 1 > /sys/block/zram0/compact". Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Juneho Choi <juno.choi@lge.com> Cc: Gunho Lee <gunho.lee@lge.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12zram: remove request_queue from struct zramSergey Senozhatsky
`struct zram' contains both `struct gendisk' and `struct request_queue'. the latter can be deleted, because zram->disk carries ->queue pointer, and ->queue carries zram pointer: create_device() zram->queue->queuedata = zram zram->disk->queue = zram->queue zram->disk->private_data = zram so zram->queue is not needed, we can access all necessary data anyway. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12zram: remove init_lock in zram_make_requestMinchan Kim
Admin could reset zram during I/O operation going on so we have used zram->init_lock as read-side lock in I/O path to prevent sudden zram meta freeing. However, the init_lock is really troublesome. We can't do call zram_meta_alloc under init_lock due to lockdep splat because zram_rw_page is one of the function under reclaim path and hold it as read_lock while other places in process context hold it as write_lock. So, we have used allocation out of the lock to avoid lockdep warn but it's not good for readability and fainally, I met another lockdep splat between init_lock and cpu_hotplug from kmem_cache_destroy during working zsmalloc compaction. :( Yes, the ideal is to remove horrible init_lock of zram in rw path. This patch removes it in rw path and instead, add atomic refcount for meta lifetime management and completion to free meta in process context. It's important to free meta in process context because some of resource destruction needs mutex lock, which could be held if we releases the resource in reclaim context so it's deadlock, again. As a bonus, we could remove init_done check in rw path because zram_meta_get will do a role for it, instead. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Ganesh Mahendran <opensource.ganesh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/zram: correct ZRAM_ZERO flag bit positionMahendran Ganesh
In struct zram_table_entry, the element *value* contains obj size and obj zram flags. Bit 0 to bit (ZRAM_FLAG_SHIFT - 1) represent obj size, and bit ZRAM_FLAG_SHIFT to the highest bit of unsigned long represent obj zram_flags. So the first zram flag(ZRAM_ZERO) should be from ZRAM_FLAG_SHIFT instead of (ZRAM_FLAG_SHIFT + 1). This patch fixes this cosmetic issue. Also fix a typo, "page in now accessed" -> "page is now accessed" Signed-off-by: Mahendran Ganesh <opensource.ganesh@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Weijie Yang <weijie.yang@samsung.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09zram: report maximum used memoryMinchan Kim
Normally, zram user could get maximum memory usage zram consumed via polling mem_used_total with sysfs in userspace. But it has a critical problem because user can miss peak memory usage during update inverval of polling. For avoiding that, user should poll it with shorter interval(ie, 0.0000000001s) with mlocking to avoid page fault delay when memory pressure is heavy. It would be troublesome. This patch adds new knob "mem_used_max" so user could see the maximum memory usage easily via reading the knob and reset it via "echo 0 > /sys/block/zram0/mem_used_max". Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Dan Streetman <ddstreet@ieee.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: <juno.choi@lge.com> Cc: <seungho1.park@lge.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjennings@variantweb.net> Reviewed-by: David Horner <ds2horner@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09zram: zram memory size limitationMinchan Kim
Since zram has no control feature to limit memory usage, it makes hard to manage system memrory. This patch adds new knob "mem_limit" via sysfs to set up the a limit so that zram could fail allocation once it reaches the limit. In addition, user could change the limit in runtime so that he could manage the memory more dynamically. Initial state is no limit so it doesn't break old behavior. [akpm@linux-foundation.org: fix typo, per Sergey] Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: <juno.choi@lge.com> Cc: <seungho1.park@lge.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: David Horner <ds2horner@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29zram: fix incorrect stat with failed_readsChao Yu
Since we allocate a temporary buffer in zram_bvec_read to handle partial page operations in commit 924bd88d703e ("Staging: zram: allow partial page operations"), our ->failed_reads value may be incorrect as we do not increase its value when failing to allocate the temporary buffer. Let's fix this issue and correct the annotation of failed_reads. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06zram: replace global tb_lock with fine grain lockWeijie Yang
Currently, we use a rwlock tb_lock to protect concurrent access to the whole zram meta table. However, according to the actual access model, there is only a small chance for upper user to access the same table[index], so the current lock granularity is too big. The idea of optimization is to change the lock granularity from whole meta table to per table entry (table -> table[index]), so that we can protect concurrent access to the same table[index], meanwhile allow the maximum concurrency. With this in mind, several kinds of locks which could be used as a per-entry lock were tested and compared: Test environment: x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04, kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO. iozone test: iozone -t 4 -R -r 16K -s 200M -I +Z (1GB zram with ext4 filesystem, take the average of 10 tests, KB/s) Test base CAS spinlock rwlock bit_spinlock ------------------------------------------------------------------- Initial write 1381094 1425435 1422860 1423075 1421521 Rewrite 1529479 1641199 1668762 1672855 1654910 Read 8468009 11324979 11305569 11117273 10997202 Re-read 8467476 11260914 11248059 11145336 10906486 Reverse Read 6821393 8106334 8282174 8279195 8109186 Stride read 7191093 8994306 9153982 8961224 9004434 Random read 7156353 8957932 9167098 8980465 8940476 Mixed workload 4172747 5680814 5927825 5489578 5972253 Random write 1483044 1605588 1594329 1600453 1596010 Pwrite 1276644 1303108 1311612 1314228 1300960 Pread 4324337 4632869 4618386 4457870 4500166 To enhance the possibility of access the same table[index] concurrently, set zram a small disksize(10MB) and let threads run with large loop count. fio test: fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4 --filename=/dev/zram0 --name=seq-write --rw=write --stonewall --name=seq-read --rw=read --stonewall --name=seq-readwrite --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall (10MB zram raw block device, take the average of 10 tests, KB/s) Test base CAS spinlock rwlock bit_spinlock ------------------------------------------------------------- seq-write 933789 999357 1003298 995961 1001958 seq-read 5634130 6577930 6380861 6243912 6230006 seq-rw 1405687 1638117 1640256 1633903 1634459 rand-rw 1386119 1614664 1617211 1609267 1612471 All the optimization methods show a higher performance than the base, however, it is hard to say which method is the most appropriate. On the other hand, zram is mostly used on small embedded system, so we don't want to increase any memory footprint. This patch pick the bit_spinlock method, pack object size and page_flag into an unsigned long table.value, so as to not increase any memory overhead on both 32-bit and 64-bit system. On the third hand, even though different kinds of locks have different performances, we can ignore this difference, because: if zram is used as zram swapfile, the swap subsystem can prevent concurrent access to the same swapslot; if zram is used as zram-blk for set up filesystem on it, the upper filesystem and the page cache also prevent concurrent access of the same block mostly. So we can ignore the different performances among locks. Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06zram: remove unused SECTOR_SIZE defineSergey Senozhatsky
Drop SECTOR_SIZE define, because it's not used. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06zram: rename struct `table' to `zram_table_entry'Sergey Senozhatsky
Andrew Morton has recently noted that `struct table' actually represents table entry and, thus, should be renamed. Rename to `zram_table_entry'. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07zram: make compression algorithm selection possibleSergey Senozhatsky
Add and document `comp_algorithm' device attribute. This attribute allows to show supported compression and currently selected compression algorithms: cat /sys/block/zram0/comp_algorithm [lzo] lz4 and change selected compression algorithm: echo lzo > /sys/block/zram0/comp_algorithm Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07zram: add multi stream functionalitySergey Senozhatsky
Existing zram (zcomp) implementation has only one compression stream (buffer and algorithm private part), so in order to prevent data corruption only one write (compress operation) can use this compression stream, forcing all concurrent write operations to wait for stream lock to be released. This patch changes zcomp to keep a compression streams list of user-defined size (via sysfs device attr). Each write operation still exclusively holds compression stream, the difference is that we can have N write operations (depending on size of streams list) executing in parallel. See TEST section later in commit message for performance data. Introduce struct zcomp_strm_multi and a set of functions to manage zcomp_strm stream access. zcomp_strm_multi has a list of idle zcomp_strm structs, spinlock to protect idle list and wait queue, making it possible to perform parallel compressions. The following set of functions added: - zcomp_strm_multi_find()/zcomp_strm_multi_release() find and release a compression stream, implement required locking - zcomp_strm_multi_create()/zcomp_strm_multi_destroy() create and destroy zcomp_strm_multi zcomp ->strm_find() and ->strm_release() callbacks are set during initialisation to zcomp_strm_multi_find()/zcomp_strm_multi_release() correspondingly. Each time zcomp issues a zcomp_strm_multi_find() call, the following set of operations performed: - spin lock strm_lock - if idle list is not empty, remove zcomp_strm from idle list, spin unlock and return zcomp stream pointer to caller - if idle list is empty, current adds itself to wait queue. it will be awaken by zcomp_strm_multi_release() caller. zcomp_strm_multi_release(): - spin lock strm_lock - add zcomp stream to idle list - spin unlock, wake up sleeper Minchan Kim reported that spinlock-based locking scheme has demonstrated a severe perfomance regression for single compression stream case, comparing to mutex-based (see https://lkml.org/lkml/2014/2/18/16) base spinlock mutex ==Initial write ==Initial write ==Initial write records: 5 records: 5 records: 5 avg: 1642424.35 avg: 699610.40 avg: 1655583.71 std: 39890.95(2.43%) std: 232014.19(33.16%) std: 52293.96 max: 1690170.94 max: 1163473.45 max: 1697164.75 min: 1568669.52 min: 573429.88 min: 1553410.23 ==Rewrite ==Rewrite ==Rewrite records: 5 records: 5 records: 5 avg: 1611775.39 avg: 501406.64 avg: 1684419.11 std: 17144.58(1.06%) std: 15354.41(3.06%) std: 18367.42 max: 1641800.95 max: 531356.78 max: 1706445.84 min: 1593515.27 min: 488817.78 min: 1655335.73 When only one compression stream available, mutex with spin on owner tends to perform much better than frequent wait_event()/wake_up(). This is why single stream implemented as a special case with mutex locking. Introduce and document zram device attribute max_comp_streams. This attr shows and stores current zcomp's max number of zcomp streams (max_strm). Extend zcomp's zcomp_create() with `max_strm' parameter. `max_strm' limits the number of zcomp_strm structs in compression backend's idle list (max_comp_streams). max_comp_streams used during initialisation as follows: -- passing to zcomp_create() max_strm equals to 1 will initialise zcomp using single compression stream zcomp_strm_single (mutex-based locking). -- passing to zcomp_create() max_strm greater than 1 will initialise zcomp using multi compression stream zcomp_strm_multi (spinlock-based locking). default max_comp_streams value is 1, meaning that zram with single stream will be initialised. Later patch will introduce configuration knob to change max_comp_streams on already initialised and used zcomp. TEST iozone -t 3 -R -r 16K -s 60M -I +Z test base 1 strm (mutex) 3 strm (spinlock) ----------------------------------------------------------------------- Initial write 589286.78 583518.39 718011.05 Rewrite 604837.97 596776.38 1515125.72 Random write 584120.11 595714.58 1388850.25 Pwrite 535731.17 541117.38 739295.27 Fwrite 1418083.88 1478612.72 1484927.06 Usage example: set max_comp_streams to 4 echo 4 > /sys/block/zram0/max_comp_streams show current max_comp_streams (default value is 1). cat /sys/block/zram0/max_comp_streams Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07zram: use zcomp compressing backendsSergey Senozhatsky
Do not perform direct LZO compress/decompress calls, initialise and use zcomp LZO backend (single compression stream) instead. [akpm@linux-foundation.org: resolve conflicts with zram-delete-zram_init_device-fix.patch] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07zram: drop not used table `count' memberSergey Senozhatsky
struct table `count' member is not used. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07zram: use atomic64_t for all zram statsSergey Senozhatsky
This is a preparation patch for stats code duplication removal. 1) use atomic64_t for `pages_zero' and `pages_stored' zram stats. 2) `compr_size' and `pages_zero' struct zram_stats members did not follow the existing device attr naming scheme: zram_stats.ATTR has ATTR_show() function. rename them: -- compr_size -> compr_data_size -- pages_zero -> zero_pages Minchan Kim's note: If we really have trouble with atomic stat operation, we could change it with percpu_counter so that it could solve atomic overhead and unnecessary memory space by introducing unsigned long instead of 64bit atomic_t. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07zram: remove good and bad compress statsSergey Senozhatsky
Remove `good' and `bad' compressed sub-requests stats. RW request may cause a number of RW sub-requests. zram used to account `good' compressed sub-queries (with compressed size less than 50% of original size), `bad' compressed sub-queries (with compressed size greater that 75% of original size), leaving sub-requests with compression size between 50% and 75% of original size not accounted and not reported. zram already accounts each sub-request's compression size so we can calculate real device compression ratio. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07zram: drop `init_done' struct zram memberSergey Senozhatsky
Introduce init_done() helper function which allows us to drop `init_done' struct zram member. init_done() uses the fact that ->init_done == 1 equals to ->meta != NULL. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30zram: remove zram->lock in read path and change it with mutexMinchan Kim
Finally, we separated zram->lock dependency from 32bit stat/ table handling so there is no reason to use rw_semaphore between read and write path so this patch removes the lock from read path totally and changes rw_semaphore with mutex. So, we could do old: read-read: OK read-write: NO write-write: NO Now: read-read: OK read-write: OK write-write: NO The below data proves mixed workload performs well 11 times and there is also enhance on write-write path because current rw-semaphore doesn't support SPIN_ON_OWNER. It's side effect but anyway good thing for us. Write-related tests perform better (from 61% to 1058%) but read path has good/bad(from -2.22% to 1.45%) but they are all marginal within stddev. CPU 12 iozone -t -T -l 12 -u 12 -r 16K -s 60M -I +Z -V 0 ==Initial write ==Initial write records: 10 records: 10 avg: 516189.16 avg: 839907.96 std: 22486.53 (4.36%) std: 47902.17 (5.70%) max: 546970.60 max: 909910.35 min: 481131.54 min: 751148.38 ==Rewrite ==Rewrite records: 10 records: 10 avg: 509527.98 avg: 1050156.37 std: 45799.94 (8.99%) std: 40695.44 (3.88%) max: 611574.27 max: 1111929.26 min: 443679.95 min: 980409.62 ==Read ==Read records: 10 records: 10 avg: 4408624.17 avg: 4472546.76 std: 281152.61 (6.38%) std: 163662.78 (3.66%) max: 4867888.66 max: 4727351.03 min: 4058347.69 min: 4126520.88 ==Re-read ==Re-read records: 10 records: 10 avg: 4462147.53 avg: 4363257.75 std: 283546.11 (6.35%) std: 247292.63 (5.67%) max: 4912894.44 max: 4677241.75 min: 4131386.50 min: 4035235.84 ==Reverse Read ==Reverse Read records: 10 records: 10 avg: 4565865.97 avg: 4485818.08 std: 313395.63 (6.86%) std: 248470.10 (5.54%) max: 5232749.16 max: 4789749.94 min: 4185809.62 min: 3963081.34 ==Stride read ==Stride read records: 10 records: 10 avg: 4515981.80 avg: 4418806.01 std: 211192.32 (4.68%) std: 212837.97 (4.82%) max: 4889287.28 max: 4686967.22 min: 4210362.00 min: 4083041.84 ==Random read ==Random read records: 10 records: 10 avg: 4410525.23 avg: 4387093.18 std: 236693.22 (5.37%) std: 235285.23 (5.36%) max: 4713698.47 max: 4669760.62 min: 4057163.62 min: 3952002.16 ==Mixed workload ==Mixed workload records: 10 records: 10 avg: 243234.25 avg: 2818677.27 std: 28505.07 (11.72%) std: 195569.70 (6.94%) max: 288905.23 max: 3126478.11 min: 212473.16 min: 2484150.69 ==Random write ==Random write records: 10 records: 10 avg: 555887.07 avg: 1053057.79 std: 70841.98 (12.74%) std: 35195.36 (3.34%) max: 683188.28 max: 1096125.73 min: 437299.57 min: 992481.93 ==Pwrite ==Pwrite records: 10 records: 10 avg: 501745.93 avg: 810363.09 std: 16373.54 (3.26%) std: 19245.01 (2.37%) max: 518724.52 max: 833359.70 min: 464208.73 min: 765501.87 ==Pread ==Pread records: 10 records: 10 avg: 4539894.60 avg: 4457680.58 std: 197094.66 (4.34%) std: 188965.60 (4.24%) max: 4877170.38 max: 4689905.53 min: 4226326.03 min: 4095739.72 Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>