summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-03-30ext4: simplify kobject usageTyson Nottingham
Replace kset with generic kobject provided by kobject_create_and_add(), since the latter is sufficient. Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-03-30ext4: remove unused parameters in sysfs codeTyson Nottingham
Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-03-30ext4: null out kobject* during sysfs cleanupTyson Nottingham
Make cleanup of ext4_feat kobject consistent with similar objects. Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-03-29dm: fix dropped return code from dm_get_bdev_for_ioctlMike Snitzer
dm_get_bdev_for_ioctl()'s return of 0 or 1 must be the result from prepare_ioctl (1 means the ioctl was issued to a partition, 0 means it wasn't). Unfortunately commit 519049afea ("dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl") reused the variable 'r' to store the return from blkdev_get() that follows prepare_ioctl() -- whereby dropping prepare_ioctl()'s result on the floor. This can lead to an ioctl or persistent reservation being issued to a partition going unnoticed, which implies the extra permission check for CAP_SYS_RAWIO is skipped. Fix this by using a different variable to store blkdev_get()'s return. Fixes: 519049afea ("dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl") Reported-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-29ext4: don't allow r/w mounts if metadata blocks overlap the superblockTheodore Ts'o
If some metadata block, such as an allocation bitmap, overlaps the superblock, it's very likely that if the file system is mounted read/write, the results will not be pretty. So disallow r/w mounts for file systems corrupted in this particular way. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2018-03-29ext4: always initialize the crc32c checksum driverTheodore Ts'o
The extended attribute code now uses the crc32c checksum for hashing purposes, so we should just always always initialize it. We also want to prevent NULL pointer dereferences if one of the metadata checksum features is enabled after the file sytsem is originally mounted. This issue has been assigned CVE-2018-1094. https://bugzilla.kernel.org/show_bug.cgi?id=199183 https://bugzilla.redhat.com/show_bug.cgi?id=1560788 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2018-03-29ext4: fail ext4_iget for root directory if unallocatedTheodore Ts'o
If the root directory has an i_links_count of zero, then when the file system is mounted, then when ext4_fill_super() notices the problem and tries to call iput() the root directory in the error return path, ext4_evict_inode() will try to free the inode on disk, before all of the file system structures are set up, and this will result in an OOPS caused by a NULL pointer dereference. This issue has been assigned CVE-2018-1092. https://bugzilla.kernel.org/show_bug.cgi?id=199179 https://bugzilla.redhat.com/show_bug.cgi?id=1560777 Reported-by: Wen Xu <wen.xu@gatech.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2018-03-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkman says: ==================== pull-request: bpf 2018-03-29 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix nfp to properly check max insn count while emitting instructions in the JIT which was wrongly comparing bytes against number of instructions before, from Jakub. 2) Fix for bpftool to avoid usage of hex numbers in JSON output since JSON doesn't accept hex numbers with 0x prefix, also from Jakub. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29lightnvm: remove function name in stringsMatias Bjørling
For the sysfs functions, the function names are embedded into their error strings. If the function name later changes, the string may not be updated accordingly. Update the strings to use __func__ to avoid this. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: remove some unnecessary NULL checksDan Carpenter
Smatch complains that flush_workqueue() dereferences the work queue pointer but then we check if it's NULL on the next line when it's too late. These NULL checks can be removed because the module won't load if we can't allocate the work queues. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: don't recover unwritten linesHans Holmberg
If the line has not been written to, we should not try to recover any data from it, so check the state of the chunks in the line before attempting to read smeta. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: implement 2.0 supportJavier González
Implement 2.0 support in pblk. This includes the address formatting and mapping paths, as well as the sysfs entries for them. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: implement get log report chunkJavier González
In preparation of pblk supporting 2.0, implement the get log report chunk in pblk. Also, define the chunk states as given in the 2.0 spec. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: rename ppaf* to addrf*Javier González
In preparation for 2.0 support in pblk, rename variables referring to the address format to addrf and reserve ppaf for the 1.2 path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: check for supported versionJavier González
At this point, only 1.2 spec is supported, thus check for it. Also, since device-side L2P is only supported in the 1.2 spec, make sure to only check its value under 1.2. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: implement get log report chunk helpersJavier González
The 2.0 spec provides a report chunk log page that can be retrieved using the stangard nvme get log page. This replaces the dedicated get/put bad block table in 1.2. This patch implements the helper functions to allow targets retrieve the chunk metadata using get log page. It makes nvme_get_log_ext available outside of nvme core so that we can use it form lightnvm. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: make address conversions depend on generic deviceJavier González
On address conversions, use the generic device, instead of the target device. This allows to use conversions outside of the target's realm. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: add support for 2.0 address formatJavier González
Add support for 2.0 address format. Also, align address bits for 1.2 and 2.0 to be able to operate on channel and luns without requiring a format conversion. Use a generic address format for this purpose. Also, convert the generic operations to the generic format in pblk. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: normalize geometry nomenclatureJavier González
Normalize nomenclature for naming channels, luns, chunks, planes and sectors as well as derivations in order to improve readability. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: complete geo structure with maxoc*Javier González
Complete the generic geometry structure with the maxoc and maxocpu felds, present in the 2.0 spec. Also, expose them through sysfs. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: add shorten OCSSD version in geoJavier González
Create a shorten version to use in the generic geometry. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: add minor version to generic geometryJavier González
Separate the version between major and minor on the generic geometry and represent it through sysfs in the 2.0 path. The 1.2 path only shows the major version to preserve the existing user space interface. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: simplify geometry structureJavier González
Currently, the device geometry is stored redundantly in the nvm_id and nvm_geo structures at a device level. Moreover, when instantiating targets on a specific number of LUNs, these structures are replicated and manually modified to fit the instance channel and LUN partitioning. Instead, create a generic geometry around nvm_geo, which can be used by (i) the underlying device to describe the geometry of the whole device, and (ii) instances to describe their geometry independently. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: refactor init/exit sequencesJavier González
Refactor init and exit sequences to eliminate dependencies among init modules and improve readability. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: Avoid validation of default op valueHeiner Litz
Fixes: 38401d231de65 ("lightnvm: set target over-provision on create ioctl") Signed-off-by: Heiner Litz <hlitz@ucsc.edu> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: centralize permission check for lightnvm ioctlJohannes Thumshirn
Currently all functions for handling the lightnvm core ioctl commands do a check for CAP_SYS_ADMIN. Change this to fail early in nvm_ctl_ioctl(), so we don't have to duplicate the permission checks all over. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: fix bad block initializationHeiner Litz
fix reading bad block device information to correctly setup the per line blk_bitmap during lightnvm initialization Signed-off-by: Heiner Litz <hlitz@ucsc.edu> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29nvme: lightnvm: add late setup of block size and metadataMatias Bjørling
The nvme driver sets up the size of the nvme namespace in two steps. First it initializes the device with standard logical block and metadata sizes, and then sets the correct logical block and metadata size. Due to the OCSSD 2.0 specification relies on the namespace to expose these sizes for correct initialization, let it be updated appropriately on the LightNVM side as well. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: remove nvm_dev_ops->max_phys_sectMatias Bjørling
The value of max_phys_sect is always static. Instead of defining it in the nvm_dev_ops structure, declare it as a global value. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: remove max_rq_sizeMatias Bjørling
The field is no longer used. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: add 2.0 geometry identificationMatias Bjørling
Implement the geometry data structures for 2.0 and enable a drive to be identified as one, including exposing the appropriate 2.0 sysfs entries. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: flatten nvm_id_group into nvm_idMatias Bjørling
There are no groups in the 2.0 specification, make sure that the nvm_id structure is flattened before 2.0 data structures are added. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: make 1.2 data structures explicitMatias Bjørling
Make the 1.2 data structures explicit, so it will be easy to identify the 2.0 data structures. Also fix the order of which the nvme_nvm_* are declared, such that they follow the nvme_nvm_command order. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: refactor bad block identificationJavier González
In preparation for the OCSSD 2.0 spec. bad block identification, refactor the current code to generalize bad block get/set functions and structures. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: prevent race in pblk_rb_flush_point_setHans Holmberg
Make sure that we are not advancing the sync pointer while we're adding bios to the write buffer entry completion list. This race condition results in bios not completing and was identified by a hang when running xfstest generic/113. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: allow allocation of new lines during shutdownHans Holmberg
When shutting down pblk the write buffer is flushed and if the current line can't fit the data in the write buffer we need to allocate a new line, so remove the check that prevents this. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: delete writer kick timer before stopping threadHans Holmberg
Unless we delete the timer that wakes up the write thread before we stop the thread we risk re-starting the thread, so delete the timer first. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: add padding distribution sysfs attributeHans Holmberg
When pblk receives a sync, all data up to that point in the write buffer must be comitted to persistent storage, and as flash memory comes with a minimal write size there is a significant cost involved both in terms of time for completing the sync and in terms of write amplification padded sectors for filling up to the minimal write size. In order to get a better understanding of the costs involved for syncs, Add a sysfs attribute to pblk: padded_dist, showing a normalized distribution of sectors padded. In order to facilitate measurements of specific workloads during the lifetime of the pblk instance, the distribution can be reset by writing 0 to the attribute. Do this by introducing counters for each possible padding: {0..(minimal write size - 1)} and calculate the normalized distribution when showing the attribute. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Rearranged total_buckets statement in pblk_sysfs_get_padding_dist Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: remove multiple groups in 1.2 data structureMatias Bjørling
Only one id group from the 1.2 specification is supported. Make sure that only the first group is accessible. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: remove mlc pairs structureMatias Bjørling
The known implementations of the 1.2 specification, and upcoming 2.0 implementation all expose a sequential list of pages to write. Remove the data structure, as it is no longer needed. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: export write amplification counters to sysfsHans Holmberg
In a SSD, write amplification, WA, is defined as the average number of page writes per user page write. Write amplification negatively affects write performance and decreases the lifetime of the disk, so it's a useful metric to add to sysfs. In plkb's case, the number of writes per user sector is the sum of: (1) number of user writes (2) number of sectors written by the garbage collector (3) number of sectors padded (i.e. due to syncs) This patch adds persistent counters for 1-3 and two sysfs attributes to export these along with WA calculated with five decimals: write_amp_mileage: the accumulated write amplification stats for the lifetime of the pblk instance write_amp_trip: resetable stats to facilitate delta measurements, values reset at creation and if 0 is written to the attribute. 64-bit counters are used as a 32 bit counter would wrap around already after about 17 TB worth of user data. It will take a long long time before the 64 bit sector counters wrap around. The counters are stored after the bad block bitmap in the first emeta sector of each written line. There is plenty of space in the first emeta sector, so we don't need to bump the major version of the line data format. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: check data lines version on recoveryHans Holmberg
As a preparation for future bumps of data line persistent storage versions, we need to start checking the emeta line version during recovery. Also slit up the current emeta/smeta version into two bytes (major,minor). Recovering lines with the same major number as the current pblk data line version must succeed. This means that any changes in the persistent format must be: (1) Backward compatible: if we switch back to and older kernel, recovery of lines stored with major == current_major and minor > current_minor must succeed. (2) Forward compatible: switching to a newer kernel, recovery of lines stored with major=current_major and minor < minor must handle the data format differences gracefully(i.e. initialize new data structures to default values). If we detect lines that have a different major number than the current we must abort recovery. The user must manually migrate the data in this case. Previously the version stored in the emeta header was copied from smeta, which has version 1, so we need to set the minor version to 1. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: handle bad sectors in the emeta area correctlyHans Holmberg
Unless we check if there are bad sectors in the entire emeta-area we risk ending up with valid bitmap / available sector count inconsistency. This results in lines with a bad chunk at the last LUN marked as bad, so go through the whole emeta area and mark up the invalid sectors. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: remove chnl_offset in nvme_nvm_identityMatias Bjørling
The identity structure is initialized to zero in the beginning of the nvme_nvm_identity function. The chnl_offset is separately set to zero. Since both the variable and assignment is never changed, remove them. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm/pblk-gc: Delete an error message for a failed memory allocation in ↵Markus Elfring
pblk_gc_line_prepare_ws() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-30Merge branch 'bpf-sockmap-ingress'Daniel Borkmann
John Fastabend says: ==================== This series adds the BPF_F_INGRESS flag support to the redirect APIs. Bringing the sockmap API in-line with the cls_bpf redirect APIs. We add it to both variants of sockmap programs, the first patch adds support for tx ulp hooks and the third patch adds support for the recv skb hooks. Patches two and four add tests for the corresponding ingress redirect hooks. Follow on patches can address busy polling support, but next series from me will move the sockmap sample program into selftests. v2: added static to function definition caught by kbuild bot v3: fixed an error branch with missing mem_uncharge in recvmsg op moved receive_queue check outside of RCU region ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-30bpf: sockmap, more BPF_SK_SKB_STREAM_VERDICT testsJohn Fastabend
Add BPF_SK_SKB_STREAM_VERDICT tests for ingress hook. While we do this also bring stream tests in-line with MSG based testing. A map for skb options is added for userland to push options at BPF programs. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-30bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT:John Fastabend
Add support for the BPF_F_INGRESS flag in skb redirect helper. To do this convert skb into a scatterlist and push into ingress queue. This is the same logic that is used in the sk_msg redirect helper so it should feel familiar. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-30bpf: sockmap, add BPF_F_INGRESS testsJohn Fastabend
Add a set of tests to verify ingress flag in redirect helpers works correctly with various msg sizes. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-30bpf: sockmap redirect ingress supportJohn Fastabend
Add support for the BPF_F_INGRESS flag in sk_msg redirect helper. To do this add a scatterlist ring for receiving socks to check before calling into regular recvmsg call path. Additionally, because the poll wakeup logic only checked the skb recv queue we need to add a hook in TCP stack (similar to write side) so that we have a way to wake up polling socks when a scatterlist is redirected to that sock. After this all that is needed is for the redirect helper to push the scatterlist into the psock receive queue. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>