path: root/drivers/md
2018-12-07  dm: set the static flush bio device on demand  (Dennis Zhou)
The next patch changes the macro bio_set_dev() to associate a bio with a blkg based on the device set. However, dm creates a static bio to be used as the basis for cloning empty flush bios on creation. The bio_set_dev() call in alloc_dev() will cause problems with the next patch adding association to bio_set_dev() because the call is before the bdev is associated with a gendisk (bd_disk is %NULL). To get around this, set the device on the static bio every time and use that to clone to the other bios. Signed-off-by: Dennis Zhou <dennis@kernel.org> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
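A minimal sketch of the idea (names like md->flush_bio, md->bdev and clone follow the commit text or are assumed here; this is not the exact dm.c change):

    /* Sketch only: rather than one bio_set_dev() in alloc_dev(), where
     * bd_disk is still NULL, set the device on the static flush bio
     * right before it is cloned, when the gendisk association exists. */
    bio_set_dev(&md->flush_bio, md->bdev);    /* (re)associate every time */
    __bio_clone_fast(clone, &md->flush_bio);  /* clones inherit the device */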
2018-12-07  dm zoned: Fix target BIO completion handling  (Damien Le Moal)
struct bioctx includes the ref refcount_t to track the number of I/O fragments used to process a target BIO as well as ensure that the zone of the BIO is kept in the active state throughout the lifetime of the BIO. However, since decrementing of this reference count is done in the target .end_io method, the function bio_endio() must be called multiple times for read and write target BIOs, which causes problems with the value of the __bi_remaining struct bio field for chained BIOs (e.g. the clone BIO passed by dm core is large and is split into fragments by the block layer), resulting in incorrect values and inconsistencies with the BIO_CHAIN flag setting. This in turn triggers the BUG_ON() call: BUG_ON(atomic_read(&bio->__bi_remaining) <= 0); in bio_remaining_done() called from bio_endio(). Fix this by ensuring that bio_endio() is called only once for any target BIO by always using internal clone BIOs for processing any read or write target BIO. This allows reference counting using the target BIO context counter to trigger the target BIO completion bio_endio() call once all data, metadata and other zone work triggered by the BIO complete. Overall, this simplifies the code too as the target .end_io becomes unnecessary and differences between read and write BIO issuing and completion processing disappear. Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-07  dm: call blk_queue_split() to impose device limits on bios  (Mike Snitzer)
Otherwise the incoming bios, of various types, won't be shaped based on the DM device's advertised limits. Depends-on: af67c31fba ("blk: remove bio_set arg from blk_queue_split()") Fixes: 744889b7cb ("block: don't deal with discard limit in blkdev_issue_discard()") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-07  dm cache metadata: verify cache has blocks in blocks_are_clean_separate_dirty()  (Mike Snitzer)
Otherwise dm_bitset_cursor_begin() returns -ENODATA. Other calls to dm_bitset_cursor_begin() have similar negative checks. Fixes inability to create a cache in passthrough mode (even though doing so makes no sense). Fixes: 0d963b6e65 ("dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirty") Cc: stable@vger.kernel.org Reported-by: David Teigland <teigland@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-11-20  crypto: drop mask=CRYPTO_ALG_ASYNC from 'shash' tfm allocations  (Eric Biggers)
'shash' algorithms are always synchronous, so passing CRYPTO_ALG_ASYNC in the mask to crypto_alloc_shash() has no effect. Many users therefore already don't pass it, but some still do. This inconsistency can cause confusion, especially since the way the 'mask' argument works is somewhat counterintuitive. Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags. This patch shouldn't change any actual behavior. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
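For illustration, the change amounts to dropping the mask bit at allocation time; the algorithm name below is just an example, not necessarily one touched by this patch:

    /* Before: the CRYPTO_ALG_ASYNC mask has no effect for shash */
    tfm = crypto_alloc_shash("crc32c", 0, CRYPTO_ALG_ASYNC);

    /* After: identical behavior, since shash algorithms are always synchronous */
    tfm = crypto_alloc_shash("crc32c", 0, 0);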
2018-11-20  crypto: drop mask=CRYPTO_ALG_ASYNC from 'cipher' tfm allocations  (Eric Biggers)
'cipher' algorithms (single block ciphers) are always synchronous, so passing CRYPTO_ALG_ASYNC in the mask to crypto_alloc_cipher() has no effect. Many users therefore already don't pass it, but some still do. This inconsistency can cause confusion, especially since the way the 'mask' argument works is somewhat counterintuitive. Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags. This patch shouldn't change any actual behavior. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-11-16  block: add queue_is_mq() helper  (Jens Axboe)
Various spots check for q->mq_ops being non-NULL; provide a helper to do this instead. Where the ->mq_ops != NULL check is redundant, remove it. Since mq == rq-based now that legacy is gone, get rid of queue_is_rq_based() and just use queue_is_mq() everywhere. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
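The helper presumably boils down to something like the sketch below (the real definition lives in include/linux/blkdev.h and may differ in detail):

    /* Sketch: a queue is blk-mq managed iff it has mq_ops set. */
    static inline bool queue_is_mq(struct request_queue *q)
    {
            return q->mq_ops;
    }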
2018-11-15  block: remove the lock argument to blk_alloc_queue_node  (Christoph Hellwig)
With the legacy request path gone there is no real need to override the queue_lock. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-02  Merge tag 'for-linus-20181102' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull block layer fixes from Jens Axboe:
 "The biggest part of this pull request is the revert of the blkcg cleanup series. It had one fix earlier for a stacked device issue, but another one was reported. Rather than play whack-a-mole with this, revert the entire series and try again for the next kernel release.
 Apart from that, only small fixes/changes.
 Summary:
  - Indentation fixup for mtip32xx (Colin Ian King)
  - The blkcg cleanup series revert (Dennis Zhou)
  - Two NVMe fixes. One fixing a regression in the nvme request initialization in this merge window, causing nvme-fc to not work. The other is a suspend/resume p2p resource issue (James, Keith)
  - Fix sg discard merge, allowing us to merge in cases where we didn't before (Jianchao Wang)
  - Call rq_qos_exit() after the queue is frozen, preventing a hang (Ming)
  - Fix brd queue setup, fixing an oops if we fail setting up all devices (Ming)"
* tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
  nvme-pci: fix conflicting p2p resource adds
  nvme-fc: fix request private initialization
  blkcg: revert blkcg cleanups series
  block: brd: associate with queue until adding disk
  block: call rq_qos_exit() after queue is frozen
  mtip32xx: clean an indentation issue, remove extraneous tabs
  block: fix the DISCARD request merge
2018-11-01  blkcg: revert blkcg cleanups series  (Dennis Zhou)
This reverts a series committed earlier due to a null pointer exception bug reported in [1]. It seems there are edge case interactions that I did not consider and will need some time to understand what causes the adverse interactions. The original series can be found in [2] with a follow up series in [3].
 [1] https://www.spinics.net/lists/cgroups/msg20719.html
 [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
 [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/
This reverts the following commits: d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae, f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef, a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53 Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-26  Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md  (Linus Torvalds)
Pull md updates from Shaohua Li:
 "This mainly improves raid10 cluster and fixes some bugs:
  - raid10 cluster improvements from Guoqing
  - Memory leak fixes from Jack and Xiao
  - raid10 hang fix from Alex
  - raid5 block faulty device fix from Mariusz
  - metadata update fix from Neil
  - Invalid disk role fix from me
  - Other cleanups"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  MD: Memory leak when flush bio size is zero
  md: fix memleak for mempool
  md-cluster: remove suspend_info
  md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interrupted
  md-cluster/bitmap: don't call md_bitmap_sync_with_cluster during reshaping stage
  md-cluster/raid10: don't call remove_and_add_spares during reshaping stage
  md-cluster/raid10: call update_size in md_reap_sync_thread
  md-cluster: introduce resync_info_get interface for sanity check
  md-cluster/raid10: support add disk under grow mode
  md-cluster/raid10: resize all the bitmaps before start reshape
  MD: fix invalid stored role for a disk - try2
  md/bitmap: use mddev_suspend/resume instead of ->quiesce()
  md: remove redundant code that is no longer reachable
  md: allow metadata updates while suspending an array - fix
  MD: fix invalid stored role for a disk
  md/raid10: Fix raid10 replace hang when new added disk faulty
  raid5: block failing device if raid will be failed
2018-10-26  Merge tag 'for-4.20/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm  (Linus Torvalds)
Pull device mapper updates from Mike Snitzer:
 - The biggest change this cycle is to remove support for the legacy IO path (.request_fn) from request-based DM. Jens has already started preparing for complete removal of the legacy IO path in 4.21 but this earlier removal of support from DM has been coordinated with Jens (as evidenced by the commit being attributed to him). Making request-based DM exclusively blk-mq only cleans up that portion of DM core quite nicely.
 - Convert the thinp and zoned targets over to using refcount_t where applicable.
 - A couple fixes to the DM zoned target for refcounting and other races buried in the implementation of metadata block creation and use.
 - Small cleanups to remove redundant unlikely() around a couple WARN_ON_ONCE().
 - Simplify how dm-ioctl copies from userspace, eliminating some potential for a malicious user trying to change the executed ioctl after its processing has begun.
 - Tweaked DM crypt target to use the DM device name when naming the various workqueues created for a particular DM crypt device (makes the N workqueues for a DM crypt device more easily understood and enhances user's accounting capabilities at a glance via "ps")
 - Small fixup to remove dead branch in DM writecache's memory_entry().
* tag 'for-4.20/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm writecache: remove disabled code in memory_entry()
  dm zoned: fix various dmz_get_mblock() issues
  dm zoned: fix metadata block ref counting
  dm raid: avoid bitmap with raid4/5/6 journal device
  dm crypt: make workqueue names device-specific
  dm: add dm_table_device_name()
  dm ioctl: harden copy_params()'s copy_from_user() from malicious users
  dm: remove unnecessary unlikely() around WARN_ON_ONCE()
  dm zoned: target: use refcount_t for dm zoned reference counters
  dm thin: use refcount_t for thin_c reference counting
  dm table: require that request-based DM be layered on blk-mq devices
  dm: rename DM_TYPE_MQ_REQUEST_BASED to DM_TYPE_REQUEST_BASED
  dm: remove legacy request-based IO path
2018-10-26  Merge tag 'for-linus-20181026' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull more block layer updates from Jens Axboe:
 - Set of patches improving support for zoned devices. This was ready before the merge window, but I was late in picking it up and hence it missed the original pull request (Damien, Christoph)
 - libata no link power management quirk addition for a Samsung drive (Diego Viola)
 - Fix for a performance regression in BFQ that went into this merge window (Federico Motta)
 - Fix for a missing dma mask setting return value check (Gustavo)
 - Typo in the gdrom queue failure case (me)
 - NULL pointer deref fix for xen-blkfront (Vasilis Liaskovitis)
 - Fixing the get_rq trace point placement in blk-mq (Xiaoguang Wang)
 - Removal of a set-but-not-read variable in cdrom (zhong jiang)
* tag 'for-linus-20181026' of git://git.kernel.dk/linux-block:
  libata: Apply NOLPM quirk for SAMSUNG MZ7TD256HAFV-000L9
  block, bfq: fix asymmetric scenarios detection
  gdrom: fix mistake in assignment of error
  blk-mq: place trace_block_getrq() in correct place
  block: Introduce blk_revalidate_disk_zones()
  block: add a report_zones method
  block: Expose queue nr_zones in sysfs
  block: Improve zone reset execution
  block: Introduce BLKGETNRZONES ioctl
  block: Introduce BLKGETZONESZ ioctl
  block: Limit allocation of zone descriptors for report zones
  block: Introduce blkdev_nr_zones() helper
  scsi: sd_zbc: Fix sd_zbc_check_zones() error checks
  scsi: sd_zbc: Reduce boot device scan and revalidate time
  scsi: sd_zbc: Rearrange code
  cdrom: remove set but not used variable 'tocuse'
  skd: fix unchecked return values
  xen/blkfront: avoid NULL blkfront_info dereference on device removal
2018-10-25  Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6  (Linus Torvalds)
Pull crypto updates from Herbert Xu:
 "API:
  - Remove VLA usage
  - Add cryptostat user-space interface
  - Add notifier for new crypto algorithms
 Algorithms:
  - Add OFB mode
  - Remove speck
 Drivers:
  - Remove x86/sha*-mb as they are buggy
  - Remove pcbc(aes) from x86/aesni
  - Improve performance of arm/ghash-ce by up to 85%
  - Implement CTS-CBC in arm64/aes-blk, faster by up to 50%
  - Remove PMULL based arm64/crc32 driver
  - Use PMULL in arm64/crct10dif
  - Add aes-ctr support in s5p-sss
  - Add caam/qi2 driver
 Others:
  - Pick better transform if one becomes available in crc-t10dif"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (124 commits)
  crypto: chelsio - Update ntx queue received from cxgb4
  crypto: ccree - avoid implicit enum conversion
  crypto: caam - add SPDX license identifier to all files
  crypto: caam/qi - simplify CGR allocation, freeing
  crypto: mxs-dcp - make symbols 'sha1_null_hash' and 'sha256_null_hash' static
  crypto: arm64/aes-blk - ensure XTS mask is always loaded
  crypto: testmgr - fix sizeof() on COMP_BUF_SIZE
  crypto: chtls - remove set but not used variable 'csk'
  crypto: axis - fix platform_no_drv_owner.cocci warnings
  crypto: x86/aes-ni - fix build error following fpu template removal
  crypto: arm64/aes - fix handling sub-block CTS-CBC inputs
  crypto: caam/qi2 - avoid double export
  crypto: mxs-dcp - Fix AES issues
  crypto: mxs-dcp - Fix SHA null hashes and output length
  crypto: mxs-dcp - Implement sha import/export
  crypto: aegis/generic - fix for big endian systems
  crypto: morus/generic - fix for big endian systems
  crypto: lrw - fix rebase error after out of bounds fix
  crypto: cavium/nitrox - use pci_alloc_irq_vectors() while enabling MSI-X.
  crypto: cavium/nitrox - NITROX command queue changes.
  ...
2018-10-25  block: Introduce blk_revalidate_disk_zones()  (Damien Le Moal)
Drivers exposing zoned block devices have to initialize and maintain correctness (i.e. revalidate) of the device zone bitmaps attached to the device request queue (seq_zones_bitmap and seq_zones_wlock). To simplify coding this, introduce a generic helper function blk_revalidate_disk_zones() suitable for most (and likely all) cases. This new function always updates the seq_zones_bitmap and seq_zones_wlock bitmaps as well as the queue nr_zones field when called for a disk using a request based queue. For a disk using a BIO based queue, only the number of zones is updated since these queues do not have schedulers and so do not need the zone bitmaps. With this change, the zone bitmap initialization code in sd_zbc.c can be replaced with a call to this function in sd_zbc_read_zones(), which is called from the disk revalidate block operation method. A call to blk_revalidate_disk_zones() is also added to the null_blk driver for devices created with the zoned mode enabled. Finally, to ensure that zoned devices created with dm-linear or dm-flakey expose the correct number of zones through sysfs, a call to blk_revalidate_disk_zones() is added to dm_table_set_restrictions(). The zone bitmaps allocated and initialized with blk_revalidate_disk_zones() are freed automatically from __blk_release_queue() using the block internal function blk_queue_free_zone_bitmaps(). Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-25  block: add a report_zones method  (Christoph Hellwig)
Dispatching a report zones command through the request queue is a major pain due to the command reply payload rewriting necessary. Given that blkdev_report_zones() is executing everything synchronously, implement report zones as a block device file operation instead, allowing major simplification of the code in many places. sd, null-blk, dm-linear and dm-flakey being the only block device drivers supporting exposing zoned block devices, these drivers are modified to provide the device side implementation of the report_zones() block device file operation. For device mappers, a new report_zones() target type operation is defined so that the upper block layer calls blkdev_report_zones() can be propagated down to the underlying devices of the dm targets. Implementation for this new operation is added to the dm-linear and dm-flakey targets. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [Damien] * Changed method block_device argument to gendisk * Various bug fixes and improvements * Added support for null_blk, dm-linear and dm-flakey. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-25  block: Introduce blkdev_nr_zones() helper  (Damien Le Moal)
Introduce the blkdev_nr_zones() helper function to get the total number of zones of a zoned block device. This number is always 0 for a regular block device (q->limits.zoned == BLK_ZONED_NONE case). Replace hard-coded number of zones calculation in dmz_get_zoned_device() with a call to this helper. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
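A hedged sketch of what such a helper computes (the _sketch suffix marks it as illustrative; the in-tree version may differ):

    #include <linux/blkdev.h>
    #include <linux/math64.h>

    /* Sketch: zero for regular devices, otherwise capacity divided by zone size. */
    static unsigned int blkdev_nr_zones_sketch(struct block_device *bdev)
    {
            struct request_queue *q = bdev_get_queue(bdev);
            sector_t zone_sectors = blk_queue_zone_sectors(q);
            sector_t capacity = get_capacity(bdev->bd_disk);

            if (q->limits.zoned == BLK_ZONED_NONE)
                    return 0;    /* not a zoned block device */

            /* round up: a partial last zone still counts as a zone */
            return div64_u64(capacity + zone_sectors - 1, zone_sectors);
    }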
2018-10-22  Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull block layer updates from Jens Axboe:
 "This is the main pull request for block changes for 4.20. This contains:
  - Series enabling runtime PM for blk-mq (Bart).
  - Two pull requests from Christoph for NVMe, with items such as;
    - Better AEN tracking
    - Multipath improvements
    - RDMA fixes
    - Rework of FC for target removal
    - Fixes for issues identified by static checkers
    - Fabric cleanups, as prep for TCP transport
    - Various cleanups and bug fixes
  - Block merging cleanups (Christoph)
  - Conversion of drivers to generic DMA mapping API (Christoph)
  - Series fixing ref count issues with blkcg (Dennis)
  - Series improving BFQ heuristics (Paolo, et al)
  - Series improving heuristics for the Kyber IO scheduler (Omar)
  - Removal of dangerous bio_rewind_iter() API (Ming)
  - Apply single queue IPI redirection logic to blk-mq (Ming)
  - Set of fixes and improvements for bcache (Coly et al)
  - Series closing a hotplug race with sysfs group attributes (Hannes)
  - Set of patches for lightnvm:
    - pblk trace support (Hans)
    - SPDX license header update (Javier)
    - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0 specs behind a common core interface. (Javier, Matias)
    - Enable pblk to use a common interface to retrieve chunk metadata (Matias)
    - Bug fixes (Various)
  - Set of fixes and updates to the blk IO latency target (Josef)
  - blk-mq queue number updates fixes (Jianchao)
  - Convert a bunch of drivers from the old legacy IO interface to blk-mq. This will conclude with the removal of the legacy IO interface itself in 4.21, with the rest of the drivers (me, Omar)
  - Removal of the DAC960 driver. The SCSI tree will introduce two replacement drivers for this (Hannes)"
* tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
  block: setup bounce bio_sets properly
  blkcg: reassociate bios when make_request() is called recursively
  blkcg: fix edge case for blk_get_rl() under memory pressure
  nvme-fabrics: move controller options matching to fabrics
  nvme-rdma: always have a valid trsvcid
  mtip32xx: fully switch to the generic DMA API
  rsxx: switch to the generic DMA API
  umem: switch to the generic DMA API
  sx8: switch to the generic DMA API
  sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
  skd: switch to the generic DMA API
  ubd: remove use of blk_rq_map_sg
  nvme-pci: remove duplicate check
  drivers/block: Remove DAC960 driver
  nvme-pci: fix hot removal during error handling
  nvmet-fcloop: suppress a compiler warning
  nvme-core: make implicit seed truncation explicit
  nvmet-fc: fix kernel-doc headers
  nvme-fc: rework the request initialization code
  nvme-fc: introduce struct nvme_fcp_op_w_sgl
  ...
2018-10-22  MD: Memory leak when flush bio size is zero  (Xiao Ni)
flush_pool is leaked when flush bio size is zero Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios") Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-22  md: fix memleak for mempool  (Jack Wang)
I noticed a kmemleak report of a memory leak when running create/stop md in a loop; backtrace:
  [<000000001ca975e7>] mempool_create_node+0x86/0xd0
  [<0000000095576bcd>] md_run+0x1057/0x1410 [md_mod]
  [<000000007b45c5fc>] do_md_run+0x15/0x130 [md_mod]
  [<000000001ede9ec0>] md_ioctl+0x1f49/0x25d0 [md_mod]
  [<000000004142cacf>] blkdev_ioctl+0x680/0xd00
The root cause is that we allocate mddev->flush_pool and mddev->flush_bio_pool in md_run, but do_md_stop does not call into md_stop, only __md_stop; moving the mempool_destroy calls to __md_stop fixes the problem for me. The bug was introduced in 5a409b4f56d5, so the fix should go to 4.18+. Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios") Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-22  dm writecache: remove disabled code in memory_entry()  (Mike Snitzer)
This dead branch was missed during review. It only makes memory_entry() more inefficient due to a needless call to is_power_of_2(), etc. Reported-by: shenghui <shhuiw@foxmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18  dm zoned: fix various dmz_get_mblock() issues  (Damien Le Moal)
dmz_fetch_mblock() called from dmz_get_mblock() has a race since the allocation of the new metadata block descriptor and its insertion in the cache rbtree with the READING state is not atomic. Two different contexts requesting the same block may end up each adding two different descriptors of the same block to the cache. Another problem for this function is that the BIO for processing the block read is allocated after the metadata block descriptor is inserted in the cache rbtree. If the BIO allocation fails, the metadata block descriptor is freed without first being removed from the rbtree. Fix the first problem by checking again if the requested block is not in the cache right before inserting the newly allocated descriptor, atomically under the mblk_lock spinlock. The second problem is fixed by simply allocating the BIO before inserting the new block in the cache. Finally, since dmz_fetch_mblock() also increments a block reference counter, rename the function to dmz_get_mblock_slow(). To be symmetric and clear, also rename dmz_lookup_mblock() to dmz_get_mblock_fast() and increment the block reference counter directly in that function rather than in dmz_get_mblock(). Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18  dm zoned: fix metadata block ref counting  (Damien Le Moal)
Since the ref field of struct dmz_mblock is always used with the spinlock of struct dmz_metadata locked, there is no need to use an atomic_t type. Change the type of the ref field to an unsigned integer. Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
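As an illustration of the simplification (field and lock names follow the commit text; the helper name is hypothetical and this is not the exact dm-zoned code):

    /* Plain integer suffices because mblk_lock serializes every access. */
    struct dmz_mblock {
            unsigned int ref;            /* was: atomic_t ref; */
            /* ... other fields elided ... */
    };

    static void dmz_mblock_ref_sketch(struct dmz_metadata *zmd,
                                      struct dmz_mblock *mblk)
    {
            spin_lock(&zmd->mblk_lock);  /* the lock the real code already holds */
            mblk->ref++;                 /* was: atomic_inc(&mblk->ref); */
            spin_unlock(&zmd->mblk_lock);
    }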
2018-10-18  dm raid: avoid bitmap with raid4/5/6 journal device  (Heinz Mauelshagen)
With raid4/5/6, journal device and write intent bitmap are mutually exclusive. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18  md-cluster: remove suspend_info  (Guoqing Jiang)
Previously we allowed multiple nodes to resync a device, but that was changed so that only one node can do resync at a time; however, suspend_info is still used. Now, let's remove the structure and use suspend_lo/hi to record the range. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18  md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interrupted  (Guoqing Jiang)
We need to continue the reshaping if it was interrupted in the original node, so the original node should call resync_bitmap in case reshaping is aborted. Then the BITMAP_NEEDS_SYNC message is broadcast to other nodes, and the node which continues the reshaping should restart the reshape from mddev->reshape_position instead of from the very beginning. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18  md-cluster/bitmap: don't call md_bitmap_sync_with_cluster during reshaping stage  (Guoqing Jiang)
When reshape is happening on one node, other nodes could receive lots of RESYNCING messages, which cause md_bitmap_sync_with_cluster to be called. Since the resync window in these RESYNCING messages is typically small, the WARN is always triggered, so we should not call the function when reshape is happening. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18  md-cluster/raid10: don't call remove_and_add_spares during reshaping stage  (Guoqing Jiang)
remove_and_add_spares is not needed if reshape is happening in another node, because raid10_add_disk, called inside raid10_start_reshape, would handle the role changes of the disk. Plus, remove_and_add_spares can't deal with the role change due to reshape. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18  md-cluster/raid10: call update_size in md_reap_sync_thread  (Guoqing Jiang)
We need to change the capacity in all nodes after one node finishes reshape. And as we did before, we can't change the capacity directly in md_do_sync; instead, the capacity should only be changed in update_size or on receipt of a CHANGE_CAPACITY msg. So the master node calls update_size after it completes reshape in md_reap_sync_thread, but we need to skip ops->update_size if MD_CLOSING is set since the reshape could not be finished. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18  md-cluster: introduce resync_info_get interface for sanity check  (Guoqing Jiang)
Since the resync region from suspend_info means that one node is reshaping this area, the position of reshape_progress should be included in the area. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18  md-cluster/raid10: support add disk under grow mode  (Guoqing Jiang)
For the clustered raid10 scenario, we need to let all the nodes know that a new disk has been added to the array. The reshape caused by adding the new member only needs to happen in one node, but the other nodes should know about the change. Reshape means reading data from somewhere (which is already used by the array) and writing data to an unused region; obviously, it is awful if one node is reading data from an address while another node is writing to the same address. Considering we have already implemented suspending writes in the resyncing area, we can just broadcast the reading address to the other nodes to avoid the trouble. The master node calls reshape_request and then updates the sb during the reshape period. To avoid the above trouble, we call resync_info_update to send a RESYNC message in reshape_request. From the slave node's view, it receives two types of message:
 1. RESYNCING message: the slave node adds the address (where the master node is reading data from) to the suspend list.
 2. METADATA_UPDATED message: once the slave nodes know the reshaping has started in the master node, it is time to update the reshape position and call start_reshape to follow the master node's steps.
After the reshape is done, only the reshape position needs to be updated, so the majority of the reshaping work happens on the master node. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18  md-cluster/raid10: resize all the bitmaps before start reshape  (Guoqing Jiang)
To support adding a disk under grow mode, we need to resize all the bitmaps of each node before the reshape, so that we can ensure all nodes have the same view of the bitmap of the clustered raid. So after the master node has resized the bitmap, it broadcasts a message to the other slave nodes, and it checks whether the size of each bitmap is the same by comparing pages. We can only continue the reshaping after all nodes have updated the bitmap to the same size (by checking the pages); otherwise the bitmap size is reverted to the previous value. The resize_bitmaps interface and BITMAP_RESIZE message are introduced in md-cluster.c for this purpose. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18  dm crypt: make workqueue names device-specific  (Michał Mirosław)
Make cpu-usage debugging easier by naming workqueues per device. Example ps output:
  root       413  0.0  0.0  0  0 ?  I<  paź02  0:00  [kcryptd_io/253:0]
  root       414  0.0  0.0  0  0 ?  I<  paź02  0:00  [kcryptd/253:0]
  root       415  0.0  0.0  0  0 ?  S   paź02  1:10  [dmcrypt_write/253:0]
  root       465  0.0  0.0  0  0 ?  I<  paź02  0:00  [kcryptd_io/253:2]
  root       466  0.0  0.0  0  0 ?  I<  paź02  0:00  [kcryptd/253:2]
  root       467  0.0  0.0  0  0 ?  S   paź02  2:06  [dmcrypt_write/253:2]
  root     15359  0.2  0.0  0  0 ?  I<  19:43  0:25  [kworker/u17:8-kcryptd/253:0]
  root     16563  0.2  0.0  0  0 ?  I<  20:10  0:18  [kworker/u17:0-kcryptd/253:2]
  root     23205  0.1  0.0  0  0 ?  I<  21:21  0:04  [kworker/u17:4-kcryptd/253:0]
  root     13383  0.1  0.0  0  0 ?  I<  21:32  0:02  [kworker/u17:2-kcryptd/253:2]
  root      2610  0.1  0.0  0  0 ?  I<  21:42  0:01  [kworker/u17:12-kcryptd/253:2]
  root     20124  0.1  0.0  0  0 ?  I<  21:56  0:01  [kworker/u17:1-kcryptd/253:2]
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18  dm: add dm_table_device_name()  (Michał Mirosław)
Add a shortcut for dm_device_name(dm_table_get_md(t)). Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
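Given that description, the shortcut presumably amounts to:

    /* Sketch of the added helper as described above. */
    const char *dm_table_device_name(struct dm_table *t)
    {
            return dm_device_name(dm_table_get_md(t));
    }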
2018-10-18  dm ioctl: harden copy_params()'s copy_from_user() from malicious users  (Wenwen Wang)
In copy_params(), the struct 'dm_ioctl' is first copied from the user space buffer 'user' to 'param_kernel' and the field 'data_size' is checked against 'minimum_data_size' (size of 'struct dm_ioctl' payload up to its 'data' member). If the check fails, an error code EINVAL will be returned. Otherwise, param_kernel->data_size is used to do a second copy, which copies from the same user-space buffer to 'dmi'. After the second copy, only 'dmi->data_size' is checked against 'param_kernel->data_size'. Given that the buffer 'user' resides in the user space, a malicious user-space process can race to change the content in the buffer between the two copies. This way, the attacker can inject inconsistent data into 'dmi' (versus previously validated 'param_kernel'). Fix redundant copying of 'minimum_data_size' from user-space buffer by using the first copy stored in 'param_kernel'. Also remove the 'data_size' check after the second copy because it is now unnecessary. Cc: stable@vger.kernel.org Signed-off-by: Wenwen Wang <wang6495@umn.edu> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
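A minimal sketch of the hardened pattern described above (function and variable names follow the commit text, but this is not the exact in-tree copy_params() code):

    #include <linux/uaccess.h>
    #include <linux/mm.h>
    #include <linux/dm-ioctl.h>

    static int copy_params_sketch(struct dm_ioctl __user *user,
                                  size_t minimum_data_size,
                                  struct dm_ioctl **out)
    {
            struct dm_ioctl param_kernel;
            struct dm_ioctl *dmi;

            if (copy_from_user(&param_kernel, user, minimum_data_size))
                    return -EFAULT;
            if (param_kernel.data_size < minimum_data_size)
                    return -EINVAL;

            dmi = kvmalloc(param_kernel.data_size, GFP_KERNEL);
            if (!dmi)
                    return -ENOMEM;

            /* A malicious user may rewrite the buffer between the two copies... */
            if (copy_from_user(dmi, user, param_kernel.data_size)) {
                    kvfree(dmi);
                    return -EFAULT;
            }

            /* ...so keep only the header validated by the first copy instead of
             * re-checking dmi->data_size after the second copy. */
            memcpy(dmi, &param_kernel, minimum_data_size);

            *out = dmi;
            return 0;
    }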
2018-10-16  dm: remove unnecessary unlikely() around WARN_ON_ONCE()  (Igor Stoppa)
WARN_ON() already contains an unlikely(), so it's not necessary to wrap it into another. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
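The pattern being removed looks like this; the condition shown is illustrative, not one taken from the dm code:

    /* Before: redundant, WARN_ON_ONCE() already marks its condition unlikely */
    if (unlikely(WARN_ON_ONCE(len > PAGE_SIZE)))
            return -EINVAL;

    /* After: same branch-prediction hints, less noise */
    if (WARN_ON_ONCE(len > PAGE_SIZE))
            return -EINVAL;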
2018-10-16  dm zoned: target: use refcount_t for dm zoned reference counters  (John Pittman)
The API surrounding refcount_t should be used in place of atomic_t when variables are being used as reference counters. This API can prevent issues such as counter overflows and use-after-free conditions. Within the dm zoned target stack, the atomic_t API is used for bioctx->ref and cw->refcount. Change these to use refcount_t, avoiding the issues mentioned. Signed-off-by: John Pittman <jpittman@redhat.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Tested-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-16  dm thin: use refcount_t for thin_c reference counting  (John Pittman)
The API surrounding refcount_t should be used in place of atomic_t when variables are being used as reference counters. It can potentially prevent reference counter overflows and use-after-free conditions. In the dm thin layer, one such example is tc->refcount. Change this from the atomic_t API to the refcount_t API to prevent mentioned conditions. Signed-off-by: John Pittman <jpittman@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
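An illustrative before/after of such a conversion (struct and function names are simplified sketches, not the exact dm-thin code):

    #include <linux/refcount.h>
    #include <linux/completion.h>

    struct thin_c_sketch {
            refcount_t refcount;            /* was: atomic_t refcount; */
            struct completion can_destroy;
    };

    static void thin_get(struct thin_c_sketch *tc)
    {
            refcount_inc(&tc->refcount);    /* was: atomic_inc(); refcount_t
                                             * saturates and warns on overflow */
    }

    static void thin_put(struct thin_c_sketch *tc)
    {
            /* was: atomic_dec_and_test(); refcount_t also catches underflow */
            if (refcount_dec_and_test(&tc->refcount))
                    complete(&tc->can_destroy);
    }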
2018-10-14  MD: fix invalid stored role for a disk - try2  (Shaohua Li)
Commit d595567dc4f0 (MD: fix invalid stored role for a disk) broke linear hotadd. Let's only fix the role for disks in raid1/10. Based on Guoqing's original patch. Reported-by: kernel test robot <rong.a.chen@intel.com> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-11  dm table: require that request-based DM be layered on blk-mq devices  (Mike Snitzer)
Now that request-based DM (multipath) is blk-mq only, this restriction is required while the legacy request-based IO path still exists. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-11  Merge tag 'alloc-args-v4.19-rc8' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux  (Greg Kroah-Hartman)
Kees writes:
 "Fix open-coded multiplication arguments to allocators
  - Fixes several new open-coded multiplications added in the 4.19 merge window."
* tag 'alloc-args-v4.19-rc8' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  treewide: Replace more open-coded allocation size multiplications
2018-10-11  dm: rename DM_TYPE_MQ_REQUEST_BASED to DM_TYPE_REQUEST_BASED  (Mike Snitzer)
Now that request-based DM is only using blk-mq, there is no need to differentiate between legacy "rq" and new "mq". We're back to a single request-based DM -- and there was much rejoicing! Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-11  dm: remove legacy request-based IO path  (Jens Axboe)
dm supports both, and since we're killing off the legacy path in general, get rid of it in dm. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-10  dm linear: fix linear_end_io conditional definition  (Damien Le Moal)
The dm-linear target is independent of the dm-zoned target. For code requiring support for zoned block devices, use CONFIG_BLK_DEV_ZONED instead of CONFIG_DM_ZONED. While at it, similarly to dm linear, also enable the DM_TARGET_ZONED_HM feature in dm-flakey only if CONFIG_BLK_DEV_ZONED is defined. Fixes: beb9caac211c1 ("dm linear: eliminate linear_end_io call if CONFIG_DM_ZONED disabled") Fixes: 0be12c1c7fce7 ("dm linear: add support for zoned block devices") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-10  md/bitmap: use mddev_suspend/resume instead of ->quiesce()  (Jack Wang)
After 9e1cc0a54556 ("md: use mddev_suspend/resume instead of ->quiesce()") we still have similar calls left in the bitmap functions. Replace quiesce() with mddev_suspend/resume. Also move md_bitmap_create out of mddev_suspend and move mddev_resume after md_bitmap_destroy, as we did in set_bitmap_file. Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Reviewed-by: Gioh Kim <gi-oh.kim@profitbricks.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-10  md: remove redundant code that is no longer reachable  (Colin Ian King)
An earlier commit removed the error label, leaving two statements that are now never reachable. Since this code is now dead code, remove it. Detected by CoverityScan, CID#1462409 ("Structurally dead code") Fixes: d5d885fd514f ("md: introduce new personality funciton start()") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-10  dm linear: eliminate linear_end_io call if CONFIG_DM_ZONED disabled  (Mike Snitzer)
It is best to avoid any extra overhead associated with bio completion. DM core will indirectly call a DM target's .end_io if it is defined. In the case of DM linear, there is no need to do so (for every bio that completes) if CONFIG_DM_ZONED is not enabled. Avoiding an extra indirect call for every bio completion is very important for ensuring DM linear doesn't incur more overhead that further widens the performance gap between dm-linear and raw block devices. Fixes: 0be12c1c7fce7 ("dm linear: add support for zoned block devices") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-09  dm: fix report zone remapping to account for partition offset  (Damien Le Moal)
If dm-linear or dm-flakey are layered on top of a partition of a zoned block device, remapping of the start sector and write pointer position of the zones reported by a report zones BIO must be modified to account for the target table entry mapping (start offset within the device and entry mapping with the dm device). If the target's backing device is a partition of a whole disk, the start sector on the physical device of the partition must also be accounted for when modifying the zone information. However, dm_remap_zone_report() was not considering this last case, resulting in incorrect zone information remapping with targets using disk partitions. Fix this by calculating the target backing device start sector using the position of the completed report zones BIO and the unchanged position and size of the original report zone BIO. With this value calculated, the start sector and write pointer position of the target zones can be correctly remapped. Fixes: 10999307c14e ("dm: introduce dm_remap_zone_report()") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-09  dm cache: destroy migration_cache if cache target registration failed  (Shenghui Wang)
Commit 7e6358d244e47 ("dm: fix various targets to dm_register_target after module __init resources created") inadvertently introduced this bug when it moved dm_register_target() after the call to KMEM_CACHE(). Fixes: 7e6358d244e47 ("dm: fix various targets to dm_register_target after module __init resources created") Cc: stable@vger.kernel.org Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-08  bcache: panic fix for making cache device  (Dongbo Cao)
When the nbuckets of the cache device is smaller than 1024, making the cache device will trigger a BUG_ON in the kernel; add a condition to avoid this. Reported-by: nitroxis <n@nxs.re> Signed-off-by: Dongbo Cao <cdbdyx@163.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>