summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2011-01-13dm raid1: support discardMike Snitzer
Enable discard support in the DM mirror target. Also change an existing use of 'bvec' to 'addr' in the union. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm ioctl: allow rename to fill empty uuidPeter Jones
Allow the uuid of a mapped device to be set after device creation. Previously the uuid (which is optional) could only be set by DM_DEV_CREATE. If no uuid was supplied it could not be set later. Sometimes it's necessary to create the device before the uuid is known, and in such cases the uuid must be filled in after the creation. This patch extends DM_DEV_RENAME to accept a uuid accompanied by a new flag DM_UUID_FLAG. This can only be done once and if no uuid was previously supplied. It cannot be used to change an existing uuid. DM_VERSION_MINOR is also bumped to 19 to indicate this interface extension is available. Signed-off-by: Peter Jones <pjones@redhat.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm io: remove BIO_RW_SYNCIO flag from kcopydMikulas Patocka
Remove the REQ_SYNC flag to improve write throughput when writing to the origin with a snapshot on the same device (using the CFQ I/O scheduler). Sequential write throughput (chunksize of 4k, 32k, 512k) unpatched: 8.5, 8.6, 9.3 MB/s patched: 15.2, 18.5, 17.5 MB/s Snapshot exception reallocations are triggered by writes that are usually async, so mark the associated dm_io_request as async as well. This helps when using the CFQ I/O scheduler because it has separate queues for sync and async I/O. Async is optimized for throughput; sync for latency. With this change we're consciously favoring throughput over latency. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm mpath: disable blk_abort_queueMike Snitzer
Revert commit 224cb3e981f1b2f9f93dbd49eaef505d17d894c2 dm: Call blk_abort_queue on failed paths Multipath began to use blk_abort_queue() to allow for lower latency path deactivation. This was found to cause list corruption: the cmd gets blk_abort_queued/timedout run on it and the scsi eh somehow is able to complete and run scsi_queue_insert while scsi_request_fn is still trying to process the request. https://www.redhat.com/archives/dm-devel/2010-November/msg00085.html Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: stable@kernel.org
2011-01-13dm: dont take i_mutex to change device sizeMike Snitzer
No longer needlessly hold md->bdev->bd_inode->i_mutex when changing the size of a DM device. This additional locking is unnecessary because i_size_write() is already protected by the existing critical section in dm_swap_table(). DM already has a reference on md->bdev so the associated bd_inode may be changed without lifetime concerns. A negative side-effect of having held md->bdev->bd_inode->i_mutex was that a concurrent DM device resize and flush (via fsync) would deadlock. Dropping md->bdev->bd_inode->i_mutex eliminates this potential for deadlock. The following reproducer no longer deadlocks: https://www.redhat.com/archives/dm-devel/2009-July/msg00284.html Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
2011-01-13Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits) block: ensure that completion error gets properly traced blktrace: add missing probe argument to block_bio_complete block cfq: don't use atomic_t for cfq_group block cfq: don't use atomic_t for cfq_queue block: trace event block fix unassigned field block: add internal hd part table references block: fix accounting bug on cross partition merges kref: add kref_test_and_get bio-integrity: mark kintegrityd_wq highpri and CPU intensive block: make kblockd_workqueue smarter Revert "sd: implement sd_check_events()" block: Clean up exit_io_context() source code. Fix compile warnings due to missing removal of a 'ret' variable fs/block: type signature of major_to_index(int) to major_to_index(unsigned) block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p) cfq-iosched: don't check cfqg in choose_service_tree() fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors cdrom: export cdrom_check_events() sd: implement sd_check_events() sr: implement sr_check_events() ...
2011-01-12md: fix regression with re-adding devices to arrays with no metadataNeilBrown
Commit 1a855a0606 (2.6.37-rc4) fixed a problem where devices were re-added when they shouldn't be but caused a regression in a less common case that means sometimes devices cannot be re-added when they should be. In particular, when re-adding a device to an array without metadata we should always access the device, but after the above commit we didn't. This patch sets the In_sync flag in that case so that the re-add succeeds. This patch is suitable for any -stable kernel to which 1a855a0606 was applied. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-07block: trace event block fix unassigned fieldJeff Moyer
The "error" field in block_bio_complete is not assigned, leaving the memory area uninitialized (keeping garbage data). Pass an additional tracepoint argument to this event to initialize this field. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> CC: Steven Rostedt <rostedt@goodmis.org> CC: Frederic Weisbecker <fweisbec@gmail.com> CC: Ingo Molnar <mingo@elte.hu> CC: Thomas Gleixner <tglx@linutronix.de> CC: Li Zefan <lizf@cn.fujitsu.com> CC: Alan.Brunelle@hp.com Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-20Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: cciss: fix cciss_revalidate panic block: max hardware sectors limit wrapper block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead blk-throttle: Correct the placement of smp_rmb() blk-throttle: Trim/adjust slice_end once a bio has been dispatched block: check for proper length of iov entries earlier in blk_rq_map_user_iov() drbd: fix for spin_lock_irqsave in endio callback drbd: don't recvmsg with zero length
2010-12-17block: max hardware sectors limit wrapperMike Snitzer
Implement blk_limits_max_hw_sectors() and make blk_queue_max_hw_sectors() a wrapper around it. DM needs this to avoid setting queue_limits' max_hw_sectors and max_sectors directly. dm_set_device_limits() now leverages blk_limits_max_hw_sectors() logic to establish the appropriate max_hw_sectors minimum (PAGE_SIZE). Fixes issue where DM was incorrectly setting max_sectors rather than max_hw_sectors (which caused dm_merge_bvec()'s max_hw_sectors check to be ineffective). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-17block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits insteadMartin K. Petersen
When stacking devices, a request_queue is not always available. This forced us to have a no_cluster flag in the queue_limits that could be used as a carrier until the request_queue had been set up for a metadevice. There were several problems with that approach. First of all it was up to the stacking device to remember to set queue flag after stacking had completed. Also, the queue flag and the queue limits had to be kept in sync at all times. We got that wrong, which could lead to us issuing commands that went beyond the max scatterlist limit set by the driver. The proper fix is to avoid having two flags for tracking the same thing. We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the block layer merging functions. The queue_limit 'no_cluster' is turned into 'cluster' to avoid double negatives and to ease stacking. Clustering defaults to being enabled as before. The queue flag logic is removed from the stacking function, and explicitly setting the cluster flag is no longer necessary in DM and MD. Reported-by: Ed Lin <ed.lin@promise.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-09md: protect against NULL reference when waiting to start a raid10.NeilBrown
When we fail to start a raid10 for some reason, we call md_unregister_thread to kill the thread that was created. Unfortunately md_thread() will then make one call into the handler (raid10d) even though md_wakeup_thread has not been called. This is not safe and as md_unregister_thread is called after mddev->private has been set to NULL, it will definitely cause a NULL dereference. So fix this at both ends: - md_thread should only call the handler if THREAD_WAKEUP has been set. - raid10 should call md_unregister_thread before setting things to NULL just like all the other raid modules do. This is applicable to 2.6.35 and later. Cc: stable@kernel.org Reported-by: "Citizen" <citizen_lee@thecus.com> Signed-off-by: NeilBrown <neilb@suse.de>
2010-12-09md: fix bug with re-adding of partially recovered device.NeilBrown
With v0.90 metadata, a hot-spare does not become a full member of the array until recovery is complete. So if we re-add such a device to the array, we know that all of it is as up-to-date as the event count would suggest, and so it a bitmap-based recovery is possible. However with v1.x metadata, the hot-spare immediately becomes a full member of the array, but it record how much of the device has been recovered. If the array is stopped and re-assembled recovery starts from this point. When such a device is hot-added to an array we currently lose the 'how much is recovered' information and incorrectly included it as a full in-sync member (after bitmap-based fixup). This is wrong and unsafe and could corrupt data. So be more careful about setting saved_raid_disk - which is what guides the re-adding of devices back into an array. The new code matches the code in slot_store which does a similar thing, which is encouraging. This is suitable for any -stable kernel. Reported-by: "Dailey, Nate" <Nate.Dailey@stratus.com> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2010-12-09md: fix possible deadlock in handling flush requests.NeilBrown
As recorded in https://bugzilla.kernel.org/show_bug.cgi?id=24012 it is possible for a flush request through md to hang. This is due to an interaction between the recursion avoidance in generic_make_request, the insistence in md of only having one flush active at a time, and the possibility of dm (or md) submitting two flush requests to a device from the one generic_make_request. If a generic_make_request call into dm causes two flush requests to be queued (as happens if the dm table has two targets - they get one each), these two will be queued inside generic_make_request. Assume they are for the same md device. The first is processed and causes 1 or more flush requests to be sent to lower devices. These get queued within generic_make_request too. Then the second flush to the md device gets handled and it blocks waiting for the first flush to complete. But it won't complete until the two lower-device requests complete, and they haven't even been submitted yet as they are on the generic_make_request queue. The deadlock can be broken by using a separate thread to submit the requests to lower devices. md has such a thread readily available: md_wq. So use it to submit these requests. Reported-by: Giacomo Catenazzi <cate@cateee.net> Tested-by: Giacomo Catenazzi <cate@cateee.net> Signed-off-by: NeilBrown <neilb@suse.de>
2010-12-09md: move code in to submit_flushes.NeilBrown
submit_flushes is called from exactly one place. Move the code that is before and after that call into submit_flushes. This has not functional change, but will make the next patch smaller and easier to follow. Signed-off-by: NeilBrown <neilb@suse.de>
2010-12-09md: remove handling of flush_pending in md_submit_flush_dataNeilBrown
None of the functions called between setting flush_pending to 1, and atomic_dec_and_test can change flush_pending, or will anything running in any other thread (as ->flush_bio is not NULL). So the atomic_dec_and_test will always succeed. So remove the atomic_sec and the atomic_dec_and_test. Signed-off-by: NeilBrown <neilb@suse.de>
2010-11-27Merge branch 'cleanup-bd_claim' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into for-2.6.38/core
2010-11-24md: Call blk_queue_flush() to establish flush/fua supportDarrick J. Wong
Before 2.6.37, the md layer had a mechanism for catching I/Os with the barrier flag set, and translating the barrier into barriers for all the underlying devices. With 2.6.37, I/O barriers have become plain old flushes, and the md code was updated to reflect this. However, one piece was left out -- the md layer does not tell the block layer that it supports flushes or FUA access at all, which results in md silently dropping flush requests. Since the support already seems there, just add this one piece of bookkeeping. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: NeilBrown <neilb@suse.de>
2010-11-24md/raid1: really fix recovery looping when single good device fails.NeilBrown
Commit 4044ba58dd15cb01797c4fd034f39ef4a75f7cc3 supposedly fixed a problem where if a raid1 with just one good device gets a read-error during recovery, the recovery would abort and immediately restart in an infinite loop. However it depended on raid1_remove_disk removing the spare device from the array. But that does not happen in this case. So add a test so that in the 'recovery_disabled' case, the device will be removed. This suitable for any kernel since 2.6.29 which is when recovery_disabled was introduced. Cc: stable@kernel.org Reported-by: Sebastian Färber <faerber@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2010-11-24md: fix return value of rdev_size_change()Justin Maggard
When trying to grow an array by enlarging component devices, rdev_size_store() expects the return value of rdev_size_change() to be in sectors, but the actual value is returned in KBs. This functionality was broken by commit dd8ac336c13fd8afdb082ebacb1cddd5cf727889 so this patch is suitable for any kernel since 2.6.30. Cc: stable@kernel.org Signed-off-by: Justin Maggard <jmaggard10@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2010-11-16block: Rename "block_remap" tracepoint to "block_bio_remap" to clarify the ↵Mike Snitzer
event. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-13block: clean up blkdev_get() wrappers and their usersTejun Heo
After recent blkdev_get() modifications, open_by_devnum() and open_bdev_exclusive() are simple wrappers around blkdev_get(). Replace them with blkdev_get_by_dev() and blkdev_get_by_path(). blkdev_get_by_dev() is identical to open_by_devnum(). blkdev_get_by_path() is slightly different in that it doesn't automatically add %FMODE_EXCL to @mode. All users are converted. Most conversions are mechanical and don't introduce any behavior difference. There are several exceptions. * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no reason to OR it explicitly on blkdev_put(). * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in sb->s_mode. * With the above changes, sb->s_mode now always should contain FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect errors. The new blkdev_get_*() functions are with proper docbook comments. While at it, add function description to blkdev_get() too. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Neil Brown <neilb@suse.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Joern Engel <joern@lazybastard.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jan Kara <jack@suse.cz> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: reiserfs-devel@vger.kernel.org Cc: xfs-masters@oss.sgi.com Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13block: make blkdev_get/put() handle exclusive accessTejun Heo
Over time, block layer has accumulated a set of APIs dealing with bdev open, close, claim and release. * blkdev_get/put() are the primary open and close functions. * bd_claim/release() deal with exclusive open. * open/close_bdev_exclusive() are combination of open and claim and the other way around, respectively. * bd_link/unlink_disk_holder() to create and remove holder/slave symlinks. * open_by_devnum() wraps bdget() + blkdev_get(). The interface is a bit confusing and the decoupling of open and claim makes it impossible to properly guarantee exclusive access as in-kernel open + claim sequence can disturb the existing exclusive open even before the block layer knows the current open if for another exclusive access. Reorganize the interface such that, * blkdev_get() is extended to include exclusive access management. @holder argument is added and, if is @FMODE_EXCL specified, it will gain exclusive access atomically w.r.t. other exclusive accesses. * blkdev_put() is similarly extended. It now takes @mode argument and if @FMODE_EXCL is set, it releases an exclusive access. Also, when the last exclusive claim is released, the holder/slave symlinks are removed automatically. * bd_claim/release() and close_bdev_exclusive() are no longer necessary and either made static or removed. * bd_link_disk_holder() remains the same but bd_unlink_disk_holder() is no longer necessary and removed. * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev() and blkdev_get(). It also has an unexpected extra bdev_read_only() test which probably should be moved into blkdev_get(). * open_by_devnum() is modified to take @holder argument and pass it to blkdev_get(). Most of bdev open/close operations are unified into blkdev_get/put() and most exclusive accesses are tested atomically at the open time (as it should). This cleans up code and removes some, both valid and invalid, but unnecessary all the same, corner cases. open_bdev_exclusive() and open_by_devnum() can use further cleanup - rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop special features. Well, let's leave them for another day. Most conversions are straight-forward. drbd conversion is a bit more involved as there was some reordering, but the logic should stay the same. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Brown <neilb@suse.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Acked-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Philipp Reisner <philipp.reisner@linbit.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Alex Elder <aelder@sgi.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: dm-devel@redhat.com Cc: drbd-dev@lists.linbit.com Cc: Leo Chen <leochen@broadcom.com> Cc: Scott Branden <sbranden@broadcom.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Joern Engel <joern@logfs.org> Cc: reiserfs-devel@vger.kernel.org Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13block: simplify holder symlink handlingTejun Heo
Code to manage symlinks in /sys/block/*/{holders|slaves} are overly complex with multiple holder considerations, redundant extra references to all involved kobjects, unused generic kobject holder support and unnecessary mixup with bd_claim/release functionalities. Strip it down to what's necessary (single gendisk holder) and make it use a separate interface. This is a step for cleaning up bd_claim/release. This patch makes dm-table slightly more complex but it will be simplified again with further changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Brown <neilb@suse.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com
2010-11-10block: read i_size with i_size_read()Mike Snitzer
Convert direct reads of an inode's i_size to using i_size_read(). i_size_{read,write} use a seqcount to protect reads from accessing incomple writes. Concurrent i_size_write()s require mutual exclussion to protect the seqcount that is used by i_size_{read,write}. But i_size_read() callers do not need to use additional locking. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: NeilBrown <neilb@suse.de> Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-29md: tidy up device searches in read_balance.NeilBrown
The code for searching through the device list to read-balance in raid1 is rather clumsy and hard to follow. Try to simplify it a bit. No important functionality change here. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-29md/raid1: fix some typos in comments.NeilBrown
Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-29md/raid1: discard unused variable.NeilBrown
This structure field (flushing_bio_list) is never used, so remove it. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-29md: unplug writes to external bitmaps.NeilBrown
When writing to an 'external' bitmap we don't currently unplug the device before waiting, so we can get a 3msec delay each time; So use REQ_UNPLUG to force and unplug. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28md: use separate bio pool for each md device.NeilBrown
bio_clone and bio_alloc allocate from a common bio pool. If an md device is stacked with other devices that use this pool, or under something like swap which uses the pool, then the multiple calls on the pool can cause deadlocks. So allocate a local bio pool for each md array and use that rather than the common pool. This pool is used both for regular IO and metadata updates. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28md: change type of first arg to sync_page_io.NeilBrown
Currently sync_page_io takes a 'bdev'. Every caller passes 'rdev->bdev'. We will soon want another field out of the rdev in sync_page_io, So just pass the rdev instead of the bdev out of it. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28md/raid1: perform mem allocation before disabling writes during resync.NeilBrown
Though this mem alloc is GFP_NOIO an so will not deadlock, it seems better to do the allocation before 'raise_barrier' which stops any IO requests while the resync proceeds. raid10 always uses this order, so it is at least consistent to do the same in raid1. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28md: use bio_kmalloc rather than bio_alloc when failure is acceptable.NeilBrown
bio_alloc can never fail (as it uses a mempool) but an block indefinitely, especially if the caller is holding a reference to a previously allocated bio. So these to places which both handle failure and hold multiple bios should not use bio_alloc, they should use bio_kmalloc. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28md: Fix possible deadlock with multiple mempool allocations.NeilBrown
It is not safe to allocate from a mempool while holding an item previously allocated from that mempool as that can deadlock when the mempool is close to exhaustion. So don't use a bio list to collect the bios to write to multiple devices in raid1 and raid10. Instead queue each bio as it becomes available so an unplug will activate all previously allocated bios and so a new bio has a chance of being allocated. This means we must set the 'remaining' count to '1' before submitting any requests, then when all are submitted, decrement 'remaining' and possible handle the write completion at that point. Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com> Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28md: fix and update workqueue usageTejun Heo
Workqueue usage in md has two problems. * Flush can be used during or depended upon by memory reclaim, but md uses the system workqueue for flush_work which may lead to deadlock. * md depends on flush_scheduled_work() to achieve exclusion against completion of removal of previous instances. flush_scheduled_work() may incur unexpected amount of delay and is scheduled to be removed. This patch adds two workqueues to md - md_wq and md_misc_wq. The former is guaranteed to make forward progress under memory pressure and serves flush_work. The latter serves as the flush domain for other works. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28md: use sector_t in bitmap_get_counterNeilBrown
bitmap_get_counter returns the number of sectors covered by the counter in a pass-by-reference variable. In some cases this can be very large, so make it a sector_t for safety. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28md: remove md_mutex locking.NeilBrown
lock_kernel calls were recently pushed down into open/release functions. md doesn't need that protection. Then the BKL calls were change to md_mutex. We don't need those either. So remove it all. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28md: Fix regression with raid1 arrays without persistent metadata.NeilBrown
A RAID1 which has no persistent metadata, whether internal or external, will hang on the first write. This is caused by commit 070dc6dd7103b6b3f7e4d46e754354a5c15f366e In that case, MD_CHANGE_PENDING never gets cleared. So during md_update_sb, is neither persistent or external, clear MD_CHANGE_PENDING. This is suitable for 2.6.36-stable. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org
2010-10-26workqueues: s/ON_STACK/ONSTACK/Andrew Morton
Silly though it is, completions and wait_queue_heads use foo_ONSTACK (COMPLETION_INITIALIZER_ONSTACK, DECLARE_COMPLETION_ONSTACK, __WAIT_QUEUE_HEAD_INIT_ONSTACK and DECLARE_WAIT_QUEUE_HEAD_ONSTACK) so I guess workqueues should do the same thing. s/INIT_WORK_ON_STACK/INIT_WORK_ONSTACK/ s/INIT_DELAYED_WORK_ON_STACK/INIT_DELAYED_WORK_ONSTACK/ Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-22Merge branch 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits) xen-blkfront: disable barrier/flush write support Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c block: remove BLKDEV_IFL_WAIT aic7xxx_old: removed unused 'req' variable block: remove the BH_Eopnotsupp flag block: remove the BLKDEV_IFL_BARRIER flag block: remove the WRITE_BARRIER flag swap: do not send discards as barriers fat: do not send discards as barriers ext4: do not send discards as barriers jbd2: replace barriers with explicit flush / FUA usage jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier jbd: replace barriers with explicit flush / FUA usage nilfs2: replace barriers with explicit flush / FUA usage reiserfs: replace barriers with explicit flush / FUA usage gfs2: replace barriers with explicit flush / FUA usage btrfs: replace barriers with explicit flush / FUA usage xfs: replace barriers with explicit flush / FUA usage block: pass gfp_mask and flags to sb_issue_discard dm: convey that all flushes are processed as empty ...
2010-10-22Merge branch 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits) cfq-iosched: Fix a gcc 4.5 warning and put some comments block: Turn bvec_k{un,}map_irq() into static inline functions block: fix accounting bug on cross partition merges block: Make the integrity mapped property a bio flag block: Fix double free in blk_integrity_unregister block: Ensure physical block size is unsigned int blkio-throttle: Fix possible multiplication overflow in iops calculations blkio-throttle: limit max iops value to UINT_MAX blkio-throttle: There is no need to convert jiffies to milli seconds blkio-throttle: Fix link failure failure on i386 blkio: Recalculate the throttled bio dispatch time upon throttle limit change blkio: Add root group to td->tg_list blkio: deletion of a cgroup was causes oops blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n block: set the bounce_pfn to the actual DMA limit rather than to max memory block: revert bad fix for memory hotplug causing bounces Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK block: set the bounce_pfn to the actual DMA limit rather than to max memory block: Prevent hang_check firing during long I/O cfq: improve fsync performance for small files ... Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h
2010-10-22Merge branch 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bklLinus Torvalds
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: vfs: make no_llseek the default vfs: don't use BKL in default_llseek llseek: automatically add .llseek fop libfs: use generic_file_llseek for simple_attr mac80211: disallow seeks in minstrel debug code lirc: make chardev nonseekable viotape: use noop_llseek raw: use explicit llseek file operations ibmasmfs: use generic_file_llseek spufs: use llseek in all file operations arm/omap: use generic_file_llseek in iommu_debug lkdtm: use generic_file_llseek in debugfs net/wireless: use generic_file_llseek in debugfs drm: use noop_llseek
2010-10-22Merge branch 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bklLinus Torvalds
* 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: block: autoconvert trivial BKL users to private mutex drivers: autoconvert trivial BKL users to private mutex ipmi: autoconvert trivial BKL users to private mutex mac: autoconvert trivial BKL users to private mutex mtd: autoconvert trivial BKL users to private mutex scsi: autoconvert trivial BKL users to private mutex Fix up trivial conflicts (due to addition of private mutex right next to deletion of a version string) in drivers/char/pcmcia/cm40[04]0_cs.c
2010-10-19Merge branch 'v2.6.36-rc8' into for-2.6.37/barrierJens Axboe
Conflicts: block/blk-core.c drivers/block/loop.c mm/swapfile.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-15llseek: automatically add .llseek fopArnd Bergmann
All file_operations should get a .llseek operation so we can make nonseekable_open the default for future file operations without a .llseek pointer. The three cases that we can automatically detect are no_llseek, seq_lseek and default_llseek. For cases where we can we can automatically prove that the file offset is always ignored, we use noop_llseek, which maintains the current behavior of not returning an error from a seek. New drivers should normally not use noop_llseek but instead use no_llseek and call nonseekable_open at open time. Existing drivers can be converted to do the same when the maintainer knows for certain that no user code relies on calling seek on the device file. The generated code is often incorrectly indented and right now contains comments that clarify for each added line why a specific variant was chosen. In the version that gets submitted upstream, the comments will be gone and I will manually fix the indentation, because there does not seem to be a way to do that using coccinelle. Some amount of new code is currently sitting in linux-next that should get the same modifications, which I will do at the end of the merge window. Many thanks to Julia Lawall for helping me learn to write a semantic patch that does all this. ===== begin semantic patch ===== // This adds an llseek= method to all file operations, // as a preparation for making no_llseek the default. // // The rules are // - use no_llseek explicitly if we do nonseekable_open // - use seq_lseek for sequential files // - use default_llseek if we know we access f_pos // - use noop_llseek if we know we don't access f_pos, // but we still want to allow users to call lseek // @ open1 exists @ identifier nested_open; @@ nested_open(...) { <+... nonseekable_open(...) ...+> } @ open exists@ identifier open_f; identifier i, f; identifier open1.nested_open; @@ int open_f(struct inode *i, struct file *f) { <+... ( nonseekable_open(...) | nested_open(...) ) ...+> } @ read disable optional_qualifier exists @ identifier read_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; expression E; identifier func; @@ ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off) { <+... ( *off = E | *off += E | func(..., off, ...) | E = *off ) ...+> } @ read_no_fpos disable optional_qualifier exists @ identifier read_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; @@ ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off) { ... when != off } @ write @ identifier write_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; expression E; identifier func; @@ ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off) { <+... ( *off = E | *off += E | func(..., off, ...) | E = *off ) ...+> } @ write_no_fpos @ identifier write_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; @@ ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off) { ... when != off } @ fops0 @ identifier fops; @@ struct file_operations fops = { ... }; @ has_llseek depends on fops0 @ identifier fops0.fops; identifier llseek_f; @@ struct file_operations fops = { ... .llseek = llseek_f, ... }; @ has_read depends on fops0 @ identifier fops0.fops; identifier read_f; @@ struct file_operations fops = { ... .read = read_f, ... }; @ has_write depends on fops0 @ identifier fops0.fops; identifier write_f; @@ struct file_operations fops = { ... .write = write_f, ... }; @ has_open depends on fops0 @ identifier fops0.fops; identifier open_f; @@ struct file_operations fops = { ... .open = open_f, ... }; // use no_llseek if we call nonseekable_open //////////////////////////////////////////// @ nonseekable1 depends on !has_llseek && has_open @ identifier fops0.fops; identifier nso ~= "nonseekable_open"; @@ struct file_operations fops = { ... .open = nso, ... +.llseek = no_llseek, /* nonseekable */ }; @ nonseekable2 depends on !has_llseek @ identifier fops0.fops; identifier open.open_f; @@ struct file_operations fops = { ... .open = open_f, ... +.llseek = no_llseek, /* open uses nonseekable */ }; // use seq_lseek for sequential files ///////////////////////////////////// @ seq depends on !has_llseek @ identifier fops0.fops; identifier sr ~= "seq_read"; @@ struct file_operations fops = { ... .read = sr, ... +.llseek = seq_lseek, /* we have seq_read */ }; // use default_llseek if there is a readdir /////////////////////////////////////////// @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier readdir_e; @@ // any other fop is used that changes pos struct file_operations fops = { ... .readdir = readdir_e, ... +.llseek = default_llseek, /* readdir is present */ }; // use default_llseek if at least one of read/write touches f_pos ///////////////////////////////////////////////////////////////// @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read.read_f; @@ // read fops use offset struct file_operations fops = { ... .read = read_f, ... +.llseek = default_llseek, /* read accesses f_pos */ }; @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier write.write_f; @@ // write fops use offset struct file_operations fops = { ... .write = write_f, ... + .llseek = default_llseek, /* write accesses f_pos */ }; // Use noop_llseek if neither read nor write accesses f_pos /////////////////////////////////////////////////////////// @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read_no_fpos.read_f; identifier write_no_fpos.write_f; @@ // write fops use offset struct file_operations fops = { ... .write = write_f, .read = read_f, ... +.llseek = noop_llseek, /* read and write both use no f_pos */ }; @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier write_no_fpos.write_f; @@ struct file_operations fops = { ... .write = write_f, ... +.llseek = noop_llseek, /* write uses no f_pos */ }; @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read_no_fpos.read_f; @@ struct file_operations fops = { ... .read = read_f, ... +.llseek = noop_llseek, /* read uses no f_pos */ }; @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; @@ struct file_operations fops = { ... +.llseek = noop_llseek, /* no read or write fn */ }; ===== End semantic patch ===== Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Julia Lawall <julia@diku.dk> Cc: Christoph Hellwig <hch@infradead.org>
2010-10-07md: check return code of read_sb_pageVasiliy Kulikov
Function read_sb_page may return ERR_PTR(...). Check for it. Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-07md/raid1: minor bio initialisation improvements.NeilBrown
When performing a resync we pre-allocate some bios and repeatedly use them. This requires us to re-initialise them each time. One field (bi_comp_cpu) and some flags weren't being initiaised reliably. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-07md/raid1: avoid overflow in raid1 resync when bitmap is in use.NeilBrown
bitmap_start_sync returns - via a pass-by-reference variable - the number of sectors before we need to check with the bitmap again. Since commit ef4256733506f245 this number can be substantially larger, 2^27 is a common value. Unfortunately it is an 'int' and so when raid1.c:sync_request shifts it 9 places to the left it becomes 0. This results in a zero-length read which the scsi layer justifiably complains about. This patch just removes the shift so the common case becomes safe with a trivially-correct patch. In the next merge window we will convert this 'int' to a 'sector_t' Reported-by: "George Spelvin" <linux@horizon.com> Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-05block: autoconvert trivial BKL users to private mutexArnd Bergmann
The block device drivers have all gained new lock_kernel calls from a recent pushdown, and some of the drivers were already using the BKL before. This turns the BKL into a set of per-driver mutexes. Still need to check whether this is safe to do. file=$1 name=$2 if grep -q lock_kernel ${file} ; then if grep -q 'include.*linux.mutex.h' ${file} ; then sed -i '/include.*<linux\/smp_lock.h>/d' ${file} else sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file} fi sed -i ${file} \ -e "/^#include.*linux.mutex.h/,$ { 1,/^\(static\|int\|long\)/ { /^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex); } }" \ -e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \ -e '/[ ]*cycle_kernel_lock();/d' else sed -i -e '/include.*\<smp_lock.h\>/d' ${file} \ -e '/cycle_kernel_lock()/d' fi Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-09-17md: fix v1.x metadata update when a disk is missing.NeilBrown
If an array with 1.x metadata is assembled with the last disk missing, md doesn't properly record the fact that the disk was missing. This is unlikely to cause a real problem as the event count will be different to the count on the missing disk so it won't be included in the array. However it could still cause confusion. So make sure we clear all the relevant slots, not just the early ones. Signed-off-by: NeilBrown <neilb@suse.de>