path: root/drivers/md/raid10.h

Age    Commit message    Author

2016-11-22  md/raid10: add failfast handling for reads.  (NeilBrown)

If a device is marked FailFast, and it is not the only device we can read from, we mark the bio as MD_FAILFAST. If this does fail fast, we don't try read repair but just allow the failure. If it was the last device, it doesn't get marked Faulty, so the retry happens on the same device - this time without FAILFAST. A subsequent failure will not retry but will just pass up the error.

During resync we may use FAILFAST requests, and on a failure we will simply use the other device(s). During recovery we will only use FAILFAST in the unusual case where there are multiple places to read from - i.e. if there are more than 2 devices. If we get a failure, we will fail the device and complete the resync/recovery with the remaining devices.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
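
A minimal sketch of the read-path decision described above, in the spirit of raid10.c; the helper name and the usable_copies parameter are illustrative, not the kernel's exact identifiers:

    /* Mark a read bio failfast only when another copy could still
     * serve a retry (sketch; simplified from the real read path). */
    static void setup_read_failfast(struct md_rdev *rdev,
                                    struct bio *read_bio, int usable_copies)
    {
            if (test_bit(FailFast, &rdev->flags) && usable_copies > 1)
                    /* fail fast: on error, switch devices; no read repair */
                    read_bio->bi_opf |= MD_FAILFAST;
            /* else: last readable copy - retry on this same device
             * without MD_FAILFAST; a second failure is passed up. */
    }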

2016-07-19  raid10: improve random reads performance  (Tomasz Majchrzak)

RAID10 random read performance is lower than expected due to excessive spinlock utilisation, which is required mostly for rebuild/resync. Simplify allow_barrier, as it is in the IO path and encounters a lot of unnecessary contention.

As lower_barrier just takes a lock in order to decrement a counter, convert the counter (nr_pending) into an atomic variable and remove the spinlock. There is also contention around wake_up (it takes a lock internally), so call it only when it is really needed. As wake_up is no longer called constantly, ensure a process waiting to raise a barrier is notified when there are no more pending IOs.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
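
A condensed sketch of the resulting allow_barrier(), assuming a simplified wake condition (the real function also considers whether the array is being frozen):

    static void allow_barrier(struct r10conf *conf)
    {
            /* Drop one pending IO without taking a spinlock; only
             * touch the (lock-taking) waitqueue when the count hits
             * zero and a barrier-raiser may be sleeping. */
            if (atomic_dec_and_test(&conf->nr_pending))
                    wake_up(&conf->wait_barrier);
    }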
2015-08-31md/raid10: ensure device failure recorded before write request returns.NeilBrown
When a write to one of the legs of a RAID10 fails, the failure is recorded in the metadata of the other legs so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>
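
A condensed sketch of the deferral, using flag and list names from kernels of that era (logic much simplified; see raid10d and raid_end_bio_io() for the real flow):

    static void defer_end_io(struct r10conf *conf, struct r10bio *r10_bio)
    {
            unsigned long flags;

            /* A superblock write is now owed to the other legs. */
            set_bit(MD_CHANGE_PENDING, &conf->mddev->flags);

            /* Park the request; raid10d completes this list only after
             * md_update_sb() has cleared MD_CHANGE_PENDING. */
            spin_lock_irqsave(&conf->device_lock, flags);
            list_add(&r10_bio->retry_list, &conf->bio_end_io_list);
            spin_unlock_irqrestore(&conf->device_lock, flags);
    }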

2015-02-04  md: make ->congested robust against personality changes.  (NeilBrown)

There is currently no locking around calls to the 'congested' bdi function. If called at an awkward time while an array is being converted from one level (or personality) to another, there is a tiny chance of running code in an unreferenced module, etc.

So add a 'congested' function to the md_personality operations structure, and call it with appropriate locking from a central 'mddev_congested'.

When the array personality is changing, the array will be 'suspended', so no IO is processed. If mddev_congested detects this, it simply reports that the array is congested, which is a safe guess. As mddev_suspend calls synchronize_rcu(), mddev_congested can avoid races by including the whole call inside an rcu_read_lock() region. This requires that the congested functions for all subordinate devices can be run under rcu_read_lock(). Fortunately this is the case.

Signed-off-by: NeilBrown <neilb@suse.de>
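
A simplified sketch of the central wrapper (condensed from md.c of that release; special cases are omitted):

    static int mddev_congested(struct mddev *mddev, int bits)
    {
            struct md_personality *pers;
            int ret = 0;

            rcu_read_lock();        /* pins the personality module */
            pers = mddev->pers;
            if (mddev->suspended)
                    ret = 1;        /* safe guess: report congested */
            else if (pers && pers->congested)
                    ret = pers->congested(mddev, bits);
            rcu_read_unlock();
            return ret;
    }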

2013-02-26  MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)  (Jonathan Brassow)

The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe widths - copying them to a different location on the same devices after shifting the stripe. An example layout of each follows below:

    "far" algorithm
    dev1 dev2 dev3 dev4 dev5 dev6
    ==== ==== ==== ==== ==== ====
     A    B    C    D    E    F
     G    H    I    J    K    L
            ...
     F    A    B    C    D    E  --> Copy of stripe0, but shifted by 1
     L    G    H    I    J    K
            ...

    "offset" algorithm
    dev1 dev2 dev3 dev4 dev5 dev6
    ==== ==== ==== ==== ==== ====
     A    B    C    D    E    F
     F    A    B    C    D    E  --> Copy of stripe0, but shifted by 1
     G    H    I    J    K    L
     L    G    H    I    J    K
            ...

Redundancy for these algorithms is gained by shifting the copied stripes one device to the right. This patch proposes that the array be divided into sets of adjacent devices, and when the stripe copies are shifted, they wrap on set boundaries rather than on the array-size boundary. That is, for the purposes of shifting, the copies are confined to their sets within the array. The sets are 'near_copies * far_copies' in size.

The above "far" algorithm example would change to:

    "far" algorithm
    dev1 dev2 dev3 dev4 dev5 dev6
    ==== ==== ==== ==== ==== ====
     A    B    C    D    E    F
     G    H    I    J    K    L
            ...
     B    A    D    C    F    E  --> Copy of stripe0, shifted 1, 2-dev sets
     H    G    J    I    L    K      Dev sets are 1-2, 3-4, 5-6
            ...

This has the effect of improving the redundancy of the array. We can always sustain at least one failure, but sometimes more than one can be handled. In the first examples, the pairs of devices that CANNOT fail together are:
    (1,2) (2,3) (3,4) (4,5) (5,6) (1,6)  [40% of possible pairs]
In the example where the copies are confined to sets, the pairs of devices that cannot fail together are:
    (1,2) (3,4) (5,6)  [20% of possible pairs]

We cannot simply replace the old algorithms, so the 17th bit of the 'layout' variable is used to indicate whether we use the old or new method of computing the shift. (This is similar to the way the 16th bit indicates whether the "far" algorithm or the "offset" algorithm is being used.)

This patch only handles the cases where the number of total raid disks is a multiple of 'far_copies'. A follow-on patch addresses the condition where this is not true.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
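
The wrapping itself reduces to a little modular arithmetic. A self-contained sketch of the set-confined shift (the function name and the shift-of-one are illustrative; the real computation lives in raid10.c's physical-address mapping):

    /* Map a device index to the device holding its far copy when the
     * shift wraps inside a far_set_size-device set rather than
     * around the whole array. */
    static int far_copy_dev(int d, int far_set_size)
    {
            int set = d / far_set_size;      /* which set d belongs to */

            d = (d + 1) % far_set_size;      /* shift by one, wrap in-set */
            return set * far_set_size + d;   /* back to array-wide index */
    }

With far_set_size = 2 this reproduces the second diagram: dev1's data is copied to dev2 and vice versa, never crossing into the 3-4 or 5-6 sets.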

2012-08-18  md/raid10: fix problem with on-stack allocation of r10bio structure.  (NeilBrown)

A 'struct r10bio' has an array of per-copy information at the end. This array is declared with size [0], and r10bio_pool_alloc allocates enough extra space to store the per-copy information, depending on the number of copies needed.

So declaring a 'struct r10bio' on the stack isn't going to work. It won't allocate enough space, and memory corruption will ensue.

So in the two places where this is done, declare a sufficiently large structure and use that instead.

The two call-sites of this bug were introduced in 3.4 and 3.5, so this is suitable for both those kernels. The patch will have to be modified for 3.4, as that kernel only has one of the two call-sites.

Cc: stable@vger.kernel.org
Reported-by: Ivan Vasilyev <ivan.vasilyev@gmail.com>
Tested-by: Ivan Vasilyev <ivan.vasilyev@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
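
A sketch of the pitfall and the fix; the wrapper-struct pattern follows the patch, though the constant here is illustrative (the real code sizes the tail for the array's copy count):

    /* Broken: devs[] is a zero-length array, so this reserves no
     * per-copy slots and writing devs[0] corrupts the stack. */
    struct r10bio r10_bio;

    /* Fixed: wrap the header and explicitly reserve the tail. */
    struct {
            struct r10bio r10_bio;
            struct r10dev devs[2];   /* >= copies in this array */
    } on_stack;
    struct r10bio *r10b = &on_stack.r10_bio;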

2012-07-31  MD RAID10: Export md_raid10_congested  (Jonathan Brassow)

md/raid10: Export is_congested test.

In similar fashion to commits
    11d8a6e3719519fbc0e2c9d61b6fa931b84bf813
    1ed7242e591af7e233234d483f12d33818b189d9
we export the RAID10 congestion checking function so that dm-raid.c can make use of it when employing the RAID10 personality.

The 'queue' and 'gendisk' structures will not be available to the MD code when device-mapper sets up the device, so we also conditionalize access to these fields.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

2012-07-31  MD: Move macros from raid1*.h to raid1*.c  (Jonathan Brassow)

MD RAID1/RAID10: Move some macros from .h file to .c file

There are three macros (IO_BLOCKED, IO_MADE_GOOD, BIO_SPECIAL) which are defined in both raid1.h and raid10.h. They are only used in their respective .c files. However, if we wish to make RAID10 accessible to the device-mapper RAID target (dm-raid.c), then we need to move these macros into the .c files where they are used so that they do not conflict with each other.

The macros from the two files are identical and could be moved into md.h, but I chose to leave the duplication and have them remain in the personality files.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
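
For reference, the three macros in question (quoted as they appear in raid1.c/raid10.c; the comments here are paraphrased):

    /* Magic bio pointer values standing in for a real struct bio. */
    #define IO_BLOCKED    ((struct bio *)1)  /* slot must not be read for
                                              * this request */
    #define IO_MADE_GOOD  ((struct bio *)2)  /* write to a once-bad block
                                              * succeeded */
    #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)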

2012-07-31  MD RAID10: rename mirror_info structure  (Jonathan Brassow)

MD RAID10: Rename the structure 'mirror_info' to 'raid10_info'

The same structure name ('mirror_info') is used by raid1. Each of these structures is defined in its respective header file. If dm-raid is to support both RAID1 and RAID10, both header files will be included and the structure names must not collide.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

2012-05-22  md/raid10: add reshape support  (NeilBrown)

A 'near' or 'offset' layout RAID10 array can be reshaped to a different 'near' or 'offset' layout, a different chunk size, and a different number of devices. However, the number of copies cannot change.

Unlike RAID5/6, we do not support having user-space backup data that is being relocated during a 'critical section'. Rather, the data_offset of each device must change so that when writing any block to a new location, it will not over-write any data that is still 'live'.

This means that RAID10 reshape is not supportable on v0.90 metadata.

The difference between the old data_offset and the new data_offset must be at least the chunk size multiplied by offset copies, taking the larger value across the old and new layouts. (For 'near' mode, offset_copies == 1.)

A larger difference of around 64M seems useful for in-place reshapes, as more data can be moved between metadata updates. Very large differences (e.g. 512M) seem to slow the process down due to lots of long seeks (on oldish consumer-grade devices at least).

Metadata needs to be updated whenever the place we are about to write to is considered - by the current metadata - to still contain data in the old layout.

[unbalanced locking fix from Dan Carpenter <dan.carpenter@oracle.com>]

Signed-off-by: NeilBrown <neilb@suse.de>
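
The minimum-gap rule is simple arithmetic; a hypothetical sketch (the helper name and parameters are illustrative, not the raid10.c code):

    /* Smallest safe gap between old and new data_offset: the copies
     * written at one stripe position span chunk * offset_copies
     * sectors under each geometry, and the gap must clear both. */
    static sector_t min_offset_diff(sector_t old_chunk, int old_offset_copies,
                                    sector_t new_chunk, int new_offset_copies)
    {
            sector_t old_span = old_chunk * old_offset_copies;
            sector_t new_span = new_chunk * new_offset_copies;

            return max(old_span, new_span); /* 'near' mode: offset_copies == 1 */
    }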

2012-05-21  md/raid10: Introduce 'prev' geometry to support reshape.  (NeilBrown)

When RAID10 supports reshape, it will need a 'previous' and a 'current' geometry, so introduce them here. Use the 'prev' geometry when before the reshape_position, and the current 'geo' when beyond it. At other times, use both as appropriate.

For now, both are identical (and reshape_position is never set).

When we use the 'prev' geometry, we must use the old data_offset. When we use the current geometry (and a reshape is happening), we must use the new_data_offset.

Signed-off-by: NeilBrown <neilb@suse.de>
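
A condensed sketch of the selection, assuming a forward reshape; the real test in raid10.c also accounts for backwards reshapes:

    static struct geom *choose_geo(struct r10conf *conf, sector_t sector)
    {
            if (conf->reshape_progress != MaxSector &&
                sector >= conf->reshape_progress)
                    return &conf->prev;  /* not yet reshaped: old layout,
                                          * old data_offset */
            return &conf->geo;           /* reshaped (or no reshape):
                                          * current layout */
    }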

2012-05-21  md/raid10: collect some geometry fields into a dedicated structure.  (NeilBrown)

We will shortly be adding reshape support for RAID10, which will require it to have two concurrent geometries (before and after). To make that easier, collect most geometry fields into 'struct geom' and access them from there. Then we will more easily be able to add a second set of fields.

Note that 'copies' is not in this struct and so cannot be changed. There is little need to change this number, and doing so is a lot more difficult, as it requires reallocating more things. So leave it out for now.

Signed-off-by: NeilBrown <neilb@suse.de>
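
The gathered fields, roughly as they appear in drivers/md/raid10.h from this point on (comments paraphrased):

    struct geom {
            int      raid_disks;
            int      near_copies;  /* copies laid out raid0-style,
                                    * on adjacent devices */
            int      far_copies;   /* copies laid out at large strides
                                    * across the devices */
            int      far_offset;   /* far copies offset by one stripe
                                    * rather than by 'stride' */
            sector_t stride;       /* distance between far copies */
            int      chunk_shift;  /* shift from chunks to sectors */
            sector_t chunk_mask;
    };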

2011-12-23  md/raid10: prepare data structures for handling replacement.  (NeilBrown)

Allow each slot in the RAID10 array to have 2 devices: the want_replacement and the replacement. Also allow an r10bio to have 2 bios, and for resync/recovery allocate the second bio if there are any replacement devices.

Signed-off-by: NeilBrown <neilb@suse.de>
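
A sketch of the per-slot structure after this change (a field subset of that era's 'struct mirror_info' in raid10.h):

    struct mirror_info {
            struct md_rdev  *rdev;
            struct md_rdev  *replacement;  /* being rebuilt; takes over
                                            * from rdev once in-sync */
            sector_t        head_position;
            int             recovery_disabled;
    };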

2011-10-11  md: add proper write-congestion reporting to RAID1 and RAID10.  (NeilBrown)

RAID1 and RAID10 handle write requests by queuing them for handling by a separate thread. This is because when a write-intent bitmap is active we might need to update the bitmap first, so it is good to queue a lot of writes, then do one big bitmap update for them all.

However, writeback requires devices to appear congested after a while so that it can make some guesstimate of throughput. The infinite queue defeats that. (Note that RAID5 already has a finite queue, so it doesn't suffer from this problem.)

So impose a limit on the number of pending write requests. By default it is 1024, which seems to be generally suitable. Make it configurable via a module option just in case someone finds a regression.

Signed-off-by: NeilBrown <neilb@suse.de>
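
A condensed sketch of the knob and the check (the names max_queued_requests and pending_count follow raid10.c; surrounding code omitted):

    static int max_queued_requests = 1024;
    module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);

    /* ... inside the congestion callback ... */
    if (conf->pending_count >= max_queued_requests)
            return 1;   /* report write-congested to writeback */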

2011-10-11  md/raid10: typedef removal: conf_t -> struct r10conf  (NeilBrown)

Signed-off-by: NeilBrown <neilb@suse.de>

2011-10-11  md: remove typedefs: mirror_info_t -> struct mirror_info  (NeilBrown)

Signed-off-by: NeilBrown <neilb@suse.de>

2011-10-11  md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio  (NeilBrown)

Signed-off-by: NeilBrown <neilb@suse.de>

2011-10-11  md: remove typedefs: mdk_thread_t -> struct md_thread  (NeilBrown)

Signed-off-by: NeilBrown <neilb@suse.de>

2011-10-11  md: remove typedefs: mddev_t -> struct mddev  (NeilBrown)

Having both mddev_t and 'struct mddev_s' is ugly and not preferred.

Signed-off-by: NeilBrown <neilb@suse.de>

2011-10-11  md: removing typedefs: mdk_rdev_t -> struct md_rdev  (NeilBrown)

The typedefs are just annoying. 'mdk' probably refers to 'md_k.h', which used to be an include file that defined this thing.

Signed-off-by: NeilBrown <neilb@suse.de>

2011-07-28  md/raid10: Handle write errors by updating badblock log.  (NeilBrown)

When we get a write error (in the data area, not in metadata), update the badblock log rather than failing the whole device. As the write may well span many blocks, we try writing each block individually and only log the ones which fail.

Signed-off-by: NeilBrown <neilb@suse.de>
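
A condensed sketch of the per-block retry (cf. narrow_write_error() in raid10.c; submit_one_write() is an illustrative stand-in, returning true on success, for the inline bio build-and-submit):

    static bool rewrite_blockwise(struct md_rdev *rdev, sector_t sector,
                                  sector_t sectors_left, int block_sectors)
    {
            bool ok = true;

            while (sectors_left) {
                    int n = min_t(int, sectors_left, block_sectors);

                    if (!submit_one_write(rdev, sector, n))
                            /* still failing: log just this block */
                            ok = rdev_set_badblocks(rdev, sector, n, 0) && ok;
                    sector += n;
                    sectors_left -= n;
            }
            return ok;
    }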

2011-07-28  md/raid10: clear bad-block record when write succeeds.  (NeilBrown)

If we succeed in writing to a block that was recorded as being bad, we clear the bad-block record. This requires some delayed handling, as the bad-block-list update has to happen in process context.

Signed-off-by: NeilBrown <neilb@suse.de>

2011-07-28  md/raid10: avoid reading from known bad blocks - part 1  (NeilBrown)

This patch just covers the basic read path:

1/ read_balance needs to check for badblocks, and return not only the chosen slot but also how many good blocks are available there.

2/ read submission must be ready to issue multiple reads to different devices, as different bad blocks on different devices could mean that a single large read cannot be served by any one device, but can still be served by the array. This requires keeping a count of the number of outstanding requests per bio. This count is stored in 'bi_phys_segments'.

On read error we currently just fail the request if another target cannot handle the whole request. The next patch refines that a bit.

Signed-off-by: NeilBrown <neilb@suse.de>
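
A condensed sketch of the split-read idea from the make_request read path (names follow raid10.c of that era; allocation and retry logic omitted):

    /* read_balance() now also reports how many sectors the chosen
     * device can serve before its first known bad block. */
    disk = read_balance(conf, r10_bio, &max_sectors);
    if (max_sectors < r10_bio->sectors) {
            /* Serve only the good prefix from this device ... */
            r10_bio->sectors = max_sectors;
            /* ... and count the follow-up read that will cover the
             * remainder, so completion waits for both pieces. */
            r10_bio->master_bio->bi_phys_segments++;
    }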

2011-07-27  md/raid10: Make use of new recovery_disabled handling  (NeilBrown)

When we get a read error during recovery, RAID10 previously arranged for the recovering device to appear to fail, so that the recovery stops and doesn't restart. This is misleading and wrong.

Instead, make use of the new recovery_disabled handling and mark the target device as having recovery disabled.

Add appropriate checks in add_disk and remove_disk so that devices are removed and not re-added when recovery is disabled.

Signed-off-by: NeilBrown <neilb@suse.de>
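
A condensed sketch of the add_disk guard; the comparison is the pattern used in raid10's add_disk, everything else is elided:

    static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
    {
            struct r10conf *conf = mddev->private;

            /* Recovery was disabled for this array generation:
             * refuse to re-add, so recovery cannot restart and
             * hit the same read error again. */
            if (mddev->recovery_disabled == conf->recovery_disabled)
                    return -EBUSY;
            /* ... normal spare-slot search and binding ... */
            return 0;
    }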

2011-03-31  Fix common misspellings  (Lucas De Marchi)

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>

2010-06-24  md: fix handling of array level takeover that re-arranges devices.  (NeilBrown)

Most array level changes leave the list of devices largely unchanged, possibly causing one at the end to become redundant. However, conversions between RAID0 and RAID10 need to renumber all devices (except 0).

This renumbering is currently being done in the ->run method when the new personality takes over. However, this is too late, as the common code in md.c might already have invalidated some of the devices if they had a ->raid_disk number that appeared too high. Moving it into the ->takeover method is too early, as the array is still active at that time and wrong ->raid_disk numbers could cause confusion.

So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate the new raid_disk number. Now the common code knows exactly which devices need to be renumbered and which can be invalidated, and can do it all at a convenient time when the array is suspended. It can also update some symlinks in sysfs which previously were not being updated correctly.

Reported-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
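
A condensed sketch of the handover (the ->new_raid_disk field is the real addition; the doubling rule shown for the RAID0->RAID10 case is illustrative):

    /* In the takeover: propose new slots, e.g. spread RAID0 members
     * over alternating RAID10 slots, but do not touch raid_disk yet. */
    list_for_each_entry(rdev, &mddev->disks, same_set)
            if (rdev->raid_disk >= 0)
                    rdev->new_raid_disk = rdev->raid_disk * 2;

    /* Later, in md.c, with the array safely suspended, each rdev whose
     * raid_disk differs from new_raid_disk is renumbered and its sysfs
     * 'rd%d' symlink refreshed. */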

2010-05-18  md: Add support for Raid0->Raid10 takeover  (Maciej Trela)

Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

2009-06-16  md: remove mddev_to_conf "helper" macro  (NeilBrown)

Having a macro just to cast a void* isn't really helpful. I would much rather see that we are simply dereferencing ->private than have to know what the macro does. So open-code the macro everywhere and remove the pointless cast.

Signed-off-by: NeilBrown <neilb@suse.de>
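
The macro and its open-coded replacement (the #define is quoted from the era's raid10.h; the call site is illustrative):

    /* Before: a macro hiding a cast of ->private */
    #define mddev_to_conf(mddev) ((conf_t *) mddev->private)
    conf_t *conf = mddev_to_conf(mddev);

    /* After: just dereference ->private directly */
    conf_t *conf = mddev->private;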

2009-03-31  md: move lots of #include lines out of .h files and into .c  (NeilBrown)

This makes the includes more explicit, and is preparation for moving md_k.h to drivers/md/md.h.

Remove include/raid/md.h, as its only remaining use was to #include other files.

Signed-off-by: NeilBrown <neilb@suse.de>

2009-03-31  md: move headers out of include/linux/raid/  (Christoph Hellwig)

Move the headers with the local structures for the disciplines and bitmap.h into drivers/md/ so that they are more easily grepable for hacking and not far away. md.h is left where it is for now, as there are some uses from the outside.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>