Age | Commit message | Author |
|
I no longer have access to the Panasas email.
So change to an email that can always reach me.
Signed-off-by: Boaz Harrosh <ooo@electrozaur.com>
|
|
This simple patch adds support for raid6 to the ORE.
Most operations and calculations were already written for the general
case. The only things left:
* call async_gen_syndrome() in the case of raid6
(NOTE that the raid6 math is the one supported by the Linux Kernel
see: crypto/async_tx/async_pq.c)
* call _ore_add_parity_unit() twice with only last call generating
the redundancy pages.
* Fix a couple of bugs in the old code:
a. In reads when parity==2 it can happen that per_dev->length=0
but per_dev->offset was set and adjusted by _ore_add_sg_seg().
Don't let it be overwritten.
b. The whole 'cur_comp > starting_dev' test, used to determine whether
"per_dev->offset is in the current stripe number or the
next one",
was a complete raid5/4 accident. When parity==2 it is usually
not true at all. All we need to do is increment si->ob_offset
once we pass the first parity device.
(This also greatly simplifies the code, amen)
c. Calculation of si->dev rotation can overflow when parity==2.
* Finally, enable raid6 in ore_verify_layout()
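As a rough illustration of the raid6 math mentioned above, the sketch below computes the two redundancy blocks (P and Q) in plain Python. It is not the async_gen_syndrome() interface. The function names are made up for illustration, and the sketch assumes the usual GF(2^8) field with polynomial 0x11d and generator 2.

```python
def gf_mul2(x):
    """Multiply a byte by 2 in GF(2^8), reducing by the polynomial 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
    return x & 0xFF


def gen_syndrome(data_blocks):
    """Return (P, Q) for a list of equally sized data blocks.

    P is the plain XOR of all blocks (the raid5 parity).
    Q is D0 ^ 2*D1 ^ 4*D2 ^ ..., evaluated with Horner's rule,
    starting from the highest-numbered data block.
    """
    length = len(data_blocks[0])
    p = bytearray(length)
    q = bytearray(length)
    for block in reversed(data_blocks):
        for i in range(length):
            q[i] = gf_mul2(q[i]) ^ block[i]
            p[i] ^= block[i]
    return bytes(p), bytes(q)


def recover_with_p(surviving_blocks, p):
    """Rebuild a single lost data block from P and the surviving blocks."""
    lost = bytearray(p)
    for block in surviving_blocks:
        for i in range(len(lost)):
            lost[i] ^= block[i]
    return bytes(lost)
```

For example, gen_syndrome([b"\x01", b"\x02", b"\x04"]) gives P = b"\x07" and Q = b"\x15". With raid5 (parity==1) only the P block is needed, which is why the second redundancy unit is the new part here.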
I want to deeply thank Daniel Gryniewicz, who first found all the
bugs in the old raid code and inspired these patches:
Inspired-by Daniel Gryniewicz <dang@linuxbox.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
|
|
Two cleanups:
* si->cur_comp, si->cur_pg were always calculated after
the call to ore_calc_stripe_info() with the help of
_dev_order(...). But these are already calculated by
ore_calc_stripe_info() and can just be set there.
(This is left over from the time that si->cur_comp, si->cur_pg
were only used by raid code, but now the main loop manages
them anyway, even though they are ultimately not used in
non-raid code)
* si->cur_comp - For the very last stripe case, it was set inside
_ore_add_parity_unit(). This is not clear and will be wrong
for the coming raid6, so move it to the only caller. Now si->cur_comp
is only manipulated within _prepare_for_striping(), always next
to the manipulation of cur_dev,
which is much easier to understand and follow.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
|
|
This is finally the RAID5 Write support.
The bigger part of this patch is not the XOR engine itself, but the
read4write logic, which is a complete mini prepare_for_striping
reading engine that can read scattered pages of a stripe into cache
so they can be used for the XOR calculation. It is needed whenever the
write is not stripe aligned.
The main algorithm behind the XOR engine is the 2 dimensional array:
struct __stripe_pages_2d.
A drawing might save 1000 words
---
__stripe_pages_2d
       |
 n = pages_in_stripe_unit;
 w = group_width - parity;
       |                pages array presented to the XOR lib
       |                                |
       V                                |
 __1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---|
       |                                                |
 __1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <---
       |
...    |                         ...
       |
 __1_page_stripe[n].pages --> [c0][c1]..[cw][c_par]
                               ^
                               |
           data added columns first then row
---
The pages are put on this array columns first, i.e.:
p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ...
So we are doing a corner turn of the pages.
Note that pages will zigzag down and left, but are put sequentially
in growing order. So when the time comes to XOR the stripe, only the
beginning and end of the array need be checked. We scan the array
and any NULL spot will be filled by pages-to-be-read.
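The corner turn described above can be sketched in a few lines of Python. This is only an illustration of the indexing, not the struct __stripe_pages_2d code. The function names are hypothetical, pages are modelled as bytes objects, and n and w have the meaning given in the drawing.

```python
def place_pages(pages, first_index, n, w):
    """Place sequential pages into an n-row by w-column array, columns first.

    first_index is the position of the first page within the stripe.
    rows[p][c] ends up holding page p of column (device) c.
    """
    rows = [[None] * w for _ in range(n)]
    for k, page in enumerate(pages):
        c, p = divmod(first_index + k, n)  # column advances every n pages
        rows[p][c] = page
    return rows


def find_holes(rows):
    """Return (row, column) of every empty spot, i.e. pages to be read."""
    return [(p, c)
            for p, row in enumerate(rows)
            for c, page in enumerate(row)
            if page is None]


def xor_row(row):
    """XOR the data pages of one fully populated row into a parity page."""
    parity = bytearray(len(row[0]))
    for page in row:
        for i, byte in enumerate(page):
            parity[i] ^= byte
    return bytes(parity)
```

With n=2, w=3 and three pages starting at index 1, the holes are (0, 0), (0, 2) and (1, 2): one at the beginning of the array and two at the end, matching the observation that only the edges need checking.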
The FS that wants to support RAID5 needs to supply an
operations-vector that searches for a given page in cache, and specifies
if the page is uptodate or needs reading. All these pages to be read
are put on a slave ore_io_state and synchronously read. All the pages
of a stripe are read in one IO, using the scatter gather mechanism.
In write we constrain our IO to only be incomplete on a single
stripe. Meaning either the complete IO is within a single stripe, so
we might have pages to read at both the beginning and the end of the
stripe, or we have some reading to do at the beginning but end at a stripe
boundary. The leftover pages are pushed to the next IO by the API
already established by previous work, where an IO offset/length
combination presented to the ORE might get the length truncated and
the user must re-submit the leftover pages. (Both exofs and NFS
support this)
But any ORE user should make its best effort to align its IO
beforehand and avoid complications. A cached ore_layout->stripe_size
member can be used for that calculation. (NOTE that ORE demands
that stripe_size not be bigger than 32 bits)
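A minimal sketch of the truncation rule described above, assuming the IO is cut back to the last stripe boundary it crosses. The helper name and the returned pair are hypothetical and do not correspond to any ORE function.

```python
def truncate_to_stripe(offset, length, stripe_size):
    """Split a write into the part handled now and the leftover to re-submit.

    Returns (done_length, leftover_length). Assumes length > 0.
    """
    end = offset + length
    first_stripe = offset // stripe_size
    last_stripe = (end - 1) // stripe_size
    if first_stripe == last_stripe:
        # Whole IO sits inside one stripe: may need reads at both edges.
        return length, 0
    # IO spans stripes: end it at the last stripe boundary at or before 'end'.
    aligned_end = (end // stripe_size) * stripe_size
    done = aligned_end - offset
    return done, length - done
```

With stripe_size=100, a write at offset 30 of length 250 is handled as 170 bytes now and 80 bytes left over, while a write at offset 0 of length 200 is already aligned and goes through whole. A caller that aligns its IO to stripe_size beforehand never sees a leftover.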
What else? Well read it and tell me.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
|
|
This patch introduces the first stage of RAID5 support,
mainly the skip-over-raid-units when reading. For
writes it inserts BLANK units where the XOR blocks
should be calculated and written.
It introduces the new "general raid maths", and the main
additional parameters and components needed for raid5.
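To give a feel for what skipping over raid units means, here is a generic rotating-parity sketch in Python. It is not the ORE formula. The rotation rule (parity starts on the last devices and moves left by the parity count on every stripe) and the function name are assumptions made purely for illustration.

```python
def stripe_devices(stripe_no, group_width, parity):
    """Return (data_devices, parity_devices) for one stripe.

    Assumed rule: parity units start on the last 'parity' devices and
    rotate left by 'parity' devices for every following stripe.
    """
    rotation = (stripe_no * parity) % group_width
    first_par = (group_width - parity - rotation) % group_width
    par = [(first_par + i) % group_width for i in range(parity)]
    data = [(first_par + parity + i) % group_width
            for i in range(group_width - parity)]
    return data, par
```

A reader walks only the data list and so skips the parity units. A writer would place BLANK units on the devices in the parity list. For group_width=4 and parity=1, stripe 0 yields data [0, 1, 2] with parity on device 3, and stripe 1 yields data [3, 0, 1] with parity on device 2.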
Since at this stage it could corrupt future versions that
actually do support raid5, the enablement of raid5
mounting and setting of parity-count > 0 is disabled. So
the raid5 code will never be used. Mounting of raid5 is
only enabled later, once the basic XOR write is also in.
But if the patch "enable RAID5" is applied, this code has
been tested to properly read raid5 volumes
and conforms to the standard.
Also it has been tested that the new maths still properly
supports RAID0 and grouping code just as before.
(BTW: I have found more bugs in the pnfs-obj RAID math
fixed here)
The ore.c file is getting too big, so new ore_raid.[hc]
files are added that will include the special raid stuff
that is not used in striping and mirrors. With future write
support these will get bigger.
When adding ore_raid.c to the Kbuild file I was forced to
rename ore.ko to libore.ko. Is it possible to keep the source
file, say ore.c, and the module file ore.ko named the same even if there
are multiple files inside ore.ko?
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
|