Anyone know how to test rebuild times on RAID5 900GB FC drives? I know this is hugely dependent on the amount of data written to the drive and the number of drives participating in the rebuild, but is there any way to estimate or calculate this kind of thing?
Basically, my proposal to use RAID5 has been questioned by cautious stakeholders, and they want to use RAID6 to head off the possibility of multiple near-simultaneous drive failures. The explanations of fast RAID rebuilds are proving persuasive, but they want cold, hard figures from HP. I've contacted our rep, but wondered if anybody on here could suggest anything.
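For a first-order estimate while waiting on hard figures, a simple capacity-over-throughput calculation is a starting point. This is only a sketch: the fill level, per-disk rate, and disk counts below are placeholder assumptions, not HP numbers.

```python
# Naive rebuild-time estimate: data to reconstruct divided by effective rebuild rate.
# The 55 MB/s rate and 80% fill level are placeholder assumptions, not vendor figures.

def estimate_rebuild_hours(drive_gb, pct_used=0.80, rebuild_mb_per_sec=55, participating_disks=1):
    """Very rough estimate; distributed (chunklet-based) RAID spreads the work
    across many disks, so the effective rate scales with participating_disks."""
    data_mb = drive_gb * 1000 * pct_used
    effective_rate = rebuild_mb_per_sec * participating_disks
    return data_mb / effective_rate / 3600

# 900 GB FC drive, traditional single-spare rebuild vs. a many-to-many relocation
print(estimate_rebuild_hours(900))                         # ~3.6 hours onto one spare disk
print(estimate_rebuild_hours(900, participating_disks=8))  # ~0.45 hours spread over 8 disks
```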
Also, anyone know of any performance comparisons between FC RAID5 3+1 & RAID6 4+2?
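As a rough frame of reference while waiting for vendor numbers, the textbook capacity and small-write overheads of the two layouts can be sketched like this; these are generic RAID figures, not 3PAR benchmarks.

```python
# Simple capacity/write-penalty comparison of the two layouts in question.
# These are textbook RAID numbers, not 3PAR-measured results.

layouts = {
    "RAID5 3+1": {"data": 3, "parity": 1},
    "RAID6 4+2": {"data": 4, "parity": 2},
}

for name, l in layouts.items():
    total = l["data"] + l["parity"]
    efficiency = l["data"] / total          # usable fraction of raw capacity
    write_penalty = 2 + 2 * l["parity"]     # back-end I/Os per small random write
    print(f"{name}: {efficiency:.0%} usable, {write_penalty} I/Os per random write")

# RAID5 3+1: 75% usable, 4 I/Os per random write
# RAID6 4+2: 67% usable, 6 I/Os per random write
```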
thanks
Quantifying Rebuild Times
- Richard Siemers
- Site Admin
- Posts: 1333
- Joined: Tue Aug 18, 2009 10:35 pm
- Location: Dallas, Texas
Re: Quantifying Rebuild Times
Here is an old one that compares R10 and the various R5s...
http://3parblog.typepad.com/.a/6a00e553 ... 970c-popup
And here is a comparative illustration, without specific numbers:
http://3parblog.typepad.com/.a/6a00e553 ... 970c-popup
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.
Re: Quantifying Rebuild Times
It's an incredibly frustrating thing, because people will invariably talk about "traditional" RAID5 and get their jimmies rustled over running 96 disks or whatever reliant on not seeing two failures at once, whereas RAID5 in the 3PAR world is more closely aligned to "5+0".
In an R5 3+1 configuration, one disk going offline only leaves the remaining three disks in that set vulnerable, and relocating the appropriate chunklets has been a matter of minutes in our experience.
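A minimal sketch of why that many-to-many relocation finishes so quickly, using made-up but plausible numbers (1 GB chunklets, a 48-disk pool, ~55 MB/s per disk):

```python
# Sketch of why chunklet-style relocation finishes quickly: every surviving disk
# both sources and receives a small slice of the failed disk's data, so the job is
# bounded by (failed data / number of disks) rather than by one spare's write speed.
# The numbers below (1 GB chunklets, 48-disk pool, 55 MB/s per disk) are illustrative assumptions.

chunklet_gb = 1           # 3PAR-style allocation unit
failed_chunklets = 250    # ~250 GB in use on the failed disk
pool_disks = 48           # surviving disks that can source/sink relocations
per_disk_mb_s = 55        # sustained rate each disk can contribute

total_mb = failed_chunklets * chunklet_gb * 1000
minutes = total_mb / (per_disk_mb_s * pool_disks) / 60
print(f"~{minutes:.0f} minutes to relocate {failed_chunklets} chunklets across {pool_disks} disks")
# ~2 minutes, versus ~75 minutes if a single spare disk had to absorb it all
```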
- Richard Siemers
- Site Admin
- Posts: 1333
- Joined: Tue Aug 18, 2009 10:35 pm
- Location: Dallas, Texas
Re: Quantifying Rebuild Times
Drive "failures" is my favorite topic, as it's my most common issue.
This is a 300 GB 15K FC drive that was approx. 83% allocated.
Aug 03 2014 04:24:53 CDT,pd 351 failure: hardware failed for I/O- Internal reason:- Sense key 0004 : Hardware error.- Asc/ascq 0032/0000 : No defect spare location available.- All used chunklets on this disk will be relocated.,Disk fail alert
Aug 03 2014 05:39:34 CDT,Magazine 8:7:3 Physical Disk 351 Failed (Vacated Invalid Media Failed Hardware),Component state change
So, 1h 15m for a "vacate" operation = approx. 52 MB/sec.
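For reference, the rate can be recomputed straight from the two timestamps above; the drive size and ~83% allocation come from the post, and the result lands in the same ballpark as the figure quoted (the exact number depends on the allocation assumption).

```python
# Recompute the vacate rate from the two log timestamps above.
# 300 GB drive at ~83% allocated is taken from the post.

from datetime import datetime

start = datetime.strptime("Aug 03 2014 04:24:53", "%b %d %Y %H:%M:%S")
end = datetime.strptime("Aug 03 2014 05:39:34", "%b %d %Y %H:%M:%S")
elapsed_s = (end - start).total_seconds()    # 4481 seconds (~1h 15m)

moved_mb = 300 * 1000 * 0.83                 # ~249,000 MB relocated
print(f"{moved_mb / elapsed_s:.0f} MB/sec over {elapsed_s / 60:.0f} minutes")
# ~56 MB/sec
```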
However, this is a vacate operation, not a rebuild from parity... hence, I don't think there is any exposure to double disk failure in THIS specific case. I "presume" what happened is that a block was written, then failed verification... since there were no more "defect spare" locations available, I assume that write was immediately serviced on a new disk, in a system "spare" chunklet, and then the remainder of the drive was vacated. My theory being that even though this drive "failed", its data was never subject to "double drive jeopardy".
Does anyone have a log sample of a true drive failure where data has to be rebuilt from parity?
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.
- bajorgensen
- Posts: 142
- Joined: Wed May 07, 2014 10:29 am
Re: Quantifying Rebuild Times
For what it's worth, I am not in the storage dept, but we do have thousands of drives in 3PAR systems. For all the cases I am aware of, there have only been predictive drive failures. The first time I discovered a 3PAR in a degraded state I was concerned, as we did not get an alert, but the 3PAR will vacate the drive slowly in the background (low priority?) and then fail the drive/alert HP. The process seems slow, but I guess a rebuild would have much higher priority.
Re: Quantifying Rebuild Times
bajorgensen wrote: For what it's worth, I am not in the storage dept, but we do have thousands of drives in 3PAR systems. For all the cases I am aware of, there have only been predictive drive failures. The first time I discovered a 3PAR in a degraded state I was concerned, as we did not get an alert, but the 3PAR will vacate the drive slowly in the background (low priority?) and then fail the drive/alert HP. The process seems slow, but I guess a rebuild would have much higher priority.
I'm pretty sure it's yes on low priority; 3PAR seems to do all background tasks at a 20% or less threshold (i.e. if I compare the back-end data movement from AO or multiple tuneld jobs to what I can push with a few svMotions, it's in the 15-20% range). I also only have predictive failures to go by as far as time to rebuild, which shows me that their monitoring is quite good, as we've had hard failures on all of our previous arrays.
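A quick sanity check on what a ~20% background throttle could mean for vacate time, using a guessed full-speed back-end rate rather than any 3PAR-published figure:

```python
# Rough estimate of how long a low-priority background vacate might take if the
# array throttles it to roughly 20% of available back-end throughput.
# The 250 MB/s full-speed rate and 750 GB of used capacity are guesses for illustration.

def vacate_hours(used_gb, full_rate_mb_s=250, background_fraction=0.20):
    """Time to drain a disk when the relocation runs at a fraction of full speed."""
    return used_gb * 1000 / (full_rate_mb_s * background_fraction) / 3600

print(f"{vacate_hours(750):.1f} h for 750 GB at ~20% of 250 MB/s")          # ~4.2 h
print(f"{vacate_hours(750, background_fraction=1.0):.1f} h at full speed")  # ~0.8 h
```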