HPE Storage Users Group https://3parug.net/ |
|
Physical Disk Failures https://3parug.net/viewtopic.php?f=18&t=582 |
Page 2 of 3 |
Author: | Richard Siemers [ Tue Mar 25, 2014 9:32 am ] |
Post subject: | Re: Physical Disk Failures |
I was able to confirm that "Used Fail" chunklets is a good one to watch as it counts down to 0. Just had a 1 TB NL drive fail:

Code:
ESFWT800-1 cli% showpd -c 362
                        -------- Normal Chunklets -------- ---- Spare Chunklets ----
                        - Used -  -------- Unused --------  - Used - ---- Unused ----
 Id CagePos Type State  Total OK Fail Free Uninit Unavail Fail OK Fail Free Uninit Fail
362 0:5:2   NL   failed  3724  0 1078    0   1046       0 1586  0    0    0      0   14
-----------------------------------------------------------------------------------------
  1 total                3724  0 1078    0   1046       0 1586  0    0    0      0   14

That number of failed chunklets is slowly ticking down over time as they (and I can't tell which) move or rebuild from parity.

"showpdch -sync" did not show anything.
"showpdch -mov" showed all the chunklets from the failed PD that had already been relocated, plus 2 that were actively moving.
"showpdch 362" showed all the chunklets left on the drive, plus the current 2 that were moving. This list is getting shorter and shorter; it only takes a short time per chunklet.
"showpdch -mov 362" shows just the 2 chunklets being moved off the failed drive.

What is interesting is that "showpd -c 362" shows all the remaining chunklets as "failed", and that number is shrinking over time... however, "showpdch 362" shows all the chunklets as "normal", but it's clearly evacuating them to other disks 2 at a time. |
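A minimal sketch of turning that advice into a check: parse a data row of "showpd -c" and sum its Fail columns, alerting until they reach 0. The column order is taken from the output pasted above; other 3PAR OS releases may lay the table out differently, so treat the field positions as an assumption.

```python
# Sample data row copied from the "showpd -c 362" output above.
SAMPLE = "362 0:5:2 NL failed 3724 0 1078 0 1046 0 1586 0 0 0 0 14"

def fail_counts(pd_line: str) -> tuple[int, int]:
    """Return (used_fail, all_fail) for one data row of `showpd -c`.

    Column order assumed (after Id, CagePos, Type, State):
    Total, Used OK, Used Fail, Free, Uninit, Unavail, Unused Fail,
    Spare OK, Spare Fail, Spare Free, Spare Uninit, Spare Unused Fail.
    """
    (total, used_ok, used_fail, free, uninit, unavail, unused_fail,
     sp_ok, sp_fail, sp_free, sp_uninit, sp_unused_fail) = \
        [int(x) for x in pd_line.split()[4:]]
    return used_fail, used_fail + unused_fail + sp_fail + sp_unused_fail

used_fail, all_fail = fail_counts(SAMPLE)
print(used_fail, all_fail)  # 1078 "Used Fail", 2678 failed chunklets in total
```

Feeding successive snapshots through this and watching `used_fail` trend to 0 matches what the post describes observing by hand.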
Author: | afidel [ Thu Mar 27, 2014 4:15 pm ] |
Post subject: | Re: Physical Disk Failures |
Hmm, only 2 chunklets at a time? That seems like a rather slow way to restore availability. I was led to believe that recovery operations were done on a many-to-many basis like XIV, but 2 concurrent chunklets sounds much closer to RID's on EVA. |
Author: | Richard Siemers [ Thu Mar 27, 2014 11:43 pm ] |
Post subject: | Re: Physical Disk Failures |
This 2-chunklets-moving-at-a-time behavior *seems* to be a new feature since we upgraded from 2.3.1 to 3.1.1. With 2.3.1, rebuilds would go fast enough to trigger our IOPS/PD alerts every 5 minutes for about 30 minutes total... then the rebuild would complete. I suspect there is more to it than that... I *think* the chunklets on the failed drive were still online/readable, so it may have chosen a low-priority move since availability was not impacted... I hope that's the case. It would be nice to have some documentation of how drive errors are dealt with. |
Author: | afidel [ Fri Mar 28, 2014 7:23 am ] |
Post subject: | Re: Physical Disk Failures |
Ah, that makes sense; if it sees the drive as online but degraded, it's logical to do a low-priority evacuation. |
Author: | 3parlrn [ Fri Aug 14, 2015 2:03 pm ] |
Post subject: | Re: Physical Disk Failures |
What happens if someone pulls the wrong disk out and wants to put it back?
1. Does it move chunklets from the removed disk to other PDs?
2. How do you bring the PD back online after putting it back in?
3. How do you restore those chunklets back to the disk that was pulled out?
Thanks for your recommendations and expert views. |
Author: | AMINHETFIELD [ Tue May 01, 2018 4:41 am ] |
Post subject: | Re: Physical Disk Failures |
Hello, I have a problem with an HP 3PAR 7200 with 900GB FC HDDs. One of the HDDs failed about 1 month ago; the pdid of my HDD is 0 19. I replaced it with the servicemag procedure and everything was OK. After 1 day the new HDD was normal and the failed disk was gone, but the next day the new disk failed. I replaced the failed disk again and after 1 day everything was OK. After 1 month the HDD in 0 19 failed again; I replaced it, but after 2 days the new HDD failed again.

Code:
cli% showpd
                         ----Size(MB)----  -----Ports-----
 Id CagePos Type RPM State    Total   Free      A       B Cap(GB)
  0 0:0:0   FC    10 normal  838656 146432 1:0:1*  0:0:1      900
  1 0:1:0   FC    10 normal  838656 143360 1:0:1  0:0:1*      900
  2 0:2:0   FC    10 normal  838656 585728 1:0:1*  0:0:1      900
  3 0:3:0   FC    10 normal  838656 136192 1:0:1  0:0:1*      900
  4 0:4:0   FC    10 normal  838656 147456 1:0:1*  0:0:1      900
  5 0:5:0   FC    10 normal  838656 117760 1:0:1  0:0:1*      900
  6 0:6:0   FC    10 normal  838656 148480 1:0:1*  0:0:1      900
  7 0:7:0   FC    10 normal  838656 129024 1:0:1  0:0:1*      900
  8 0:8:0   FC    10 normal  838656 148480 1:0:1*  0:0:1      900
  9 0:9:0   FC    10 normal  838656 105472 1:0:1  0:0:1*      900
 10 0:10:0  FC    10 normal  838656      0 1:0:1*  0:0:1      900
 11 0:11:0  FC    10 normal  838656      0 1:0:1  0:0:1*      900
 12 0:12:0  FC    10 normal  838656      0 1:0:1*  0:0:1      900
 13 0:13:0  FC    10 normal  838656      0 1:0:1  0:0:1*      900
 14 0:14:0  FC    10 normal  838656      0 1:0:1*  0:0:1      900
 15 0:15:0  FC    10 normal  838656   1024 1:0:1  0:0:1*      900
 16 0:16:0  FC    10 normal  838656      0 1:0:1*  0:0:1      900
 17 0:17:0  FC    10 normal  838656      0 1:0:1  0:0:1*      900
 18 0:18:0  FC    10 normal  838656      0 1:0:1*  0:0:1      900
 19 0:19:0  FC    10 failed  838656      0 1:0:1  0:0:1*      900
 20 0:21:0  FC    10 normal  838656      0 1:0:1  0:0:1*      900
 21 0:22:0  FC    10 normal  838656   5120 1:0:1*  0:0:1      900
 22 0:23:0  FC    10 normal  838656   2048 1:0:1  0:0:1*      900
 23 0:20:0  FC    10 normal  838656      0 1:0:1*  0:0:1      900

cli% checkhealth
Checking alert
Checking cabling
Checking cage
Checking dar
Checking date
Checking ld
Checking license
Checking network
Checking node
Checking pd
Checking port
Checking rc
Checking snmp
Checking task
Checking vlun
Checking vv
Component ---------------Description--------------- Qty
Network   Too few working admin network connections   1
PD        PDs that are failed                         1

cli% showcage
Id Name  LoopA Pos.A LoopB Pos.B Drives  Temp RevA RevB Model Side
 0 cage0 1:0:1     0 0:0:1     0     24 26-30 320e 320e DCN1   n/a

cli% showversion
Release version 3.1.2 (MU2)
Patches: P10

Component Name   Version
CLI Server       3.1.2 (MU2)
CLI Client       3.1.2 (MU2)
System Manager   3.1.2 (MU2)
Kernel           3.1.2 (MU2)
TPD Kernel Code  3.1.2 (MU2)

cli% servicemag start -pdid 19 -seucceeded
Expecting integer pdid, got: -succeeded

SAN.SER cli% servicemag start -pdid 19 -succeeded
Are you sure you want to run servicemag?
select q=quit y=yes n=no: y
servicemag start -pdid 19
... servicing disks in mag: 0 19
... normal disks:
... not normal disks:
WWN [XXXXXXXXXXXXXXXX] Id [19] diskpos [0]
The servicemag start operation will continue in the background.

cli% showpd -space 19
                        -----------------(MB)------------------
 Id CagePos Type -State-   Size Volume Spare Free Unavail Failed
 19 0:19:0  FC   failed  838656      0     0    0       0 838656
---------------------------------------------------------------
  1 total                838656      0     0    0       0 838656

SAN.SER cli% servicemag resume 0 19
Are you sure you want to run servicemag?
select q=quit y=yes n=no: y
servicemag status 0 19
The magazine is being brought online due to a servicemag resume.
The last status update was at Tue May 1 10:27:04 2018.
Chunklets relocated: 6 in 4 minutes and 45 seconds
Chunklets remaining: 2232
Chunklets marked for moving: 2232
Estimated time for relocation completion based on 47 seconds per chunklet is: 1 days, 5 hours, 8 minutes and 24 seconds
servicemag resume 0 19 -- is in Progress
cli% exit

Maybe the OS version is my problem? Please help me with this problem. Thank you. |
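As a sanity check, the relocation ETA the CLI prints is just the remaining chunklet count multiplied by the observed per-chunklet time, reformatted into days/hours/minutes/seconds. Reproducing the arithmetic from the transcript above:

```python
# Reproduce the servicemag ETA from the transcript:
# 2232 chunklets remaining at 47 seconds per chunklet.
remaining = 2232
secs_per_chunklet = 47
total = remaining * secs_per_chunklet            # 104904 seconds

days, rest = divmod(total, 86400)                # 86400 s per day
hours, rest = divmod(rest, 3600)
minutes, seconds = divmod(rest, 60)
print(f"{days} days, {hours} hours, {minutes} minutes and {seconds} seconds")
# -> 1 days, 5 hours, 8 minutes and 24 seconds
```

Which matches the "1 days, 5 hours, 8 minutes and 24 seconds" estimate in the servicemag output exactly.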
Author: | MammaGutt [ Tue May 01, 2018 5:56 am ] |
Post subject: | Re: Physical Disk Failures |
Could be the OS. Could also be the cage slot. How are you getting your replacement drives? If they are from eBay or some third party, they may have been used and have some SMART counters just waiting to fail the drive. |
Author: | AMINHETFIELD [ Tue May 01, 2018 6:02 am ] |
Post subject: | Re: Physical Disk Failures |
Thank you for the reply. I buy my HDDs from HP. If the slot were my problem, the new HDD would fail right after I insert the disk in the slot, but the HDD fails after chunklet relocation ends, and the HDD state stays normal for 3 days to 1 month first. |
Author: | ailean [ Tue May 01, 2018 6:27 am ] |
Post subject: | Re: Physical Disk Failures |
The slot could be causing some intermittent errors that add up over time, reaching a threshold that fails the disk. Next time, check the slot for any debris or pin damage just in case. Maybe run occasional "showpd -e" commands to see if any error counts are climbing. |
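One way to act on that advice, sketched below: capture per-PD error counters from "showpd -e" periodically and diff the snapshots; any counter that keeps climbing on the same slot points at the slot rather than the drive. The counter names here are placeholders, not the actual "showpd -e" column names, which vary by 3PAR OS release.

```python
# Hypothetical sketch: diff two snapshots of per-PD error counters
# taken at different times. Counter names are illustrative only.
def climbing_counters(old: dict, new: dict) -> dict:
    """Return {pd_id: {counter: increase}} for counters that grew."""
    deltas = {}
    for pd_id, counters in new.items():
        prev = old.get(pd_id, {})
        grew = {name: val - prev.get(name, 0)
                for name, val in counters.items()
                if val > prev.get(name, 0)}
        if grew:
            deltas[pd_id] = grew
    return deltas

snap_monday = {19: {"corrected": 3, "uncorrected": 0}}
snap_friday = {19: {"corrected": 9, "uncorrected": 1}}
print(climbing_counters(snap_monday, snap_friday))
# -> {19: {'corrected': 6, 'uncorrected': 1}}
```

If the same slot shows growth across several replacement drives, that strengthens the slot/backplane theory over bad disks.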
Author: | sanjac [ Tue Sep 25, 2018 6:52 am ] |
Post subject: | Re: Physical Disk Failures |
On one of my drives I discovered 3 GiB failed. When is it time for concern? At how many failed chunklets am I entitled to call support to have the drive replaced? |
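For a sense of scale, failed capacity converts directly to a failed chunklet count once you know the chunklet size; on 7000/10000-class arrays chunklets are 1 GiB (256 MiB on the older F-/T-class), so 3 GiB is only a handful of chunklets. A back-of-the-envelope conversion:

```python
# Rough conversion from failed capacity to failed chunklet count.
# Adjust CHUNKLET_MIB for your platform: 1024 MiB on 7000/10000-class,
# 256 MiB on the older F-/T-class arrays.
CHUNKLET_MIB = 1024
failed_mib = 3 * 1024              # "3 GiB failed" from the post above
print(failed_mib // CHUNKLET_MIB)  # -> 3 chunklets
```

A few failed chunklets can just be isolated media errors that sparing already handled; a steadily growing count is what usually justifies a support case.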
All times are UTC - 5 hours |
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group http://www.phpbb.com/ |