Re: Physical Disk Failures
Posted: Tue Mar 25, 2014 9:32 am
by Richard Siemers
I was able to confirm that "Used Fail" chunklets is a good counter to watch until it reaches 0. Just had a 1 TB NL drive fail:
ESFWT800-1 cli% showpd -c 362
                                -------- Normal Chunklets -------- ----- Spare Chunklets ----
                                - Used -  -------- Unused -------- - Used -  ---- Unused ----
 Id CagePos Type State    Total  OK Fail  Free Uninit Unavail Fail  OK Fail  Free Uninit Fail
362 0:5:2   NL   failed    3724   0 1078     0   1046       0 1586   0    0     0      0   14
----------------------------------------------------------------------------------------------
  1 total                  3724   0 1078     0   1046       0 1586   0    0     0      0   14
That number of failed chunklets is slowly ticking down over time as they either move or rebuild from parity (I can't tell which).
"showpdch -sync" did not show anything.
"showpdch -mov" showed all the chunklets from the failed PD that had already been relocated, and 2 that were actively moving.
"showpdch 362" showed all the chunklets left on the drive, and the current 2 that were moving. This list is getting shorter and shorter, it only takes a short time per chunklet.
"showpdch -mov 362" shows just the 2 chunklets being moved off the failed drive.
What is interesting is that "showpd -c 362" shows all the remaining chunklets as "failed" and that number is shrinking over time... however, "showpdch 362" shows all the chunklets as "normal", even though it is clearly evacuating them to other disks 2 at a time.
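If anyone wants to keep an eye on this without sitting at the console, a simple watch loop over SSH does the job. Just a sketch using the same commands mentioned above; the user name, array host name and 5-minute interval are placeholders:

# bash, run from a management host; poll the failed PD's chunklet counters every 5 minutes
while true; do
    date
    ssh 3paradm@3par01 "showpd -c 362"      # the failed-chunklet counts should keep ticking down
    ssh 3paradm@3par01 "showpdch -mov 362"  # the 2 chunklets currently being moved off the PD
    sleep 300
done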
Re: Physical Disk Failures
Posted: Thu Mar 27, 2014 4:15 pm
by afidel
Hmm, only 2 chunklets at a time? That seems like a rather slow way to restore availability. I was led to believe that recovery operations were done on a many-to-many basis like XIV, but 2 concurrent chunklets sounds much closer to RIDs on the EVA.
Re: Physical Disk Failures
Posted: Thu Mar 27, 2014 11:43 pm
by Richard Siemers
This two-chunklets-at-a-time behavior *seems* to be a new feature since we upgraded from 2.3.1 to 3.1.1. With 2.3.1, rebuilds would go fast enough to trigger our IOPS/PD alerts every 5 minutes for about 30 minutes total... then the rebuild would complete.
I suspect there is more to it than that... I *think* these chunklets on the failed drive were still online/readable, so it may have chosen a low-priority move since availability was not impacted... I hope that's the case.
Would be nice to have some documentation of how drive errors are dealt with.
Re: Physical Disk Failures
Posted: Fri Mar 28, 2014 7:23 am
by afidel
Ah, that makes sense. If it sees the drive as online but degraded, it's logical to do a low-priority evacuation.
Re: Physical Disk Failures
Posted: Fri Aug 14, 2015 2:03 pm
by 3parlrn
What happens if someone pulls the wrong disk out and wants to put it back?
1. Does it move chunklets from the removed disk to other PDs?
2. How do you bring the PD back online after putting it back in?
3. How do you restore those chunklets to the disk that was pulled out?
Thanks for your recommendations and expert views
Re: Physical Disk Failures
Posted: Tue May 01, 2018 4:41 am
by AMINHETFIELD
Hello
I have a problem with an HP 3PAR 7200 with 900 GB FC HDDs.
One of the HDDs failed about 1 month ago; the failed HDD is PD 19, in cage position 0 19. I replaced it with the servicemag procedure and everything was OK.
After 1 day the new HDD was normal and the failed disk was gone, but the next day the new disk failed. I replaced the failed disk again and after 1 day everything was OK.
After 1 month the HDD in 0 19 failed again and I replaced it, but after 2 days the new HDD failed again.
cli% showpd
----Size(MB)---- ----Ports----
Id CagePos Type RPM State Total Free A B Cap(GB)
0 0:0:0 FC 10 normal 838656 146432 1:0:1* 0:0:1 900
1 0:1:0 FC 10 normal 838656 143360 1:0:1 0:0:1* 900
2 0:2:0 FC 10 normal 838656 585728 1:0:1* 0:0:1 900
3 0:3:0 FC 10 normal 838656 136192 1:0:1 0:0:1* 900
4 0:4:0 FC 10 normal 838656 147456 1:0:1* 0:0:1 900
5 0:5:0 FC 10 normal 838656 117760 1:0:1 0:0:1* 900
6 0:6:0 FC 10 normal 838656 148480 1:0:1* 0:0:1 900
7 0:7:0 FC 10 normal 838656 129024 1:0:1 0:0:1* 900
8 0:8:0 FC 10 normal 838656 148480 1:0:1* 0:0:1 900
9 0:9:0 FC 10 normal 838656 105472 1:0:1 0:0:1* 900
10 0:10:0 FC 10 normal 838656 0 1:0:1* 0:0:1 900
11 0:11:0 FC 10 normal 838656 0 1:0:1 0:0:1* 900
12 0:12:0 FC 10 normal 838656 0 1:0:1* 0:0:1 900
13 0:13:0 FC 10 normal 838656 0 1:0:1 0:0:1* 900
14 0:14:0 FC 10 normal 838656 0 1:0:1* 0:0:1 900
15 0:15:0 FC 10 normal 838656 1024 1:0:1 0:0:1* 900
16 0:16:0 FC 10 normal 838656 0 1:0:1* 0:0:1 900
17 0:17:0 FC 10 normal 838656 0 1:0:1 0:0:1* 900
18 0:18:0 FC 10 normal 838656 0 1:0:1* 0:0:1 900
19 0:19:0 FC 10 failed 838656 0 1:0:1 0:0:1* 900
20 0:21:0 FC 10 normal 838656 0 1:0:1 0:0:1* 900
21 0:22:0 FC 10 normal 838656 5120 1:0:1* 0:0:1 900
22 0:23:0 FC 10 normal 838656 2048 1:0:1 0:0:1* 900
23 0:20:0 FC 10 normal 838656 0 1:0:1* 0:0:1 900
cli% checkhealth
Checking alert
Checking cabling
Checking cage
Checking dar
Checking date
Checking ld
Checking license
Checking network
Checking node
Checking pd
Checking port
Checking rc
Checking snmp
Checking task
Checking vlun
Checking vv
Component ---------------Description--------------- Qty
Network Too few working admin network connections 1
PD PDs that are failed 1
cli% showcage
Id Name LoopA Pos.A LoopB Pos.B Drives Temp RevA RevB Model Side
0 cage0 1:0:1 0 0:0:1 0 24 26-30 320e 320e DCN1 n/a
cli% showversion
Release version 3.1.2 (MU2)
Patches: P10
Component Name Version
CLI Server 3.1.2 (MU2)
CLI Client 3.1.2 (MU2)
System Manager 3.1.2 (MU2)
Kernel 3.1.2 (MU2)
TPD Kernel Code 3.1.2 (MU2)
cli% servicemag start -pdid 19 -seucceeded
Expecting integer pdid, got: -succeeded
SAN.SER cli% servicemag start -pdid 19 -succeeded
Are you sure you want to run servicemag?
select q=quit y=yes n=no: y
servicemag start -pdid 19
... servicing disks in mag: 0 19
... normal disks:
... not normal disks: WWN [XXXXXXXXXXXXXXXX] Id [19] diskpos [0]
The servicemag start operation will continue in the background.
cli% showpd -space 19
-----------------(MB)------------------
Id CagePos Type -State- Size Volume Spare Free Unavail Failed
19 0:19:0 FC failed 838656 0 0 0 0 838656
---------------------------------------------------------------
1 total 838656 0 0 0 0 838656
SAN.SER cli% servicemag resume 0 19
Are you sure you want to run servicemag?
select q=quit y=yes n=no: y
servicemag status 0 19
The magazine is being brought online due to a servicemag resume.
The last status update was at Tue May 1 10:27:04 2018.
Chunklets relocated: 6 in 4 minutes and 45 seconds
Chunklets remaining: 2232
Chunklets marked for moving: 2232
Estimated time for relocation completion based on 47 seconds per chunklet is: 1 days, 5 hours, 8 minutes and 24 seconds
servicemag resume 0 19 -- is in Progress
cli% exit
Maybe the OS version is my problem?
Please help me with this problem.
Thank you.
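(Side note on the numbers in that servicemag output: the completion estimate is simply chunklets remaining multiplied by the quoted seconds per chunklet. A quick check with plain shell arithmetic, nothing 3PAR-specific:

echo $(( 2232 * 47 ))                 # 104904 seconds for the remaining chunklets
echo $(( 104904 / 86400 ))            # 1 day
echo $(( (104904 % 86400) / 3600 ))   # 5 hours
echo $(( (104904 % 3600) / 60 ))      # 8 minutes, with 24 seconds left over

which matches the "1 days, 5 hours, 8 minutes and 24 seconds" shown above.)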
Re: Physical Disk Failures
Posted: Tue May 01, 2018 5:56 am
by MammaGutt
Could be the OS.
Could also be the cage slot.
How are you getting your replacement drives? If they are from eBay or some third party, they may have been used and have SMART counters just waiting to fail the drive.
Re: Physical Disk Failures
Posted: Tue May 01, 2018 6:02 am
by AMINHETFIELD
Thank you for the reply.
I buy my HDDs from HP.
If the slot were the problem, the new HDD should fail right after I insert the disk into the slot,
but the HDD fails after the chunklet relocation has finished, and the HDD state stays normal for anywhere from 3 days to a month.
Re: Physical Disk Failures
Posted: Tue May 01, 2018 6:27 am
by ailean
The slot could be causing some intermittent errors that add up over time, eventually reaching a threshold that fails the disk.
Next time, check the slot for any debris or pin damage, just in case.
Maybe also run occasional showpd -e commands to see whether any error counters are climbing.
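Something as simple as a scheduled job from a management host would cover that; a minimal sketch, where the cron schedule, user, array host name and log path are all placeholders and showpd -e is the command mentioned above:

# hourly cron entry: append the PD error counters to a dated log so a climbing count on one slot stands out
0 * * * * ssh 3paradm@3par01 "showpd -e" >> /var/log/3par/showpd-e.$(date +\%F).log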
Re: Physical Disk Failures
Posted: Tue Sep 25, 2018 6:52 am
by sanjac
On one of my drives I discovered 3 GiB failed. When is it time for concern? At how many failed chunklets am I entitled to call support to have the drive replaced?