What happens during disk replacement?
Posted: Wed Jan 18, 2012 7:34 am
We have a T800 running v2.3.1, loaded with ~500 1-2TB SATA drives. Four clients (a clustered NetApp v3140, a Solaris x86 box, and a Win2008 box).
Does anyone else experience hiccups when a tech replaces a disk?
We lose disks of course: ~1-3 / month. On three occasions over the last two years, while the HP tech is replacing the failed drive in the T800 (Cobalt), one of the NetApp heads (same NetApp head each time: Tungsten-A) panics and hands its services to its partner (Tungsten-B). [Regrettably, that process doesn't work cleanly -- some services require manual intervention before they start on Tungsten-B ... ouch.]
Tungsten-B doesn't notice the event (other than receiving services from Tungsten-A). The Solaris and Windows boxes don't notice the event -- nothing in their logs.
A few tidbits from syslog (I'm leaving most lines out of course):
[I think this is where the HP tech pulls the magazine]
Jan 10 12:13:41 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Degraded (Offloop_Req_Via_Admin_Interface)
Jan 10 12:13:42 cobalt lesb_error sw_port:0:0:3 FC LESB Error Port ID [0:0:3] Counters: (Invalid transmission word) ALPAs: a3, e0, a9
Jan 10 12:13:45 cobalt comp_state_change hw_cage_sled:17:2:0,sw_pd:225 Magazine 17:2:0, Physical Disk 225 Degraded (Notready, Not Available For Allocations, Missing A Port, Missing B Port, Sysmgr Spundown)
Jan 10 12:13:45 cobalt comp_state_change hw_cage_sled:17:2:3,sw_pd:228 Magazine 17:2:3, Physical Disk 228 Failed (Invalid Media, Smart Threshold Exceeded, Not Available For Allocations, Missing A Port, Missing B Port, Sysmgr Spundown)
Jan 10 12:13:45 cobalt disk_state_change sw_pd:228 pd 228 wwn 5000C500196CF4C4 changed state from valid to missing because disk gone event was received for this disk.
Jan 10 12:17:54 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Failed (Missing)
Jan 10 12:18:25 cobalt cli_cmd_err sw_cli {3parsvc super all {{0 8}} -1 140.107.42.192 20329} {Command: servicemag start -dryrun -mag 17:2 Error: } {}
[I think the tech has replaced the disk and has reinserted the magazine.]
Jan 10 12:21:15 cobalt comp_state_change sw_port:0:0:3 Port 0:0:3 Normal (Online)
Jan 10 12:21:15 cobalt comp_state_change sw_port:1:0:3 Port 1:0:3 Normal (Online)
Jan 10 12:21:20 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Normal
[Various Tungsten clients start complaining]
Jan 10 12:31:10 hamster-1 MSSQL$SPS: 833: SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on ...
Jan 10 12:31:18 tungsten-a-svif1 [echodata@tungsten-a: iscsi.notice:notice]: ISCSI: Initiator (iqn.1991-05.com.microsoft:csssql1.[...]) sent LUN Reset request, aborting all SCSI commands on lun 0
[I don't understand this section.]
Jan 10 12:32:00 cobalt dskabrt hw_disk:5000C5001987AC14;sw_pd:231 pd 231 port b0 on 1:0:3: scsi abort/sick/hwerr status TE_NORESPONSE
Jan 10 12:32:00 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Degraded (Errors on B Port)
Jan 10 12:32:12 cobalt dskabrt hw_disk:5000C5001987AC14;sw_pd:231 pd 231 port a0 on 0:0:3: scsi abort/sick/hwerr status TE_ABORTED
Jan 10 12:32:32 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Failed (No Valid Ports, Errors on A Port, Errors on B Port)
Jan 10 12:32:32 cobalt dskfail sw_pd:231 pd 231 failure: drive has no valid ports All used chunklets on this disk will be relocated.
[Tungsten-A gives up]
Jan 10 12:33:17 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
Jan 10 12:34:41 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
[More Cobalt error messages]
Jan 10 12:35:01 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Failed (Invalid Media, Smart Threshold Exceeded, No Valid Ports, Errors on A Port, Errors on B Port)
Jan 10 12:35:01 cobalt dskfail sw_pd:231 pd 231 failure: drive SMART threshold exceeded Internal reason: Smart code 0x00 : Unknown SMART code. All used chunklets on this disk will be relocated.
[More Tungsten clients complaining]
Jan 10 12:52:47 hamster-2 iScsiPrt: 63: Can not Reset the Target or LUN. Will attempt session recovery.
[Tungsten is deep into its failover procedure ... dang this failover takes a while]
Jan 10 12:46:12 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
Jan 10 12:52:36 tungsten-b-svif1 [tungsten-b: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-a enabled
==> Does anyone else notice hiccups during disk replacement?
==> Suggestions on where to look to better understand what happened?
==> Suggestions for monitoring we could put in place, to capture more data during the next event?
==> Pointers to URLs to read on interpreting T800 log messages
==> I'm building a diagram of which ports on the clients plug into which ports on the T800, including how each port is configured. Suggestions on what parameters to include in the diagram?
--sk
Stuart Kendrick
FHCRC
Does anyone else experience hiccups when a tech replaces a disk?
We lose disks of course: ~1-3 / month. On three occasions over the last two years, while the HP tech is replacing the failed drive in the T800 (Cobalt), one of the NetApp heads (same NetApp head each time: Tungsten-A) panics and hands its services to its partner (Tungsten-B). [Regrettably, that process doesn't work cleanly -- some services require manual intervention before they start on Tungsten-B ... ouch.]
Tungsten-B doesn't notice the event (other than receiving services from Tungsten-A). The Solaris and Windows boxes don't notice the event -- nothing in their logs.
A few tidbits from syslog (I'm leaving most lines out of course):
[I think this is where the HP tech pulls the magazine]
Jan 10 12:13:41 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Degraded (Offloop_Req_Via_Admin_Interface)
Jan 10 12:13:42 cobalt lesb_error sw_port:0:0:3 FC LESB Error Port ID [0:0:3] Counters: (Invalid transmission word) ALPAs: a3, e0, a9
Jan 10 12:13:45 cobalt comp_state_change hw_cage_sled:17:2:0,sw_pd:225 Magazine 17:2:0, Physical Disk 225 Degraded (Notready, Not Available For Allocations, Missing A Port, Missing B Port, Sysmgr Spundown)
Jan 10 12:13:45 cobalt comp_state_change hw_cage_sled:17:2:3,sw_pd:228 Magazine 17:2:3, Physical Disk 228 Failed (Invalid Media, Smart Threshold Exceeded, Not Available For Allocations, Missing A Port, Missing B Port, Sysmgr Spundown)
Jan 10 12:13:45 cobalt disk_state_change sw_pd:228 pd 228 wwn 5000C500196CF4C4 changed state from valid to missing because disk gone event was received for this disk.
Jan 10 12:17:54 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Failed (Missing)
Jan 10 12:18:25 cobalt cli_cmd_err sw_cli {3parsvc super all {{0 8}} -1 140.107.42.192 20329} {Command: servicemag start -dryrun -mag 17:2 Error: } {}
[I think the tech has replaced the disk and has reinserted the magazine.]
Jan 10 12:21:15 cobalt comp_state_change sw_port:0:0:3 Port 0:0:3 Normal (Online)
Jan 10 12:21:15 cobalt comp_state_change sw_port:1:0:3 Port 1:0:3 Normal (Online)
Jan 10 12:21:20 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Normal
[Various Tungsten clients start complaining]
Jan 10 12:31:10 hamster-1 MSSQL$SPS: 833: SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on ...
Jan 10 12:31:18 tungsten-a-svif1 [echodata@tungsten-a: iscsi.notice:notice]: ISCSI: Initiator (iqn.1991-05.com.microsoft:csssql1.[...]) sent LUN Reset request, aborting all SCSI commands on lun 0
[I don't understand this section.]
Jan 10 12:32:00 cobalt dskabrt hw_disk:5000C5001987AC14;sw_pd:231 pd 231 port b0 on 1:0:3: scsi abort/sick/hwerr status TE_NORESPONSE
Jan 10 12:32:00 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Degraded (Errors on B Port)
Jan 10 12:32:12 cobalt dskabrt hw_disk:5000C5001987AC14;sw_pd:231 pd 231 port a0 on 0:0:3: scsi abort/sick/hwerr status TE_ABORTED
Jan 10 12:32:32 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Failed (No Valid Ports, Errors on A Port, Errors on B Port)
Jan 10 12:32:32 cobalt dskfail sw_pd:231 pd 231 failure: drive has no valid ports All used chunklets on this disk will be relocated.
[Tungsten-A gives up]
Jan 10 12:33:17 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
Jan 10 12:34:41 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
[More Cobalt error messages]
Jan 10 12:35:01 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Failed (Invalid Media, Smart Threshold Exceeded, No Valid Ports, Errors on A Port, Errors on B Port)
Jan 10 12:35:01 cobalt dskfail sw_pd:231 pd 231 failure: drive SMART threshold exceeded Internal reason: Smart code 0x00 : Unknown SMART code. All used chunklets on this disk will be relocated.
[More Tungsten clients complaining]
Jan 10 12:52:47 hamster-2 iScsiPrt: 63: Can not Reset the Target or LUN. Will attempt session recovery.
[Tungsten is deep into its failover procedure ... dang this failover takes a while]
Jan 10 12:46:12 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
Jan 10 12:52:36 tungsten-b-svif1 [tungsten-b: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-a enabled
==> Does anyone else notice hiccups during disk replacement?
==> Suggestions on where to look to better understand what happened?
==> Suggestions for monitoring we could put in place, to capture more data during the next event?
==> Pointers to URLs to read on interpreting T800 log messages
==> I'm building a diagram of which ports on the clients plug into which ports on the T800, including how each port is configured. Suggestions on what parameters to include in the diagram?
--sk
Stuart Kendrick
FHCRC