3Par 3.3.1 - Crashed and completed Cold Reboot
Posted: Thu Sep 07, 2017 1:02 am
Hi all,
We had a frightening experience on Tuesday night: our brand new (2 weeks into production) 3par 8400 running 3.3.1 MU4 crashed and then did a cold reboot on all four nodes. This caused the array to start integrity checking on the 25 TB of data, which was inaccessible for the entire 6-hour process.
HPE have told us that they haven't seen this before in 3.3.1 and have escalated it to their Engineering team to determine what went wrong and, hopefully for all of us, provide a fix to prevent it from happening again.
Logs:
Event ID: 918446 Node 0 Customer Alert - No, Service Alert - Yes
Severity: Critical
Event time: Tue Sep 05 19:46:14 2017
Event type: Configuration Lock Hold Time
Alert ID: null
Msg ID: e001c
Component: System Manager
Event string: lock hold seconds: 0, virtual volume lock count: 1, ioctl request count: 7, mcall active count: 12, mcall request waiting count: 0, mcall request blocked count: 0, mcalls (msec/name/pid): 984589336/MC_NEVER_RETURN/30015 338092/MCKV_OKV_QUERY/40619 550011/MCKV_OKV_QUERY/40535 154533/MCKV_OKV_QUERY/40578 520083/MCKV_OKV_QUERY/40623 429607/MCKV_OKV_QUERY/40625 161511/MCKV_OKV_QUERY/40644 373667/MCKV_OKV_QUERY/40645 245012/MCKV_OKV_QUERY/40651 64181/MCKV_OKV_QUERY/40744 53980/MCVL_REMOVE/40745 42161/MCVL_MAKE/40756.
Notification key: 0x00e001c
-->
Event ID: 919637 Node 0 Customer Alert - No, Service Alert - Yes
Severity: Critical
Event time: Tue Sep 05 20:05:01 2017
Event type: Process Event Handling Appears Unresponsive
Alert ID: null
Msg ID: 3f0003
Component: Node 0
Event string: sysmgr event handling appears to be unresponsive.
Notification key: 0x03f0003
-->
35 minutes later it completed an automatic Cold Reboot.
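For anyone wanting to pick apart the first event string above: the mcalls field is labelled msec/name/pid, so it looks like a list of in-flight calls with their elapsed time. Below is a minimal Python sketch that splits it up and sorts by elapsed time, assuming that reading of the label is right. The helper names parse_mcalls and longest_running are made up for illustration, and skipping MC_NEVER_RETURN is only a guess based on its name (it looks like a call that is expected to sit there permanently).

```python
import re

# mcalls portion of the "Configuration Lock Hold Time" event string,
# copied from the log above.
MCALLS = (
    "984589336/MC_NEVER_RETURN/30015 338092/MCKV_OKV_QUERY/40619 "
    "550011/MCKV_OKV_QUERY/40535 154533/MCKV_OKV_QUERY/40578 "
    "520083/MCKV_OKV_QUERY/40623 429607/MCKV_OKV_QUERY/40625 "
    "161511/MCKV_OKV_QUERY/40644 373667/MCKV_OKV_QUERY/40645 "
    "245012/MCKV_OKV_QUERY/40651 64181/MCKV_OKV_QUERY/40744 "
    "53980/MCVL_REMOVE/40745 42161/MCVL_MAKE/40756."
)


def parse_mcalls(text):
    """Return a list of (msec, name, pid) tuples from 'msec/name/pid' items."""
    pattern = re.compile(r"(\d+)/([A-Z_]+)/(\d+)")
    return [(int(ms), name, int(pid)) for ms, name, pid in pattern.findall(text)]


def longest_running(calls, skip=("MC_NEVER_RETURN",)):
    """Sort calls by elapsed msec, longest first.

    Names in `skip` are left out; excluding MC_NEVER_RETURN is an
    assumption based purely on the name.
    """
    kept = [c for c in calls if c[1] not in skip]
    return sorted(kept, key=lambda c: c[0], reverse=True)


if __name__ == "__main__":
    calls = parse_mcalls(MCALLS)
    print(f"{len(calls)} calls parsed")
    for msec, name, pid in longest_running(calls)[:3]:
        print(f"{name} (pid {pid}) outstanding for {msec / 1000:.1f} s")
```

The number of entries parsed this way lines up with the "mcall active count: 12" in the same event string, which is some reassurance that the split is sensible. If the msec reading is correct, several of the MCKV_OKV_QUERY calls had been outstanding for minutes at the time of the alert.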
Just thought people should be informed as to what is being seen in the field.
Still feeling very nervous.
The only answer I have back from them so far is as below:
We have reviewed the logs and could see a few IO control commands outstanding causing the array to not respond, the outcome of this is, the array wouldn’t respond to commands in CLI and also in SSMC resulting in crashing the node to access the array again.
Will let you know what is found, once I know myself.
Andrew.