3Par 3.3.1 - Crashed and completed Cold Reboot
-
- Posts: 8
- Joined: Mon Aug 07, 2017 8:18 pm
Hi all,
We had a frightening experience on Tuesday night with our brand new (2 weeks into production) 3PAR 8400 running 3.3.1 MU4: it crashed and then did a cold reboot on all four nodes. The array then ran integrity checks on the 25 TB of data, which was inaccessible for the entire 6-hour process.
HPE have told us that they haven't seen this before in 3.3.1 and have escalated it to their Engineering team to determine what went wrong and, hopefully for all of us, provide a fix to prevent it from happening again.
Logs:
Event ID: 918446 Node 0 Customer Alert - No, Service Alert - Yes
Severity: Critical
Event time: Tue Sep 05 19:46:14 2017
Event type: Configuration Lock Hold Time
Alert ID: null
Msg ID: e001c
Component: System Manager
Event string: lock hold seconds: 0, virtual volume lock count: 1, ioctl request count: 7, mcall active count: 12, mcall request waiting count: 0, mcall request blocked count: 0, mcalls (msec/name/pid): 984589336/MC_NEVER_RETURN/30015 338092/MCKV_OKV_QUERY/40619 550011/MCKV_OKV_QUERY/40535 154533/MCKV_OKV_QUERY/40578 520083/MCKV_OKV_QUERY/40623 429607/MCKV_OKV_QUERY/40625 161511/MCKV_OKV_QUERY/40644 373667/MCKV_OKV_QUERY/40645 245012/MCKV_OKV_QUERY/40651 64181/MCKV_OKV_QUERY/40744 53980/MCVL_REMOVE/40745 42161/MCVL_MAKE/40756.
Notification key: 0x00e001c
Event ID: 919637 Node 0 Customer Alert - No, Service Alert - Yes
Severity: Critical
Event time: Tue Sep 05 20:05:01 2017
Event type: Process Event Handling Appears Unresponsive
Alert ID: null
Msg ID: 3f0003
Component: Node 0
Event string: sysmgr event handling appears to be unresponsive.
Notification key: 0x03f0003
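For anyone trying to read the long event string in the first alert, here is a minimal Python sketch that splits the "mcalls (msec/name/pid)" list into rows and sorts them by elapsed time. It assumes only what the label in the log itself says (each entry is milliseconds/name/pid); the function names parse_mcalls and summarize are made up for this example and are not part of any HPE tool.

```python
import re
from collections import Counter

# Event string copied verbatim from the "Configuration Lock Hold Time"
# event (Event ID 918446) in the post above.
EVENT_STRING = (
    "lock hold seconds: 0, virtual volume lock count: 1, "
    "ioctl request count: 7, mcall active count: 12, "
    "mcall request waiting count: 0, mcall request blocked count: 0, "
    "mcalls (msec/name/pid): "
    "984589336/MC_NEVER_RETURN/30015 338092/MCKV_OKV_QUERY/40619 "
    "550011/MCKV_OKV_QUERY/40535 154533/MCKV_OKV_QUERY/40578 "
    "520083/MCKV_OKV_QUERY/40623 429607/MCKV_OKV_QUERY/40625 "
    "161511/MCKV_OKV_QUERY/40644 373667/MCKV_OKV_QUERY/40645 "
    "245012/MCKV_OKV_QUERY/40651 64181/MCKV_OKV_QUERY/40744 "
    "53980/MCVL_REMOVE/40745 42161/MCVL_MAKE/40756."
)

# One entry looks like "<msec>/<NAME>/<pid>", per the label in the log.
MCALL_RE = re.compile(r"(\d+)/([A-Z_]+)/(\d+)")


def parse_mcalls(event_string):
    """Return a list of (msec, name, pid) tuples from the mcalls list."""
    # Only look at the text after the label, so the counters before it
    # can never be mistaken for entries.
    _, _, tail = event_string.partition("mcalls (msec/name/pid):")
    return [(int(ms), name, int(pid)) for ms, name, pid in MCALL_RE.findall(tail)]


def summarize(calls):
    """Return (count per call name, calls sorted longest-running first)."""
    by_name = Counter(name for _, name, _ in calls)
    longest_first = sorted(calls, reverse=True)
    return by_name, longest_first


if __name__ == "__main__":
    calls = parse_mcalls(EVENT_STRING)
    by_name, longest_first = summarize(calls)
    print(f"{len(calls)} active mcalls: {dict(by_name)}")
    for msec, name, pid in longest_first:
        print(f"{msec / 60000:10.1f} min  {name:<16} pid {pid}")
```

On this event the sketch finds 12 entries, matching "mcall active count: 12". Nine of them are MCKV_OKV_QUERY calls, the oldest at 550011 ms (a little over 9 minutes). The MC_NEVER_RETURN entry at 984589336 ms (roughly 11 days) is probably a permanent call, judging by its name, but that reading is a guess and not something HPE has confirmed here.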
35 minutes later it completed an automatic Cold Reboot.
Just thought people should be informed as to what is being seen in the field.
Still feeling very nervous.
The only answer I have back from them so far is as below:
We have reviewed the logs and could see a few IO control commands outstanding causing the array to not respond, the outcome of this is, the array wouldn’t respond to commands in CLI and also in SSMC resulting in crashing the node to access the array again.
Will let you know what is found, once I know myself.
Andrew.
Re: 3Par 3.3.1 - Crashed and completed Cold Reboot
morrie_morrie wrote:
HPE have told us that they haven't seen this before in 3.3.1 and have escalated it to their Engineering team to determine what went wrong and hopefully for all of us, provide a fix to prevent it from happening again.
That's a stock answer from 3PAR support, you know ... we had the same answer for a full crash on 3.2.2 EMU2 code, and also for an issue with high latency ... everything is a surprise to level 1 support. Get to the level 4 guys and you get the real, honest, non-politically-correct answers (which I much prefer!)
Why are you running 3.3.1 code, by the way? Our local HPE engineer says it isn't GA yet and is only installed for customers who want the latest bells and whistles (and to be guinea pigs). I know I'm not going there on the arrays I look after until it is GA + 6 months and the release notes for fixes stop mentioning "unexpected node restarts" ... oh, and I'd like updatevv to work properly, and having dedupe and compression working would be good too.
-
- Posts: 8
- Joined: Mon Aug 07, 2017 8:18 pm
Re: 3Par 3.3.1 - Crashed and completed Cold Reboot
Hmm.. This is even more concerning.
3.3.1 isn't GA?
I was told it went GA in February.
I'm assuming we're now stuck on 3.3.1
Andrew.
-
- Posts: 142
- Joined: Wed May 07, 2014 10:29 am
Re: 3Par 3.3.1 - Crashed and completed Cold Reboot
You are either running 3.2.2 MU4 or 3.3.1 something.
If it is 3.3.1, which version? That release is only at MU1.
-
- Posts: 8
- Joined: Mon Aug 07, 2017 8:18 pm
Re: 3Par 3.3.1 - Crashed and completed Cold Reboot
Yep 3.3.1 MU1 - P02 and P04.
Am just about to work with HPE to upgrade to P12.
- Richard Siemers
- Site Admin
- Posts: 1333
- Joined: Tue Aug 18, 2009 10:35 pm
- Location: Dallas, Texas
Re: 3Par 3.3.1 - Crashed and completed Cold Reboot
Please keep us posted if this turns out to be a bug and not a localized incident.
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.
-
- Posts: 41
- Joined: Thu Jan 22, 2015 3:37 pm
Re: 3Par 3.3.1 - Crashed and completed Cold Reboot
We just updated one of our 20840s to 3.3.1 MU1 P07. I asked about the P12 referenced above, but the remote engineer said he doesn't know of a P12. Unfortunately I was not able to check the website for which patches are available because it has been down since last night.
Re: 3Par 3.3.1 - Crashed and completed Cold Reboot
mitchellm3 wrote: We just updated one of our 20840s to 3.3.1 MU1 P07. I asked about the P12 referenced above, but the remote engineer said he doesn't know of a P12. Unfortunately I was not able to check the website for which patches are available because it has been down since last night.
Not seen a P12 yet but there is a P11 "Improves SSMC connectivity when LDAP is used".
-
- Posts: 41
- Joined: Thu Jan 22, 2015 3:37 pm
Re: 3Par 3.3.1 - Crashed and completed Cold Reboot
Found this reference to P12. I'm asking my HPE team for updates as this is a concern. We use VEEAM with storage snapshots.
https://forums.veeam.com/veeam-backup-r ... 45526.html
Re: 3Par 3.3.1 - Crashed and completed Cold Reboot
morrie_morrie wrote: Hi all,
35 minutes later it completed an automatic Cold Reboot.
Just thought people should be informed as to what is being seen in the field.
Still feeling very nervous.
Andrew.
Hi,
I have exactly this problem with my StoreServ 8440 running 3.3.1.
What should I do? Can I solve it myself?