HPE Storage Users Group

Posted: **Thu Sep 07, 2017 1:02 am**

Hi all,

We had a frightening experience on Tuesday night with our brand new (2 weeks into production) 3PAR 8400 running 3.3.1 MU4: it crashed and then did a cold reboot on all four nodes. The array then started integrity checking on the 25 TB of data, which was inaccessible for the entire 6 hour process.

HPE have told us that they haven't seen this before in 3.3.1 and have escalated it to their Engineering team to determine what went wrong and, hopefully for all of us, provide a fix to prevent it from happening again.

Logs:

Event ID: 918446 Node 0 Customer Alert - No, Service Alert - Yes
Severity: Critical
Event time: Tue Sep 05 19:46:14 2017
Event type: Configuration Lock Hold Time
Alert ID: null
Msg ID: e001c
Component: System Manager
Event string: lock hold seconds: 0, virtual volume lock count: 1, ioctl request count: 7, mcall active count: 12, mcall request waiting count: 0, mcall request blocked count: 0, mcalls (msec/name/pid): 984589336/MC_NEVER_RETURN/30015 338092/MCKV_OKV_QUERY/40619 550011/MCKV_OKV_QUERY/40535 154533/MCKV_OKV_QUERY/40578 520083/MCKV_OKV_QUERY/40623 429607/MCKV_OKV_QUERY/40625 161511/MCKV_OKV_QUERY/40644 373667/MCKV_OKV_QUERY/40645 245012/MCKV_OKV_QUERY/40651 64181/MCKV_OKV_QUERY/40744 53980/MCVL_REMOVE/40745 42161/MCVL_MAKE/40756.
Notification key: 0x00e001c
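A minimal sketch for reading the event string above: it splits the `mcalls (msec/name/pid)` list into tuples and sorts them by the first field. It assumes, based only on the field label, that the first number is how long each call has been outstanding in milliseconds, and it sets `MC_NEVER_RETURN` aside on the assumption (from its name) that it is not expected to complete. The helper name `parse_mcalls` is made up for this example and is not part of any HPE tool.

```python
import re
from collections import Counter

# Event string copied verbatim from the "Configuration Lock Hold Time" alert.
EVENT_STRING = (
    "lock hold seconds: 0, virtual volume lock count: 1, ioctl request count: 7, "
    "mcall active count: 12, mcall request waiting count: 0, "
    "mcall request blocked count: 0, mcalls (msec/name/pid): "
    "984589336/MC_NEVER_RETURN/30015 338092/MCKV_OKV_QUERY/40619 "
    "550011/MCKV_OKV_QUERY/40535 154533/MCKV_OKV_QUERY/40578 "
    "520083/MCKV_OKV_QUERY/40623 429607/MCKV_OKV_QUERY/40625 "
    "161511/MCKV_OKV_QUERY/40644 373667/MCKV_OKV_QUERY/40645 "
    "245012/MCKV_OKV_QUERY/40651 64181/MCKV_OKV_QUERY/40744 "
    "53980/MCVL_REMOVE/40745 42161/MCVL_MAKE/40756."
)

# One entry looks like "<msec>/<NAME>/<pid>".
MCALL_RE = re.compile(r"(\d+)/([A-Z_]+)/(\d+)")


def parse_mcalls(event_string):
    """Return a list of (msec, name, pid) tuples from the mcalls section."""
    _, _, tail = event_string.partition("mcalls (msec/name/pid):")
    return [(int(ms), name, int(pid)) for ms, name, pid in MCALL_RE.findall(tail)]


calls = parse_mcalls(EVENT_STRING)

# How many calls of each type are active.
by_name = Counter(name for _, name, _ in calls)

# Assumption: MC_NEVER_RETURN is meant to stay outstanding, so leave it out
# when looking for calls that appear stuck.
pending = sorted(
    (c for c in calls if c[1] != "MC_NEVER_RETURN"),
    key=lambda c: c[0],
    reverse=True,
)
longest = pending[0]

for msec, name, pid in pending:
    print(f"{msec / 1000:8.1f} s  {name:<16} pid {pid}")
```

Read this way, nine of the twelve active calls are `MCKV_OKV_QUERY`, and the oldest of them has been outstanding for roughly 550 seconds. Whether that is abnormal is a question for HPE support; the sketch only makes the string easier to read.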

-->
Event ID: 919637 Node 0 Customer Alert - No, Service Alert - Yes
Severity: Critical
Event time: Tue Sep 05 20:05:01 2017
Event type: Process Event Handling Appears Unresponsive
Alert ID: null
Msg ID: 3f0003
Component: Node 0
Event string: sysmgr event handling appears to be unresponsive.
Notification key: 0x03f0003

-->

35 minutes later it completed an automatic Cold Reboot.
Just thought people should be informed as to what is being seen in the field..
Still feeling very nervous.
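The timeline above can be roughed out from the two event timestamps. This is a sketch only: it assumes the "35 minutes later" is counted from the second alert, which the post does not state explicitly, and it assumes an English locale for the day and month names.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the "Event time" fields, e.g. "Tue Sep 05 19:46:14 2017".
FMT = "%a %b %d %H:%M:%S %Y"

lock_alert = datetime.strptime("Tue Sep 05 19:46:14 2017", FMT)
unresponsive_alert = datetime.strptime("Tue Sep 05 20:05:01 2017", FMT)

# Time between the lock hold alert and the "unresponsive" alert.
gap = unresponsive_alert - lock_alert

# Assumption: the reboot came 35 minutes after the second alert.
approx_reboot = unresponsive_alert + timedelta(minutes=35)

print("gap between alerts:", gap)
print("approximate reboot time:", approx_reboot.strftime("%H:%M:%S"))
```

Under those assumptions, the array went just under 19 minutes between the first critical alert and the unresponsive sysmgr alert, and close to an hour passed between the first alert and the reboot.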

The only answer I have back from them so far is as below:

We have reviewed the logs and could see a few IO control commands outstanding causing the array to not respond, the outcome of this is, the array wouldn't respond to commands in CLI and also in SSMC resulting in crashing the node to access the array again.

Will let you know what is found, once I know myself.

Andrew.

Posted: **Thu Sep 07, 2017 6:16 am**

morrie_morrie wrote:
HPE have told us that they haven't seen this before in 3.3.1 and have escalated it to their Engineering team to determine what went wrong and, hopefully for all of us, provide a fix to prevent it from happening again.

That's a stock answer with 3PAR support you know ... we had that same answer for a full crash on 3.2.2 EMU2 code, and also for an issue with high latency ... everything is a surprise to level 1 support .. get to the level 4 guys and you get the real honest, non-politically correct answers (which I much prefer!)

Why are you running 3.3.1 code by the way ? Our local HPE engineer says it isn't GA yet and only installed to customers who want the latest bells and whistles (and to be guinea pigs) .. I know I'm not going there on the arrays I look after until it is GA + 6 months + the release notes for fixes need to stop mentioning "unexpected node restarts" ... oh and I'd like updatevv to work properly as well as dedupe and compression working would be good too ;-)

Posted: **Thu Sep 07, 2017 11:34 pm**

Hmm.. This is even more concerning.

3.3.1 isn't GA?
I was told it went GA in February.

I'm assuming we're now stuck on 3.3.1

Andrew.

Posted: **Fri Sep 08, 2017 1:30 am**

You are either running 3.2.2 MU4 or 3.3.1 something

If it is 3.3.1, what version? That release is only at MU1

Posted: **Fri Sep 08, 2017 8:36 pm**

Yep 3.3.1 MU1 - P02 and P04.
Am just about to work with HPE to upgrade to P12.

Posted: **Sat Sep 09, 2017 9:16 pm**

Please keep us posted if this turns out to be a bug and not a localized incident.

Posted: **Wed Sep 13, 2017 10:58 am**

We just updated one of our 20840's to 3.3.1 mu1 P07. I asked about the P12 referenced above but the remote engineer said he doesn't know of a P12. Unfortunately I was not able to look on the website on what patches are available because it has been down since last night.

Posted: **Thu Sep 14, 2017 5:21 am**

mitchellm3 wrote:
We just updated one of our 20840's to 3.3.1 mu1 P07. I asked about the P12 referenced above but the remote engineer said he doesn't know of a P12. Unfortunately I was not able to look on the website on what patches are available because it has been down since last night.

Not seen a P12 yet but there is a P11 "Improves SSMC connectivity when LDAP is used".

Posted: **Sun Sep 17, 2017 7:49 pm**

Found this reference to P12. I'm asking my HPE team for updates as this is a concern. We use VEEAM with storage snapshots.

https://forums.veeam.com/veeam-backup-r ... 45526.html

Posted: **Mon Apr 12, 2021 12:22 pm**

morrie_morrie wrote:
Hi all,

35 minutes later it completed an automatic Cold Reboot.
Just thought people should be informed as to what is being seen in the field..
Still feeling very nervous.

Andrew.

Hi,
I have exactly this problem with my StoreServ 8440 running 3.3.1.
What should I do? Can I solve it myself?

3Par 3.3.1 - Crashed and completed Cold Reboot
