T400 CRC errors on shelf/node

Posted: Wed Apr 16, 2014 11:25 am
by fsprout
On our T400 (3.1.2 MU1, P04), we've been getting alerts from the SP indicating CRC errors on a single node port.

Thinking hardware faults, we've:
Replaced the shelf transceiver -- no change in CRC
Replaced the node transceiver -- no change in CRC
Cleaned the fiber -- no change in CRC
Replaced the fiber -- no change in CRC

We are getting these alerts 2-3 times a day. Once, it took down the port and then it self-resolved after the CRC count dropped.

Thinking that there were disks generating errors, we looked at the error count on the shelf disks -- while there are some, other shelves have more disk errors and we are not seeing CRC errors from those node ports.

So... looking further, I came across:

% checkhealth -detail node
Checking node
Component ----------------Description----------------- Qty
Node PCI card model differs for slot in node pair 4

Component -Identifier- ----------------------Description----------------------
Node node:0 PCI card in Slot:1 is empty, but is not empty in Node:1
Node node:1 PCI card in Slot:1 is empty, but is not empty in Node:0
Node node:2 PCI card in Slot:1 is empty, but is not empty in Node:3
Node node:3 PCI card in Slot:1 is empty, but is not empty in Node:2

Huh?

Any ideas on 1) resolving the CRC error and 2) what's up with the node checkhealth report?

I'm digging through our system reporter to see if any disks/VVs are getting hit hard, but nothing really jumps out at the moment.

TIA
Frostie

Re: T400 CRC errors on shelf/node

Posted: Wed Apr 16, 2014 1:27 pm
by fsprout
Oh, the "HP 3PAR: Handling CRC errors (Port intermittent)" document on HP's site is blocked.

https://h20565.www2.hp.com/portal/site/ ... id=5044394

Stupid new lockouts -- HP acts more like Cisco every day.

I'm assuming it's telling me to apply P23 or P24, but would be nice to verify.

Frostie

Re: T400 CRC errors on shelf/node

Posted: Wed Apr 16, 2014 3:31 pm
by Richard Siemers
% checkhealth -detail node
Checking node
Component ----------------Description----------------- Qty
Node PCI card model differs for slot in node pair 4

Component -Identifier- ----------------------Description----------------------
Node node:0 PCI card in Slot:1 is empty, but is not empty in Node:1
Node node:1 PCI card in Slot:1 is empty, but is not empty in Node:0
Node node:2 PCI card in Slot:1 is empty, but is not empty in Node:3
Node node:3 PCI card in Slot:1 is empty, but is not empty in Node:2


I believe this is a new(ish) check for the port persistence feature. For port persistence to work, I believe the cards have to line up between the two nodes of each node pair.

I have chased CRC errors like that before. In my case they were resolved by replacing both FCALs on the shelf (the boards the SFPs plug into).

Re: T400 CRC errors on shelf/node

Posted: Wed Apr 16, 2014 3:37 pm
by Richard Siemers
P.S. If you have system reporter, watch your Hi-Res-->Port Perf-->Port Type:Disk-->Compare by:N:S:P

Look for your port with CRC errors to see if it's causing higher latency.

Here's a link; just change the hostname to your System Reporter server name.

http://YOUR-HOST-NAME/cgi-bin/3par-rpts/inserv_perf.exe?reptype=vstime&compare=n%3As%3Ap&maxgraphs=16&comparesel=total_svctms&refresh=&begintsecs=&endtsecs=&selporttype=disk&selrate=--All+Port+Rates--&selnsp=&charttab=chart&chartlib=gdgraph&charttype=lines&graphx=1000&graphy=400&timeform=Full&graphlegpos=&report=port_perf_time&category=hires
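
If you need the same report for more than one System Reporter server, the link can be rebuilt from its query parameters instead of hand-editing the hostname. This is only a minimal sketch: the parameter names and values are copied from the URL above, while the function name `port_perf_url` and the example hostname are made up for illustration.

```python
from urllib.parse import urlencode


def port_perf_url(host: str) -> str:
    """Build the Hi-Res disk-port performance URL for a System Reporter host.

    `port_perf_url` is a hypothetical helper; every parameter below is taken
    verbatim from the link posted above.
    """
    params = [
        ("reptype", "vstime"),
        ("compare", "n:s:p"),            # Compare by N:S:P
        ("maxgraphs", "16"),
        ("comparesel", "total_svctms"),  # plot total service time
        ("refresh", ""),
        ("begintsecs", ""),
        ("endtsecs", ""),
        ("selporttype", "disk"),         # Port Type: Disk
        ("selrate", "--All Port Rates--"),
        ("selnsp", ""),
        ("charttab", "chart"),
        ("chartlib", "gdgraph"),
        ("charttype", "lines"),
        ("graphx", "1000"),
        ("graphy", "400"),
        ("timeform", "Full"),
        ("graphlegpos", ""),
        ("report", "port_perf_time"),
        ("category", "hires"),
    ]
    # urlencode percent-encodes ':' and turns spaces into '+', matching the link.
    return f"http://{host}/cgi-bin/3par-rpts/inserv_perf.exe?" + urlencode(params)


if __name__ == "__main__":
    # "sr-server.example.com" is a placeholder hostname.
    print(port_perf_url("sr-server.example.com"))
```

Keeping the parameters in a list preserves their order, so the generated link is character-for-character the same as the one above apart from the hostname.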

Re: T400 CRC errors on shelf/node

Posted: Wed Apr 16, 2014 3:59 pm
by fsprout
I'll watch that in SR for latency. Yes, I've considered changing out the FCAL, but haven't done that particular hardware change yet.

As for the cards on the nodes -- they haven't changed and there is a proper match between each node, but I just realized that the slot 1 cards are eth ports, not FC... I might have to do a bit of moving my iSCSI connections around to bring up the same port on each eth card. So much for that being part of the CRC port errors.

Thanks!

Frostie

Re: T400 CRC errors on shelf/node

Posted: Wed Apr 16, 2014 4:33 pm
by fsprout
Instead of replacing the FCAL, I just reseated it. I'll watch to see if that resolves the CRC problem. Maybe/maybe not, but a replacement is rather expensive as a refurb and a reseat is quick to try.

As for the iSCSI cards being reported as different, definitely not the FC CRC issue, but they all match so I'm not certain why the node is bringing up those 'errors'. It isn't related, so I'm ignoring it for a later date. Don't do much iSCSI anyway.

Frostie

Re: T400 CRC errors on shelf/node

Posted: Thu Apr 17, 2014 12:00 pm
by fsprout
And, to follow up, the FCAL reseat did not resolve the CRC errors.

Richard, looks like I'll be replacing the FCAL per your recommendation. I'll just be replacing the one since I don't see any CRCs from the other on the same shelf.

Frostie

Re: T400 CRC errors on shelf/node

Posted: Fri Apr 18, 2014 3:01 pm
by Richard Siemers
Thanks for the update. Please let us know how it goes. I have heard from one other customer that sometimes the CRC errors on a good FCAL can be caused by weirdness on the other, since they share the loop... so if you still see CRC errors on the new FCAL, you may want to use the old FCAL to replace the seemingly innocent one.

Re: T400 CRC errors on shelf/node

Posted: Thu May 15, 2014 4:54 pm
by fsprout
There was no change in the CRC errors on the loop after replacing the FCAL. Even swapped out the other FCAL with no change.

Then, one of the drives in the shelf showed up with 'PD Degraded (prolonged missing B port)' and the CRC errors stopped. Makes sense -- failing drive throwing errors down the loop.

Now, just waiting on the drive to complete the servicemag and some dwell time to ensure that the CRC errors go away.

Frostie

Re: T400 CRC errors on shelf/node

Posted: Fri May 23, 2014 9:49 am
by trireed
Just as a follow-up: I saw the same issue on an E200. You can look at the port in more detail to see which hosts the CRC errors are coming from. This will show you whether it's due to a few hosts or to all hosts on the port; the latter would indicate an issue with the SFP itself. In my case I saw CRC errors on partner ports coming from the same hosts, so I knew they were being generated either by the hosts or by a PD those particular hosts were using. A few weeks later, after getting the errors every day, I had a PD degrade and then fail, and what do you know, the CRCs stopped as well.

Here is the command to check which hosts are seeing the CRCs:

showportlesb single|both <N:S:P>
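
The "few hosts versus all hosts" reasoning above can be written down as a small check. This is a rough sketch only: it assumes you have already copied the per-device CRC counts for one port into a dictionary by hand. The function name `classify_crc_source`, the device names, and the counts are all hypothetical, and nothing here parses real showportlesb output.

```python
def classify_crc_source(crc_by_device: dict[str, int], threshold: int = 0) -> str:
    """Guess where CRC errors on one port originate.

    crc_by_device maps a device name (hypothetical labels, entered by hand)
    to the CRC error count seen for that device on the port.
    """
    if not crc_by_device:
        return "no data"
    # Devices whose CRC count is above the threshold are considered noisy.
    noisy = sorted(dev for dev, count in crc_by_device.items() if count > threshold)
    if not noisy:
        return "no CRC errors"
    if len(noisy) == len(crc_by_device):
        # Every device on the port shows errors: points at the port's SFP.
        return "all devices affected: suspect the SFP on this port"
    # Only some devices show errors: points at those devices instead.
    return "subset affected: suspect " + ", ".join(noisy)


if __name__ == "__main__":
    # Made-up counts for illustration.
    print(classify_crc_source({"dev-a": 12, "dev-b": 0, "dev-c": 0}))
    print(classify_crc_source({"dev-a": 12, "dev-b": 7, "dev-c": 3}))
```

With only one device on the port, the two cases cannot be told apart, so treat the result as a hint rather than a diagnosis.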