RC in "Failed" status after network outage

keenerb
Posts: 11
Joined: Thu Jun 13, 2019 7:12 pm

RC in "Failed" status after network outage

Post by keenerb »

An existing periodic remote copy setup that's worked flawlessly for several years has suddenly stopped working after a lengthy network outage about two weeks ago. I have been on vacation and returned to a real mess. To make matters worse, my SAN administrator had already turned in his notice and is long gone now.

Both targets are marked as "FAILED" under Remote Copy Configuration/Targets.

All RCIP ports report READY, and can ping each other with no issues, but the LINKS are "Down" status.

What other information can I provide to help here? My only option for the remote copy groups is "Failover remote copy groups."
MammaGutt
Posts: 1578
Joined: Mon Sep 21, 2015 2:11 pm
Location: Europe

Re: RC in "Failed" status after network outage

Post by MammaGutt »

I don't think RCIP ports should respond to ping. If they do, then I think you might have an IP conflict causing you trouble.

edit: I suggest looking at the checkrclink command in CLI to check the status... Or maybe log a call with HPE.
The views and opinions expressed are my own and do not necessarily reflect those of my current or previous employers.
keenerb
Posts: 11
Joined: Thu Jun 13, 2019 7:12 pm

Re: RC in "Failed" status after network outage

Post by keenerb »

Code: Select all

Running Client Side
Running link test on:  0:3:1
Test length (secs):    10
Destination Addr:      X.X.210.X
Local IP Addr:           X.X.121.X
Local Device name:       eth1

------------------------------------------------------------
Measuring link latency
------------------------------------------------------------

Average measured latency: 12.575 ms
Pings Lost:               0 %

------------------------------------------------------------
Starting max MTU test, from 0:3:1 -> X.X.210.X
------------------------------------------------------------
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500

MTU: 1500

------------------------------------------------------------
Starting throughput test, from 0:3:1 -> X.X.210.X
------------------------------------------------------------
------------------------------------------------------------
Client connecting to X.X.210.X, TCP port 5001
TCP window size: 4096 KByte (WARNING: requested 2048 KByte)
------------------------------------------------------------
[ 12] local X.X.121.X port 45132 connected with X.X.210.X port 5001
[  6] local X.X.121.X port 45126 connected with X.X.210.X port 5001
[  9] local X.X.121.X port 45129 connected with X.X.210.X port 5001
[  8] local X.X.121.X port 45128 connected with X.X.210.X port 5001
[  4] local X.X.121.X port 45125 connected with X.X.210.X port 5001
[  5] local X.X.121.X port 45124 connected with X.X.210.X port 5001
[  3] local X.X.121.X port 45123 connected with X.X.210.X port 5001
[ 11] local X.X.121.X port 45131 connected with X.X.210.X port 5001
[  7] local X.X.121.X port 45127 connected with X.X.210.X port 5001
[ 10] local X.X.121.X port 45130 connected with X.X.210.X port 5001


All the ports report virtually identical output from checkrclink.

It appears to "hang" at this point and never returns to the CLI. Not sure if that's normal or not.

HPE support is sadly not an option; executive leadership decided that we have no need for HP support on hardware that will be replaced in a few months. I mean, why would we need support for our production SAN and DR environment? That's crazy talk.
MammaGutt
Posts: 1578
Joined: Mon Sep 21, 2015 2:11 pm
Location: Europe

Re: RC in "Failed" status after network outage

Post by MammaGutt »

For the "message too long" you need to reduce MTU. Try 1450 and increase one by one until it stops working. Not sure if that is your only problem.
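The "start at 1450 and step up" search can be scripted from a Linux host on the same subnet as the RCIP ports. This is only a rough sketch: the function names are made up for illustration, and the probe assumes Linux iputils ping (`-M do` sets don't-fragment), not anything that runs on the array itself.

```python
import subprocess
from typing import Callable, Optional


def largest_working_mtu(probe: Callable[[int], bool],
                        start: int = 1450,
                        ceiling: int = 1500) -> Optional[int]:
    """Step the MTU up one at a time from `start`.

    Returns the last size that worked, or None if even `start` fails
    (in which case retry with a lower starting value).
    """
    if not probe(start):
        return None
    mtu = start
    while mtu < ceiling and probe(mtu + 1):
        mtu += 1
    return mtu


def ping_probe(host: str) -> Callable[[int], bool]:
    """Hypothetical probe: one don't-fragment ping sized for a given MTU."""
    def probe(mtu: int) -> bool:
        payload = mtu - 28  # 20-byte IPv4 header + 8-byte ICMP header
        result = subprocess.run(
            ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    return probe
```

Usage would be something like `largest_working_mtu(ping_probe("<remote RCIP address>"))`. If it comes back lower than what the RCIP ports are configured for, that is worth chasing with the network team.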
jbguy
Posts: 70
Joined: Thu Nov 30, 2017 11:20 am
Location: WI

Re: RC in "Failed" status after network outage

Post by jbguy »

I'm going to throw my hat in the ring here as well and say network issue. The only time I have seen it fail like that is when it couldn't reach the other side. We use periodic remote copy as well.

Can your network team help you to see if the traffic is reaching the other side? What caused the outage? Maybe a switch they are connected to lost a config? Did someone forget to do a write mem?
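Keep in mind that a successful ping only proves ICMP gets through. The checkrclink throughput test in the listing above connects to TCP port 5001, so a firewall or ACL change during the outage could block TCP while ping still works. A quick way to check a TCP port from a host on the same subnet is sketched below. The function name is made up, and running it from a separate host (rather than the array) is an assumption about your environment. Port 5001 will only answer while checkrclink startserver is running on the far side.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

If ping works but `tcp_reachable("<remote RCIP address>", 5001)` returns False while the server side of the test is running, something in the path is most likely dropping TCP.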

I'm a total newb on this (only been a 3PAR admin for about a year and a half), but here are a couple of commands that may give you more info. They may not help, but I wasn't sure if you knew them.

Show remote copy links
showrctransport -rcip

Show all targets or links for the remote copy groups
showrcopy -d targets
showrcopy -d links

Start and stop Rcopy from the command line

Get list of Rcopy groups
showrcopy

Stop Rcopy groups
stoprcopygroup <groupname>

Start Rcopy groups
startrcopygroup <groupname>


Report back if you figure it out.
keenerb
Posts: 11
Joined: Thu Jun 13, 2019 7:12 pm

Re: RC in "Failed" status after network outage

Post by keenerb »

RCIP pings are succeeding.

I've run "stoprcopy" and "startrcopy" several times on both ends.

showrctransport -rcip reports "State" as "Missing" on all four ports (two local, two remote). Configuration looks good otherwise.

checkrclink freezes as shown in the previous listing when the server is in production and I test from DR.

When running startserver in DR, I get the following at the end, after the normal MTU check and whatnot:

Code: Select all

------------------------------------------------------------
Starting throughput test, from 0:3:1 -> x.x.121.x
------------------------------------------------------------

Could not connect with server.
Please ensure server is running.


============================================================
TEST SUMMARY from 0:3:1 -> x.x.121.x
Test Started:     Mon Jun 17 13:44:04 EDT 2019
Test Finisshed:   Mon Jun 17 13:45:09 EDT 2019
============================================================

Latency:                  12.058 ms
Lost pings:                    0 %
Through-put:                   0 Bits/second
Max MTU:                    1500
Tx TCP Segs:                 688
Rx TCP Segs:                 647
TCP retrans:                   8 %
Errored Segs:                  0 %


Check remote server is running.


Link 0:3:1 is NOT SUITABLE for Remote Copy Use

============================================================

khasck
Posts: 1
Joined: Wed Jul 03, 2019 6:06 pm

Re: RC in "Failed" status after network outage

Post by khasck »

We're seeing an almost identical situation here. We were testing a fiber failover, and after we finished testing and put everything back the way it was, we saw exactly the same situation as you: links in "down" status even though all the ports are pingable and up.

Same thing -- DR-side shows links up, Prod shows links down.

Code: Select all

PRODSAN1 cli% checkrclink startclient 0:9:1 x.x.110.131 60
Running Client Side
Running link test on:  0:9:1
Test length (secs):    60
Destination Addr:      x.x.110.131
Local IP Addr:           x.x.110.41
Local Device name:       eth1

------------------------------------------------------------
Measuring link latency
------------------------------------------------------------

Average measured latency: 16.761 ms
Pings Lost:               3 %

------------------------------------------------------------
Starting max MTU test, from 0:9:1 -> x.x.110.131
------------------------------------------------------------
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500

MTU: 1500

------------------------------------------------------------
Starting throughput test, from 0:9:1 -> x.x.110.131
------------------------------------------------------------

Could not connect with server.
Please ensure server is running.


============================================================
TEST SUMMARY from 0:9:1 -> x.x.110.131
Test Started:     Wed Jul  3 19:02:31 EDT 2019
Test Finished:   Wed Jul  3 19:02:37 EDT 2019
============================================================

Latency:                  16.761 ms
Lost pings:                    3 %
Through-put:                   0 Bits/second
Max MTU:                    1500
Tx TCP Segs:                 806
Rx TCP Segs:                 758
TCP retrans:                  16 %
Errored Segs:                  0 %


Check remote server is running.


Link 0:9:1 is NOT SUITABLE for Remote Copy Use

============================================================
keenerb
Posts: 11
Joined: Thu Jun 13, 2019 7:12 pm

Re: RC in "Failed" status after network outage

Post by keenerb »

No resolution yet. I've got some professional assistance from a vendor scheduled for early next week; leadership is still unwilling to spring for actual HP support.
keenerb
Posts: 11
Joined: Thu Jun 13, 2019 7:12 pm

Re: RC in "Failed" status after network outage

Post by keenerb »

Professional services assistance never happened, but this article at least allowed me to clean up the mess. It didn't fix the connectivity issue, but it let me un-replicate the volumes and clean up snapshots and whatnot.

https://community.hpe.com/t5/3PAR-Store ... crmidV7kuU

Specifically,

cli% setrcopytarget no_mirror_config <target array name>

is what let me clean up the leftover RC pieces.